1
Lee J, Mo HL, Ha Y, Nam DY, Lim G, Park JW, Park S, Choi WY, Lee HJ, Rhee JK. Unraveling the three-dimensional genome structure using machine learning. BMB Rep 2025; 58:203-208. [PMID: 40058875 PMCID: PMC12123201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 03/07/2024] [Accepted: 09/06/2024] [Indexed: 05/29/2025]
Abstract
The study of chromatin interactions has advanced considerably with technologies such as high-throughput chromosome conformation capture (Hi-C) sequencing, providing a genome-wide view of physical interactions within the nucleus. These techniques have revealed the existence of hierarchical chromatin structures such as compartments, topologically associating domains (TADs), and chromatin loops, which are crucial in genome organization and regulation. However, identifying and analyzing these structural features require advanced computational methods. In recent years, machine learning approaches, particularly deep learning, have emerged as powerful tools for detecting and analyzing structural information. In this review, we present an overview of various machine learning-based techniques for determining chromosomal organization. Starting with the progress in predicting interactions from DNA sequences, we describe methods for identifying various hierarchical structures from Hi-C data. Additionally, we present advances in enhancing the chromosome contact frequency map resolution to overcome the limitations of Hi-C data. Finally, we identify the remaining challenges and propose potential solutions and future directions. [BMB Reports 2025; 58(5): 203-208].
Affiliation(s)
- Jiho Lee
- School of Systems Biomedical Science, Soongsil University, Seoul 06978, Korea
- Hye-Lim Mo
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Yoon Ha
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Dong Yeon Nam
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Geumnim Lim
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Jeong-Woon Park
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Seoyoung Park
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Woo-Young Choi
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Hyun Ji Lee
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
- Je-Keun Rhee
- School of Systems Biomedical Science, Soongsil University, Seoul 06978, Korea
- Department of Bioinformatics & Life Science, Soongsil University, Seoul 06978, Korea
2
Zeng Y, You Z, Guo J, Zhao J, Zhou Y, Huang J, Lyu X, Chen L, Li Q. Chrombus-XMBD: a graph convolution model predicting 3D-genome from chromatin features. Brief Bioinform 2025; 26:bbaf183. [PMID: 40315432 PMCID: PMC12047703 DOI: 10.1093/bib/bbaf183] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/16/2024] [Revised: 03/11/2025] [Accepted: 03/26/2025] [Indexed: 05/04/2025]
Abstract
The 3D conformation of chromatin is crucial for transcriptional regulation. However, current experimental techniques for detecting the 3D structure of the genome are costly and limited to specific biological conditions. Here, we describe "Chrombus-XMBD," a graph convolution model capable of predicting chromatin interactions ab initio based on available chromatin features. Using dynamic edge convolution with a multihead attention mechanism, Chrombus encodes the 2D chromatin features into a learnable embedding space, thereby generating a genome-wide 3D contact map. In validation, Chrombus effectively recapitulated topologically associating domains, expression quantitative trait loci, and promoter/enhancer interactions. Notably, Chrombus outperforms existing algorithms in predicting chromatin interactions over 1-2 Mb, increasing prediction correlation by 11.8%-48.7%, and predicts long-range interactions over 2 Mb (Pearson's coefficient 0.243-0.582). Chrombus also exhibits strong generalizability across human and mouse-derived cell lines. Additionally, the parameters of Chrombus inform the biological mechanisms underlying the cistrome. Our model provides a new, generalizable analytical tool for understanding the complex dynamics of chromatin interactions and the landscape of cis-regulation of gene expression.
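The abstract reports prediction quality as Pearson correlation within genomic-distance ranges (1-2 Mb, and beyond 2 Mb). A minimal sketch of how such a distance-stratified correlation could be computed from an observed and a predicted contact matrix is given below. The function names, the 500 kb bin size, and the toy matrices are assumptions made for illustration; this is not the authors' evaluation code.

```python
import math

def pearson(xs, ys):
    """Plain Pearson correlation; returns NaN when it is undefined."""
    n = len(xs)
    if n < 2:
        return float("nan")
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return float("nan")
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def stratified_pearson(observed, predicted, bin_size_bp, min_bp, max_bp):
    """Correlate upper-triangle entries whose separation lies in [min_bp, max_bp)."""
    n = len(observed)
    obs, pred = [], []
    for i in range(n):
        for j in range(i + 1, n):
            distance = (j - i) * bin_size_bp
            if min_bp <= distance < max_bp:
                obs.append(observed[i][j])
                pred.append(predicted[i][j])
    return pearson(obs, pred)

# Toy example (hypothetical data): 6 bins of 500 kb each.
observed = [[1.0 / (1 + abs(i - j)) + 0.1 * i for j in range(6)] for i in range(6)]
predicted = [[2 * v + 1 for v in row] for row in observed]  # perfectly linear in observed
r_1_2mb = stratified_pearson(observed, predicted, 500_000, 1_000_000, 2_000_000)
```

Restricting the comparison to one distance band at a time keeps the strong distance decay of contact frequency from dominating the correlation, which is presumably why long-range performance is reported separately.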
Affiliation(s)
- Yuanyuan Zeng
- Department of Hematology, The First Affiliated Hospital of Xiamen University and Institute of Hematology, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Zhiyu You
- National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Jiayang Guo
- National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Jialin Zhao
- National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Ying Zhou
- National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Jialiang Huang
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, Xiamen, Fujian 361102, China
- Xiaowen Lyu
- State Key Laboratory of Cellular Stress Biology, Fujian Provincial Key Laboratory of Reproductive Health Research, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- Longbiao Chen
- Fujian Key Laboratory of Sensing and Computing for Smart Cities (SCSC), School of Informatics, Xiamen University, Xiamen, Fujian 361102, China
- Qiyuan Li
- Department of Hematology, The First Affiliated Hospital of Xiamen University and Institute of Hematology, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
- National Institute for Data Science in Health and Medicine, School of Medicine, Xiamen University, Xiamen, Fujian 361102, China
3
Kumar Halder A, Agarwal A, Jodkowska K, Plewczynski D. A systematic analyses of different bioinformatics pipelines for genomic data and its impact on deep learning models for chromatin loop prediction. Brief Funct Genomics 2024; 23:538-548. [PMID: 38555493 DOI: 10.1093/bfgp/elae009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 02/07/2024] [Accepted: 03/04/2024] [Indexed: 04/02/2024]
Abstract
Genomic data analysis has witnessed a surge in complexity and volume, primarily driven by the advent of high-throughput technologies. In particular, studying chromatin loops and structures has become pivotal in understanding gene regulation and genome organization. This systematic investigation explores the realm of specialized bioinformatics pipelines designed specifically for the analysis of chromatin loops and structures. Our investigation incorporates two protein (CTCF and Cohesin) factor-specific loop interaction datasets from six distinct pipelines, amassing a comprehensive collection of 36 diverse datasets. Through a meticulous review of existing literature, we offer a holistic perspective on the methodologies, tools and algorithms underpinning the analysis of this multifaceted genomic feature. We illuminate the vast array of approaches deployed, encompassing pivotal aspects such as data preparation pipeline, preprocessing, statistical features and modelling techniques. Beyond this, we rigorously assess the strengths and limitations inherent in these bioinformatics pipelines, shedding light on the interplay between data quality and the performance of deep learning models, ultimately advancing our comprehension of genomic intricacies.
Affiliation(s)
- Anup Kumar Halder
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Abhishek Agarwal
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Karolina Jodkowska
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
- Dariusz Plewczynski
- Laboratory of Bioinformatics and Computational Genomics, Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
- Laboratory of Functional and Structural Genomics, Centre of New Technologies, University of Warsaw, Banacha 2c, 02-097 Warsaw, Poland
4
Woo BJ, Moussavi-Baygi R, Karner H, Karimzadeh M, Yousefi H, Lee S, Garcia K, Joshi T, Yin K, Navickas A, Gilbert LA, Wang B, Asgharian H, Feng FY, Goodarzi H. Integrative identification of non-coding regulatory regions driving metastatic prostate cancer. Cell Rep 2024; 43:114764. [PMID: 39276353 PMCID: PMC11466230 DOI: 10.1016/j.celrep.2024.114764] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 06/13/2023] [Revised: 07/08/2024] [Accepted: 08/29/2024] [Indexed: 09/17/2024]
Abstract
Large-scale sequencing efforts have been undertaken to understand the mutational landscape of the coding genome. However, the vast majority of variants occur within non-coding genomic regions. We designed an integrative computational and experimental framework to identify recurrently mutated non-coding regulatory regions that drive tumor progression. Applying this framework to sequencing data from a large prostate cancer patient cohort revealed a large set of candidate drivers. We used (1) in silico analyses, (2) massively parallel reporter assays, and (3) in vivo CRISPR interference screens to systematically validate metastatic castration-resistant prostate cancer (mCRPC) drivers. One identified enhancer region, GH22I030351, acts on a bidirectional promoter to simultaneously modulate expression of the U2-associated splicing factor SF3A1 and chromosomal protein CCDC157. SF3A1 and CCDC157 promote tumor growth in vivo. We nominated a number of transcription factors, notably SOX6, to regulate expression of SF3A1 and CCDC157. Our integrative approach enables the systematic detection of non-coding regulatory regions that drive human cancers.
Affiliation(s)
- Brian J Woo
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
- Ruhollah Moussavi-Baygi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
- Heather Karner
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
- Mehran Karimzadeh
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Vector Institute, Toronto, ON, Canada; Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada; Arc Institute, Palo Alto, CA 94305, USA
- Hassan Yousefi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
- Sean Lee
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
- Kristle Garcia
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
- Tanvi Joshi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
- Keyi Yin
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
- Albertas Navickas
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA
- Luke A Gilbert
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA
- Bo Wang
- Vector Institute, Toronto, ON, Canada; Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada
- Hosseinali Asgharian
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
- Felix Y Feng
- Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Department of Radiation Oncology, University of California, San Francisco, San Francisco, CA, USA.
- Hani Goodarzi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, CA, USA; Department of Urology, University of California, San Francisco, San Francisco, CA, USA; Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, CA, USA; Arc Institute, Palo Alto, CA 94305, USA; Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA.
5
Schlegel L, Bhardwaj R, Shahryary Y, Demirtürk D, Marand A, Schmitz R, Johannes F. GenomicLinks: deep learning predictions of 3D chromatin interactions in the maize genome. NAR Genom Bioinform 2024; 6:lqae123. [PMID: 39318505 PMCID: PMC11420838 DOI: 10.1093/nargab/lqae123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/24/2024] [Revised: 07/25/2024] [Accepted: 08/30/2024] [Indexed: 09/26/2024]
Abstract
Gene regulation in eukaryotes is partly shaped by the 3D organization of chromatin within the cell nucleus. Distal interactions between cis-regulatory elements and their target genes are widespread, and many causal loci underlying heritable agricultural traits have been mapped to distal non-coding elements. The biology underlying chromatin loop formation in plants is poorly understood. Dissecting the sequence features that mediate distal interactions is an important step toward identifying putative molecular mechanisms. Here, we trained GenomicLinks, a deep learning model, to identify DNA sequence features predictive of 3D chromatin interactions in maize. We found that the presence of binding motifs of specific transcription factor classes, especially bHLH, is predictive of chromatin interaction specificities. Using an in silico mutagenesis approach we show the removal of these motifs from loop anchors leads to reduced interaction probabilities. We were able to validate these predictions with single-cell co-accessibility data from different maize genotypes that harbor natural substitutions in these TF binding motifs. GenomicLinks is currently implemented as an open-source web tool, which should facilitate its wider use in the plant research community.
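The in silico mutagenesis step described in this abstract (remove a motif from the loop anchors, then compare predicted interaction probabilities) can be sketched roughly as follows. The scoring function is a hypothetical stand-in for a trained model: it simply counts occurrences of the E-box sequence CACGTG, chosen here only as an illustrative bHLH-type motif. The function names and the logistic offset are likewise assumptions, not part of GenomicLinks.

```python
import math

EBOX = "CACGTG"  # illustrative bHLH-type motif (assumption, not taken from the paper)

def toy_interaction_score(anchor_a, anchor_b, motif=EBOX):
    """Hypothetical stand-in for a trained model: logistic of the motif count."""
    count = anchor_a.count(motif) + anchor_b.count(motif)
    return 1.0 / (1.0 + math.exp(-(count - 1.5)))

def ablate_motif(seq, motif=EBOX, filler="N"):
    """Mask every motif occurrence while keeping the sequence length unchanged."""
    return seq.replace(motif, filler * len(motif))

def mutagenesis_effect(anchor_a, anchor_b, score_fn=toy_interaction_score, motif=EBOX):
    """Return (score before, score after, change) for motif ablation in both anchors."""
    before = score_fn(anchor_a, anchor_b)
    after = score_fn(ablate_motif(anchor_a, motif), ablate_motif(anchor_b, motif))
    return before, after, after - before

# Toy anchors (hypothetical sequences), each carrying one motif copy.
before, after, delta = mutagenesis_effect("TTCACGTGAA", "GGCACGTGCC")
```

With a real model, `score_fn` would be replaced by the model's forward pass on encoded sequences; a negative `delta` would correspond to the reduced interaction probability the abstract reports.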
Affiliation(s)
- Luca Schlegel
- TUM School of Life Sciences, Plant Epigenomics, Technical University of Munich, Freising, 85354, Germany
- Rohan Bhardwaj
- TUM School of Life Sciences, Plant Epigenomics, Technical University of Munich, Freising, 85354, Germany
- Yadollah Shahryary
- TUM School of Life Sciences, Plant Epigenomics, Technical University of Munich, Freising, 85354, Germany
- Defne Demirtürk
- TUM School of Life Sciences, Plant Epigenomics, Technical University of Munich, Freising, 85354, Germany
- Alexandre P Marand
- Department of Molecular, Cellular, and Developmental Biology, University of Michigan, Ann Arbor, MI 48109, USA
- Robert J Schmitz
- Department of Genetics, University of Georgia, Athens, GA 30602, USA
- Frank Johannes
- TUM School of Life Sciences, Plant Epigenomics, Technical University of Munich, Freising, 85354, Germany
6
Sokolova K, Chen KM, Hao Y, Zhou J, Troyanskaya OG. Deep Learning Sequence Models for Transcriptional Regulation. Annu Rev Genomics Hum Genet 2024; 25:105-122. [PMID: 38594933 DOI: 10.1146/annurev-genom-021623-024727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/11/2024]
Abstract
Deciphering the regulatory code of gene expression and interpreting the transcriptional effects of genome variation are critical challenges in human genetics. Modern experimental technologies have resulted in an abundance of data, enabling the development of sequence-based deep learning models that link patterns embedded in DNA to the biochemical and regulatory properties contributing to transcriptional regulation, including modeling epigenetic marks, 3D genome organization, and gene expression, with tissue and cell-type specificity. Such methods can predict the functional consequences of any noncoding variant in the human genome, even rare or never-before-observed variants, and systematically characterize their consequences beyond what is tractable from experiments or quantitative genetics studies alone. Recently, the development and application of interpretability approaches have led to the identification of key sequence patterns contributing to the predicted tasks, providing insights into the underlying biological mechanisms learned and revealing opportunities for improvement in future models.
Affiliation(s)
- Ksenia Sokolova
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Kathleen M Chen
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
- Yun Hao
- Flatiron Institute, Simons Foundation, New York, NY, USA
- Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Olga G Troyanskaya
- Princeton Precision Health, Princeton University, Princeton, New Jersey, USA
- Flatiron Institute, Simons Foundation, New York, NY, USA
- Department of Computer Science and Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, USA
7
Alagarswamy K, Shi W, Boini A, Messaoudi N, Grasso V, Cattabiani T, Turner B, Croner R, Kahlert UD, Gumbs A. Should AI-Powered Whole-Genome Sequencing Be Used Routinely for Personalized Decision Support in Surgical Oncology—A Scoping Review. BioMedInformatics 2024; 4:1757-1772. [DOI: 10.3390/biomedinformatics4030096] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 01/12/2025]
Abstract
In this scoping review, we delve into the transformative potential of artificial intelligence (AI) in addressing challenges inherent in whole-genome sequencing (WGS) analysis, with a specific focus on its implications in oncology. Unveiling the limitations of existing sequencing technologies, the review illuminates how AI-powered methods emerge as innovative solutions to surmount these obstacles. The evolution of DNA sequencing technologies, progressing from Sanger sequencing to next-generation sequencing, sets the backdrop for AI’s emergence as a potent ally in processing and analyzing the voluminous genomic data generated. Particularly, deep learning methods play a pivotal role in extracting knowledge and discerning patterns from the vast landscape of genomic information. In the context of oncology, AI-powered methods exhibit considerable potential across diverse facets of WGS analysis, including variant calling, structural variation identification, and pharmacogenomic analysis. This review underscores the significance of multimodal approaches in diagnoses and therapies, highlighting the importance of ongoing research and development in AI-powered WGS techniques. Integrating AI into the analytical framework empowers scientists and clinicians to unravel the intricate interplay of genomics within the realm of multi-omics research, paving the way for more successful personalized and targeted treatments.
Affiliation(s)
- Wenjie Shi
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Aishwarya Boini
- Davao Medical School Foundation, Davao City 8000, Philippines
- Nouredin Messaoudi
- Department of Hepatopancreatobiliary Surgery, Vrije Universiteit Brussel (VUB), Universitair Ziekenhuis Brussel (UZ Brussel), Europe Hospitals, 1090 Brussels, Belgium
- Vincent Grasso
- Department of Electrical and Computer Engineering, University of New Mexico, Albuquerque, NM 87131, USA
- Roland Croner
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Ulf D. Kahlert
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Andrew Gumbs
- Department of General-, Visceral-, Vascular and Transplantation Surgery, University of Magdeburg, Haus 60a, Leipziger Str. 44, 39120 Magdeburg, Germany
- Talos Surgical, Inc., New Castle, DE 19720, USA
- Department of Surgery, American Hospital of Tbilisi, 0102 Tbilisi, Georgia
8
Shen J, Wang Y, Luo J. CD-Loop: a chromatin loop detection method based on the diffusion model. Front Genet 2024; 15:1393406. [PMID: 38770419 PMCID: PMC11102972 DOI: 10.3389/fgene.2024.1393406] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/29/2024] [Accepted: 04/11/2024] [Indexed: 05/22/2024]
Abstract
Motivation: In recent years, there have been significant advances in various chromatin conformation capture techniques, and annotating the topological structure from Hi-C contact maps has become crucial for studying the three-dimensional structure of chromosomes. However, the structure and function of chromatin loops are highly dynamic and diverse, influenced by multiple factors. Therefore, obtaining the three-dimensional structure of the genome remains a challenging task. Many chromatin loop prediction methods struggle to fully extract features from the contact map and to make accurate predictions at low sequencing depths. Results: In this study, we put forward a deep learning framework based on the diffusion model, called CD-Loop, for accurately predicting chromatin loops. First, by pre-training on the input data, we obtain prior probabilities for predicting the classification of the Hi-C contact map. Then, by combining the denoising process based on the diffusion model with the prior probability obtained by pre-training, candidate loops are predicted from the input Hi-C contact map. Finally, CD-Loop uses a density-based clustering algorithm to cluster the candidate chromatin loops and predict the final chromatin loops. We compared CD-Loop with currently popular methods, such as Peakachu, Chromosight, and Mustache, and found that across different cell types, species, and sequencing depths, CD-Loop outperforms the other methods in loop annotation. We conclude that CD-Loop can accurately predict chromatin loops and reveal cell-type specificity. The code is available at https://github.com/wangyang199897/CD-Loop.
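The final step described in this abstract, clustering candidate loop pixels with a density-based algorithm and reporting one loop per cluster, might look like the sketch below. It uses a minimal pure-Python DBSCAN. The abstract does not name the specific algorithm, so the choice of DBSCAN, the `eps` and `min_pts` values, the rule of keeping the highest-scoring pixel per cluster, and the dropping of noise points are all assumptions for illustration rather than the CD-Loop implementation.

```python
def dbscan(points, eps, min_pts):
    """Minimal DBSCAN over 2D points; returns one label per point (-1 = noise)."""
    def neighbors(i):
        xi, yi = points[i]
        return [j for j, (xj, yj) in enumerate(points)
                if (xi - xj) ** 2 + (yi - yj) ** 2 <= eps * eps]

    labels = [None] * len(points)
    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = -1  # provisionally noise; may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # border point: joins the cluster, not expanded
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbors(j)
            if len(reachable) >= min_pts:  # core point: keep expanding
                queue.extend(reachable)
    return labels

def call_loops(candidates, eps=1.5, min_pts=2):
    """candidates: (row_bin, col_bin, score) tuples; keep the best pixel per cluster."""
    labels = dbscan([(r, c) for r, c, _ in candidates], eps, min_pts)
    best = {}
    for (r, c, score), label in zip(candidates, labels):
        if label == -1:
            continue  # assumption: isolated candidates are discarded
        if label not in best or score > best[label][2]:
            best[label] = (r, c, score)
    return sorted(best.values())

# Toy candidates (hypothetical): two dense groups and one isolated pixel.
candidates = [(10, 50, 0.7), (10, 51, 0.9), (11, 50, 0.8),
              (40, 90, 0.6), (41, 90, 0.95), (70, 20, 0.99)]
loops = call_loops(candidates)
# loops == [(10, 51, 0.9), (41, 90, 0.95)]
```

Collapsing each cluster to a single pixel avoids reporting the same loop several times when neighboring bins all score highly, which is a common situation in contact maps.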
Affiliation(s)
- Junwei Luo
- School of Software, Henan Polytechnic University, Jiaozuo, China
9
Wall BPG, Nguyen M, Harrell JC, Dozmorov MG. Machine and deep learning methods for predicting 3D genome organization. arXiv 2024:arXiv:2403.03231v1. [PMID: 38495565 PMCID: PMC10942493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2024]
Abstract
Three-Dimensional (3D) chromatin interactions, such as enhancer-promoter interactions (EPIs), loops, Topologically Associating Domains (TADs), and A/B compartments play critical roles in a wide range of cellular processes by regulating gene expression. Recent development of chromatin conformation capture technologies has enabled genome-wide profiling of various 3D structures, even with single cells. However, current catalogs of 3D structures remain incomplete and unreliable due to differences in technology, tools, and low data resolution. Machine learning methods have emerged as an alternative to obtain missing 3D interactions and/or improve resolution. Such methods frequently use genome annotation data (ChIP-seq, DNAse-seq, etc.), DNA sequencing information (k-mers, Transcription Factor Binding Site (TFBS) motifs), and other genomic properties to learn the associations between genomic features and chromatin interactions. In this review, we discuss computational tools for predicting three types of 3D interactions (EPIs, chromatin interactions, TAD boundaries) and analyze their pros and cons. We also point out obstacles of computational prediction of 3D interactions and suggest future research directions.
Affiliation(s)
- Brydon P. G. Wall
- Center for Biological Data Science, Virginia Commonwealth University, Richmond, VA, 23284, USA
- My Nguyen
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
- J. Chuck Harrell
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
- Massey Comprehensive Cancer Center, Virginia Commonwealth University, Richmond, VA 23298, USA
- Center for Pharmaceutical Engineering, Virginia Commonwealth University, Richmond, VA 23298, USA
- Mikhail G. Dozmorov
- Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, 23298, USA
- Department of Pathology, Virginia Commonwealth University, Richmond, VA, 23284, USA
10
Wang Y, Guo X, Niu Z, Huang X, Wang B, Gao L. DeepCBS: shedding light on the impact of mutations occurring at CTCF binding sites. Front Genet 2024; 15:1354208. [PMID: 38463168 PMCID: PMC10920299 DOI: 10.3389/fgene.2024.1354208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Accepted: 01/30/2024] [Indexed: 03/12/2024]
Abstract
CTCF-mediated chromatin loops create insulated neighborhoods that constrain promoter-enhancer interactions, serving as a unit of gene regulation. Disruption of CTCF binding sites (CBS) leads to the destruction of insulated neighborhoods, which in turn can cause dysregulation of the contained genes. A recent study found that CTCF/cohesin binding sites are a major mutational hotspot in the cancer genome. Mutations can affect CTCF binding, causing the disruption of insulated neighborhoods. Our analysis also reveals a significant enrichment of well-known proto-oncogenes in insulated neighborhoods with mutations specifically occurring in anchor regions. It can be assumed that some mutations disrupt CTCF binding, leading to the disruption of insulated neighborhoods and subsequent activation of proto-oncogenes within these insulated neighborhoods. To explore the consequences of such mutations, we develop DeepCBS, a computational tool capable of analyzing mutations at CTCF binding sites, predicting their influence on insulated neighborhoods, and investigating the potential activation of proto-oncogenes. Furthermore, DeepCBS is applied to somatic mutation data of liver cancer. As a result, 87 mutations that disrupt CTCF binding sites are identified, which leads to the identification of 237 disrupted insulated neighborhoods containing a total of 135 genes. Integrative analysis of gene expression differences in liver cancer further highlights three genes: ARHGEF39, UBE2C and DQX1. Among them, ARHGEF39 and UBE2C have been reported in the literature as potential oncogenes involved in the development of liver cancer. The results indicate that DQX1 may be a potential oncogene in liver cancer and may contribute to tumor immune escape.
In conclusion, DeepCBS is a promising method to analyze impacts of mutations occurring at CTCF binding sites on the insulator function of CTCF, with potential extensions to shed light on the effects of mutations on other functions of CTCF.
Affiliation(s)
- Xingli Guo
- School of Computer Science and Technology, Xidian University, Xi’an, China
11
Tan W, Shen Y. Multimodal learning of noncoding variant effects using genome sequence and chromatin structure. Bioinformatics 2023; 39:btad541. [PMID: 37669132 PMCID: PMC10502240 DOI: 10.1093/bioinformatics/btad541] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/22/2022] [Revised: 08/28/2023] [Accepted: 09/04/2023] [Indexed: 09/07/2023] Open
Abstract
MOTIVATION: A growing number of noncoding genetic variants, including single-nucleotide polymorphisms, are found to be associated with complex human traits and diseases. Their mechanistic interpretation is relatively limited and could benefit from computational prediction of their effects on epigenetic profiles. However, current models often focus on local, 1D genome sequence determinants and disregard global, 3D chromatin structure that critically affects epigenetic events. RESULTS: We find that noncoding variants with unexpectedly high similarity in epigenetic profiles, given their relatively low similarity in local sequences, can be largely attributed to their proximity in chromatin structure. Accordingly, we have developed a multimodal deep learning scheme that incorporates both 1D genome sequence and 3D chromatin structure data for predicting noncoding variant effects. Specifically, we have integrated convolutional and recurrent neural networks for sequence embedding and graph neural networks for structure embedding despite the resolution gap between the two types of data, while utilizing recent DNA language models. Numerical results show that our models outperform competing sequence-only models in predicting epigenetic profiles and that their use of long-range interactions complements sequence-only models in extracting regulatory motifs. They prove to be excellent predictors for noncoding variant effects in gene expression and pathogenicity, whether in unsupervised "zero-shot" learning or supervised "few-shot" learning. AVAILABILITY AND IMPLEMENTATION: Codes and data can be accessed at https://github.com/Shen-Lab/ncVarPred-1D3D and https://zenodo.org/record/7975777.
Affiliation(s)
- Wuwei Tan
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Yang Shen
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX 77843, United States
- Department of Computer Science and Engineering, Texas A&M University, College Station, TX 77843, United States
- Institute of Biosciences and Technology and Department of Translational Medical Sciences, College of Medicine, Texas A&M University, Houston, TX 77030, United States
12
Li Z, Portillo-Ledesma S, Schlick T. Techniques for and challenges in reconstructing 3D genome structures from 2D chromosome conformation capture data. Curr Opin Cell Biol 2023; 83:102209. [PMID: 37506571 PMCID: PMC10529954 DOI: 10.1016/j.ceb.2023.102209] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 05/15/2023] [Revised: 06/07/2023] [Accepted: 06/26/2023] [Indexed: 07/30/2023]
Abstract
Chromosome conformation capture technologies that provide frequency information for contacts between genomic regions have been crucial for increasing our understanding of genome folding and regulation. However, such data do not provide direct evidence of the spatial 3D organization of chromatin. In this opinion article, we discuss the development and application of computational methods to reconstruct chromatin 3D structures from experimental 2D contact data, highlighting how such modeling provides biological insights and can suggest mechanisms anchored to experimental data. By applying different reconstruction methods to the same contact data, we illustrate the state of the art of these techniques and discuss our gene-resolution approach based on Brownian dynamics and Monte Carlo sampling.
Affiliation(s)
- Zilong Li
- Department of Chemistry, New York University, 100 Washington Square East, Silver Building, New York, 10003, NY, USA; Simons Center for Computational Physical Chemistry, New York University, 24 Waverly Place, Silver Building, New York, NY, 10003, USA
- Stephanie Portillo-Ledesma
- Department of Chemistry, New York University, 100 Washington Square East, Silver Building, New York, 10003, NY, USA; Simons Center for Computational Physical Chemistry, New York University, 24 Waverly Place, Silver Building, New York, NY, 10003, USA
- Tamar Schlick
- Department of Chemistry, New York University, 100 Washington Square East, Silver Building, New York, 10003, NY, USA; Courant Institute of Mathematical Sciences, New York University, 251 Mercer St., New York, 10012, NY, USA; New York University-East China Normal University Center for Computational Chemistry, New York University Shanghai, Room 340, Geography Building, 3663 North Zhongshan Road, Shanghai, 200122, China; Simons Center for Computational Physical Chemistry, New York University, 24 Waverly Place, Silver Building, New York, NY, 10003, USA.
13
Ma W, Fu Y, Bao Y, Wang Z, Lei B, Zheng W, Wang C, Liu Y. DeepSATA: A Deep Learning-Based Sequence Analyzer Incorporating the Transcription Factor Binding Affinity to Dissect the Effects of Non-Coding Genetic Variants. Int J Mol Sci 2023; 24:12023. [PMID: 37569400 PMCID: PMC10418434 DOI: 10.3390/ijms241512023] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/12/2023] [Revised: 07/13/2023] [Accepted: 07/24/2023] [Indexed: 08/13/2023] Open
Abstract
Utilizing large-scale epigenomics data, deep learning tools can predict the regulatory activity of genomic sequences, annotate non-coding genetic variants, and uncover mechanisms behind complex traits. However, these tools primarily rely on human or mouse data for training, limiting their performance when applied to other species. Furthermore, the limited exploration of many species, particularly in the case of livestock, has led to a scarcity of comprehensive and high-quality epigenetic data, posing challenges in developing reliable deep learning models for decoding their non-coding genomes. The cross-species prediction of the regulatory genome can be achieved by leveraging publicly available data from extensively studied organisms and making use of the conserved DNA binding preferences of transcription factors within the same tissue. In this study, we introduced DeepSATA, a novel deep learning-based sequence analyzer that incorporates the transcription factor binding affinity for the cross-species prediction of chromatin accessibility. By applying DeepSATA to analyze the genomes of pigs, chickens, cattle, humans, and mice, we demonstrated its ability to improve the prediction accuracy of chromatin accessibility and achieve reliable cross-species predictions in animals. Additionally, we showcased its effectiveness in analyzing pig genetic variants associated with economic traits and in increasing the accuracy of genomic predictions. Overall, our study presents a valuable tool to explore the epigenomic landscape of various species and pinpoint regulatory deoxyribonucleic acid (DNA) variants associated with complex traits.
Affiliation(s)
- Wenlong Ma
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Yang Fu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Yongzhou Bao
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- School of Life Sciences, Henan University, Kaifeng 475004, China
- Zhen Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- School of Life Sciences, Henan University, Kaifeng 475004, China
- Bowen Lei
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture and Rural Affairs, Huazhong Agricultural University, Wuhan 430070, China
- Weigang Zheng
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture and Rural Affairs, Huazhong Agricultural University, Wuhan 430070, China
- Chao Wang
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Key Laboratory of Agricultural Animal Genetics, Breeding and Reproduction of Ministry of Education & Key Lab of Swine Genetics and Breeding of Ministry of Agriculture and Rural Affairs, Huazhong Agricultural University, Wuhan 430070, China
- Yuwen Liu
- Shenzhen Branch, Guangdong Laboratory for Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-Omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China; (W.M.); (Y.F.); (Y.B.); (Z.W.); (B.L.); (W.Z.); (C.W.)
- Innovation Group of Pig Genome Design and Breeding, Research Centre for Animal Genome, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen 518124, China
- Kunpeng Institute of Modern Agriculture at Foshan, Chinese Academy of Agricultural Sciences, Foshan 528226, China
14
Yang R, Das A, Gao VR, Karbalayghareh A, Noble WS, Bilmes JA, Leslie CS. Epiphany: predicting Hi-C contact maps from 1D epigenomic signals. Genome Biol 2023; 24:134. [PMID: 37280678 PMCID: PMC10242996 DOI: 10.1186/s13059-023-02934-9] [Citation(s) in RCA: 28] [Impact Index Per Article: 14.0] [Received: 12/20/2021] [Accepted: 04/06/2023] [Indexed: 06/08/2023] Open
Abstract
Recent deep learning models that predict the Hi-C contact map from DNA sequence achieve promising accuracy but cannot generalize to new cell types or even capture differences among training cell types. We propose Epiphany, a neural network to predict cell-type-specific Hi-C contact maps from widely available epigenomic tracks. Epiphany uses bidirectional long short-term memory layers to capture long-range dependencies and optionally a generative adversarial network architecture to encourage contact map realism. Epiphany shows excellent generalization to held-out chromosomes within and across cell types, yields accurate TAD and interaction calls, and predicts structural changes caused by perturbations of epigenomic signals.
Affiliation(s)
- Rui Yang
- Memorial Sloan Kettering Cancer Center, New York, USA
- Arnav Das
- University of Washington, Seattle, USA
- Vianne R Gao
- Memorial Sloan Kettering Cancer Center, New York, USA
15
Woo BJ, Moussavi-Baygi R, Karner H, Karimzadeh M, Garcia K, Joshi T, Yin K, Navickas A, Gilbert LA, Wang B, Asgharian H, Feng FY, Goodarzi H. Integrative identification of non-coding regulatory regions driving metastatic prostate cancer. bioRxiv 2023:2023.04.14.535921. [PMID: 37398273 PMCID: PMC10312451 DOI: 10.1101/2023.04.14.535921] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/04/2023]
Abstract
Large-scale sequencing efforts of thousands of tumor samples have been undertaken to understand the mutational landscape of the coding genome. However, the vast majority of germline and somatic variants occur within non-coding portions of the genome. These genomic regions do not directly encode specific proteins, but can play key roles in cancer progression, for example by driving aberrant gene expression control. Here, we designed an integrative computational and experimental framework to identify recurrently mutated non-coding regulatory regions that drive tumor progression. Application of this approach to whole-genome sequencing (WGS) data from a large cohort of metastatic castration-resistant prostate cancer (mCRPC) revealed a large set of recurrently mutated regions. We used (i) in silico prioritization of functional non-coding mutations, (ii) massively parallel reporter assays, and (iii) in vivo CRISPR-interference (CRISPRi) screens in xenografted mice to systematically identify and validate regulatory regions that drive mCRPC. We discovered that one of these enhancer regions, GH22I030351, acts on a bidirectional promoter to simultaneously modulate expression of U2-associated splicing factor SF3A1 and chromosomal protein CCDC157. We found that both SF3A1 and CCDC157 are promoters of tumor growth in xenograft models of prostate cancer. We nominated a number of transcription factors, including SOX6, to be responsible for higher expression of SF3A1 and CCDC157. Collectively, we have established and confirmed an integrative computational and experimental approach that enables the systematic detection of non-coding regulatory regions that drive the progression of human cancers.
Affiliation(s)
- Brian J Woo
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Ruhollah Moussavi-Baygi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Heather Karner
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Mehran Karimzadeh
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Vector Institute, Toronto, ON, Canada
- Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada
- Arc Institute, Palo Alto 94305, USA
- Kristle Garcia
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Tanvi Joshi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Keyi Yin
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Albertas Navickas
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Luke A. Gilbert
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Arc Institute, Palo Alto 94305, USA
- Bo Wang
- Vector Institute, Toronto, ON, Canada
- Peter Munk Cardiac Centre, University Health Network, Toronto, ON, Canada
- Hosseinali Asgharian
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, US
- Felix Y. Feng
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Department of Radiation Oncology, University of California, San Francisco, San Francisco, California, USA
- Hani Goodarzi
- Department of Biochemistry & Biophysics, University of California, San Francisco, San Francisco, California, USA
- Department of Urology, University of California, San Francisco, San Francisco, California, USA
- Helen Diller Family Comprehensive Cancer Center, University of California, San Francisco, San Francisco, California, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, CA, US
16
Hamamoto R, Takasawa K, Shinkai N, Machino H, Kouno N, Asada K, Komatsu M, Kaneko S. Analysis of super-enhancer using machine learning and its application to medical biology. Brief Bioinform 2023; 24:bbad107. [PMID: 36960780 PMCID: PMC10199775 DOI: 10.1093/bib/bbad107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Revised: 02/11/2023] [Accepted: 03/01/2023] [Indexed: 03/25/2023] Open
Abstract
The analysis of super-enhancers (SEs) has recently attracted attention in elucidating the molecular mechanisms of cancer and other diseases. SEs are genomic structures that strongly induce gene expression and have been reported to contribute to the overexpression of oncogenes. Because the analysis of SEs and integrated analysis with other data are performed using large amounts of genome-wide data, artificial intelligence technology, with machine learning at its core, has recently begun to be utilized. In promoting precision medicine, it is important to consider information from SEs in addition to genomic data; therefore, machine learning technology is expected to be introduced appropriately in terms of building a robust analysis platform with a high generalization performance. In this review, we explain the history and principles of SE, and the results of SE analysis using state-of-the-art machine learning and integrated analysis with other data are presented to provide a comprehensive understanding of the current status of SE analysis in the field of medical biology. Additionally, we compared the accuracy between existing machine learning methods on the benchmark dataset and attempted to explore the kind of data preprocessing and integration work needed to make the existing algorithms work on the benchmark dataset. Furthermore, we discuss the issues and future directions of current SE analysis.
Affiliation(s)
- Ryuji Hamamoto
- Division Chief in the Division of Medical AI Research and Development, National Cancer Center Research Institute; a Professor in the Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University and a Team Leader of the Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project
- Ken Takasawa
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff in the Medical AI Research and Development, National Cancer Center Research Institute
- Norio Shinkai
- Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University
- Hidenori Machino
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff in the Medical AI Research and Development, National Cancer Center Research Institute
- Nobuji Kouno
- Department of Surgery, Graduate School of Medicine, Kyoto University
- Ken Asada
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff of Medical AI Research and Development, National Cancer Center Research Institute
- Masaaki Komatsu
- Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff of Medical AI Research and Development, National Cancer Center Research Institute
- Syuzo Kaneko
- Division of Medical AI Research and Development, National Cancer Center Research Institute and a Visiting Scientist in the Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project
17
Agarwal A, Chen L. DeepPHiC: predicting promoter-centered chromatin interactions using a novel deep learning approach. Bioinformatics 2023; 39:6887158. [PMID: 36495179 PMCID: PMC9825766 DOI: 10.1093/bioinformatics/btac801] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 05/24/2022] [Revised: 11/23/2022] [Accepted: 12/09/2022] [Indexed: 12/14/2022] Open
Abstract
MOTIVATION: Promoter-centered chromatin interactions, which include promoter-enhancer (PE) and promoter-promoter (PP) interactions, are important to decipher gene regulation and disease mechanisms. The development of next-generation sequencing technologies such as promoter capture Hi-C (pcHi-C) has led to the discovery of promoter-centered chromatin interactions. However, pcHi-C experiments are expensive and thus may be unavailable for tissues/cell types of interest. In addition, these experiments may be underpowered due to insufficient sequencing depth or various artifacts, which limits the number of interactions detected. Most existing computational methods for predicting chromatin interactions are based on in situ Hi-C and can detect chromatin interactions across the entire genome. However, they may not be optimal for predicting promoter-centered chromatin interactions. RESULTS: We develop a supervised multi-modal deep learning model, which utilizes a comprehensive set of features such as genomic sequence, epigenetic signal, anchor distance, evolutionary features and DNA structural features to predict tissue/cell type-specific PE and PP interactions. We further extend the deep learning model in multi-task learning and transfer learning frameworks and demonstrate that the proposed approach outperforms state-of-the-art deep learning methods. Moreover, the proposed approach can achieve comparable prediction performance using predefined biologically relevant tissues/cell types compared to using all tissues/cell types in the pretraining, especially for predicting PE interactions. The prediction performance can be further improved by using computationally inferred biologically relevant tissues/cell types in the pretraining, which are defined based on the common genes in the proximity of two anchors in the chromatin interactions. AVAILABILITY AND IMPLEMENTATION: https://github.com/lichen-lab/DeepPHiC.
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Aman Agarwal
- Department of Computer Science, Indiana University, Bloomington, IN 47405, USA
- Li Chen
- To whom correspondence should be addressed.
18
Lan AY, Corces MR. Deep learning approaches for noncoding variant prioritization in neurodegenerative diseases. Front Aging Neurosci 2022; 14:1027224. [PMID: 36466610 PMCID: PMC9716280 DOI: 10.3389/fnagi.2022.1027224] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2022] [Accepted: 10/24/2022] [Indexed: 11/19/2022] Open
Abstract
Determining how noncoding genetic variants contribute to neurodegenerative dementias is fundamental to understanding disease pathogenesis, improving patient prognostication, and developing new clinical treatments. Next generation sequencing technologies have produced vast amounts of genomic data on cell type-specific transcription factor binding, gene expression, and three-dimensional chromatin interactions, with the promise of providing key insights into the biological mechanisms underlying disease. However, this data is highly complex, making it challenging for researchers to interpret, assimilate, and dissect. To this end, deep learning has emerged as a powerful tool for genome analysis that can capture the intricate patterns and dependencies within these large datasets. In this review, we organize and discuss the many unique model architectures, development philosophies, and interpretation methods that have emerged in the last few years with a focus on using deep learning to predict the impact of genetic variants on disease pathogenesis. We highlight both broadly-applicable genomic deep learning methods that can be fine-tuned to disease-specific contexts as well as existing neurodegenerative disease research, with an emphasis on Alzheimer's-specific literature. We conclude with an overview of the future of the field at the intersection of neurodegeneration, genomics, and deep learning.
Affiliation(s)
- Alexander Y. Lan
- Gladstone Institute of Neurological Disease, San Francisco, CA, United States
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
- Department of Neurology, University of California San Francisco, San Francisco, CA, United States
- M. Ryan Corces
- Gladstone Institute of Neurological Disease, San Francisco, CA, United States
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, United States
- Department of Neurology, University of California San Francisco, San Francisco, CA, United States
19
DLoopCaller: A deep learning approach for predicting genome-wide chromatin loops by integrating accessible chromatin landscapes. PLoS Comput Biol 2022; 18:e1010572. [PMID: 36206320 PMCID: PMC9581407 DOI: 10.1371/journal.pcbi.1010572] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 05/04/2022] [Revised: 10/19/2022] [Accepted: 09/14/2022] [Indexed: 11/20/2022] Open
Abstract
In recent years, major advances have been made in various chromosome conformation capture technologies to further satisfy the needs of researchers for high-quality, high-resolution contact interactions. Discriminating the loops from genome-wide contact interactions is crucial for dissecting three-dimensional (3D) genome structure and function. Here, we present a deep learning method to predict genome-wide chromatin loops, called DLoopCaller, by combining accessible chromatin landscapes and raw Hi-C contact maps. Available orthogonal data (ChIA-PET/HiChIP and Capture Hi-C) were used to generate positive samples with a wider contact matrix, which makes it possible to find more potential genome-wide chromatin loops. The experimental results demonstrate that DLoopCaller effectively improves the accuracy of predicting genome-wide chromatin loops compared to the state-of-the-art method Peakachu. Moreover, compared to two of the most popular loop callers, HiCCUPS and Fit-Hi-C, DLoopCaller identifies some unique interactions. We conclude that a combination of chromatin landscapes on the one-dimensional genome contributes to understanding the 3D genome organization, and the identified chromatin loops reveal cell-type specificity and transcription factor motif co-enrichment across different cell lines and species.
|
20
|
Yang M, Ma J. Machine Learning Methods for Exploring Sequence Determinants of 3D Genome Organization. J Mol Biol 2022; 434:167666. [PMID: 35659533 DOI: 10.1016/j.jmb.2022.167666] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 12/27/2021] [Revised: 05/23/2022] [Accepted: 05/27/2022] [Indexed: 01/25/2023]
Abstract
In higher eukaryotic cells, chromosomes are folded inside the nucleus. Recent advances in whole-genome mapping technologies have revealed the multiscale features of 3D genome organization that are intertwined with fundamental genome functions. However, DNA sequence determinants that modulate the formation of 3D genome organization remain poorly characterized. In the past few years, predicting 3D genome organization based on DNA sequence features has become an active area of research. Here, we review the recent progress in computational approaches to unraveling important sequence elements for 3D genome organization. In particular, we discuss the rapid development of machine learning-based methods that facilitate the connections between DNA sequence features and 3D genome architectures at different scales. While much progress has been made in developing predictive models for revealing important sequence features for 3D genome organization, new research is urgently needed to incorporate multi-omic data and enhance model interpretability, further advancing our understanding of gene regulation mechanisms through the lens of 3D genome organization.
Affiliation(s)
- Muyu Yang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, United States. https://twitter.com/muyu_wendy_yang
- Jian Ma
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, United States.
|
21
|
Alharbi WS, Rashid M. A review of deep learning applications in human genomics using next-generation sequencing data. Hum Genomics 2022; 16:26. [PMID: 35879805 PMCID: PMC9317091 DOI: 10.1186/s40246-022-00396-x] [Citation(s) in RCA: 33] [Impact Index Per Article: 11.0] [Received: 11/24/2021] [Accepted: 07/12/2022] [Indexed: 12/02/2022] Open
Abstract
Genomics is advancing towards data-driven science. With the advent of high-throughput data-generating technologies in human genomics, we are overwhelmed by the volume of genomic data. To extract knowledge and patterns from these genomic data, artificial intelligence, especially deep learning methods, has been instrumental. In the current review, we address the development and application of deep learning methods and models in different subareas of human genomics. We assess which areas of genomics have been over- and under-charted by deep learning techniques. The deep learning algorithms underlying the genomic tools are discussed briefly in the later part of this review. Finally, we briefly discuss recent applications of deep learning tools in genomics. In conclusion, this review is timely for biotechnology and genomics scientists, guiding them on why, when and how to use deep learning methods to analyse human genomic data.
Affiliation(s)
- Wardah S Alharbi
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia
- Mamoon Rashid
- Department of AI and Bioinformatics, King Abdullah International Medical Research Center (KAIMRC), King Saud Bin Abdulaziz University for Health Sciences (KSAU-HS), King Abdulaziz Medical City, Ministry of National Guard Health Affairs, P.O. Box 22490, Riyadh, 11426, Saudi Arabia.
|
22
|
Piecyk RS, Schlegel L, Johannes F. Predicting 3D chromatin interactions from DNA sequence using Deep Learning. Comput Struct Biotechnol J 2022; 20:3439-3448. [PMID: 35832620 PMCID: PMC9271978 DOI: 10.1016/j.csbj.2022.06.047] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 05/13/2022] [Revised: 06/21/2022] [Accepted: 06/21/2022] [Indexed: 11/22/2022] Open
Abstract
Gene regulation in eukaryotes is profoundly shaped by the 3D organization of chromatin within the cell nucleus. Distal regulatory interactions between enhancers and their target genes are widespread and many causal loci underlying heritable agricultural or clinical traits have been mapped to distal cis-regulatory elements. Dissecting the sequence features that mediate such distal interactions is key to understanding their underlying biology. Deep Learning (DL) models coupled with genome-wide 3C-based sequencing data have emerged as powerful tools to infer the DNA sequence grammar underlying such distal interactions. In this review we show that most DL models have remarkably high prediction accuracy, which indicates that DNA sequence features are important determinants of chromatin looping. However, DL model training has so far been limited to a small set of human cell lines, raising questions about the generalization of these predictions to other tissue types and species. Furthermore, we find that the model architecture seems less relevant for model performance than the training strategy and the data preparation step. Transfer learning, coupled with functionally curated interactions, appears to be the most promising approach to learn cell-type-specific and possibly species-specific sequence features in future applications.
Affiliation(s)
- Robert S. Piecyk
- Department of Molecular Life Sciences, Technical University of Munich, Freising, Germany
- Luca Schlegel
- Department of Molecular Life Sciences, Technical University of Munich, Freising, Germany
- Frank Johannes
- Department of Molecular Life Sciences, Technical University of Munich, Freising, Germany
- TUM Institute for Advanced Study, Garching, Germany
|
23
|
Yang D, Chung T, Kim D. DeepLUCIA: predicting tissue-specific chromatin loops using Deep Learning-based Universal Chromatin Interaction Annotator. Bioinformatics 2022; 38:3501-3512. [PMID: 35640981 DOI: 10.1093/bioinformatics/btac373] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 12/21/2021] [Revised: 04/17/2022] [Accepted: 05/27/2022] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION: The importance of chromatin loops in gene regulation is broadly accepted. There are two main approaches to predicting chromatin loops: the transcription factor (TF) binding-dependent approach and the genomic variation-based approach. However, neither of these approaches provides an adequate understanding of gene regulation in human tissues. To address this issue, we developed a deep learning-based chromatin loop prediction model called DeepLUCIA (Deep Learning-based Universal Chromatin Interaction Annotator). RESULTS: Although DeepLUCIA does not use the TF binding profile data on which previous TF binding-dependent methods critically rely, its prediction accuracies are comparable to those of the previous TF binding-dependent methods. More importantly, DeepLUCIA enables tissue-specific chromatin loop predictions from tissue-specific epigenomes, which cannot be handled by the genomic variation-based approach. We demonstrated the utility of DeepLUCIA by predicting several novel target genes of SNPs identified in genome-wide association studies targeting Brugada syndrome, COVID-19 severity, and age-related macular degeneration. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dongchan Yang
- Department of Bio and Brain Engineering, KAIST, Daejeon, 34141, Republic of Korea
- Taesu Chung
- Biotechnology & Healthcare Examination Division, KIPO, Daejeon, 35208, Republic of Korea
- Dongsup Kim
- Department of Bio and Brain Engineering, KAIST, Daejeon, 34141, Republic of Korea
|
24
|
Avdeyev P, Zhou J. Computational Approaches for Understanding Sequence Variation Effects on the 3D Genome Architecture. Annu Rev Biomed Data Sci 2022; 5:183-204. [PMID: 35537461 DOI: 10.1146/annurev-biodatasci-102521-012018] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 11/09/2022]
Abstract
Decoding how genomic sequence and its variations affect 3D genome architecture is indispensable for understanding the genetic architecture of various traits and diseases. The 3D genome organization can be significantly altered by genome variations and in turn impact the function of the genomic sequence. Techniques for measuring the 3D genome architecture across spatial scales have opened up new possibilities for understanding how the 3D genome depends upon the genomic sequence and how it can be altered by sequence variations. Computational methods have become instrumental in analyzing and modeling the sequence effects on 3D genome architecture, and recent developments in deep learning sequence models have opened up new opportunities for studying the interplay between sequence variations and the 3D genome. In this review, we focus on computational approaches for both the detection and modeling of sequence variation effects on the 3D genome, and we discuss the opportunities presented by these approaches.
Affiliation(s)
- Pavel Avdeyev
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
- Jian Zhou
- Lyda Hill Department of Bioinformatics, University of Texas Southwestern Medical Center, Dallas, Texas, USA
|
25
|
Sefer E. ProbC: joint modeling of epigenome and transcriptome effects in 3D genome. BMC Genomics 2022; 23:287. [PMID: 35397520 PMCID: PMC8994916 DOI: 10.1186/s12864-022-08498-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 09/20/2021] [Accepted: 03/23/2022] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND: Hi-C and its high nucleosome resolution variant Micro-C provide a window into the spatial packing of a genome in 3D within the cell. Even though neither technique directly depends on the binding of specific antibodies, previous work has revealed enriched interactions and domain structures around multiple chromatin marks: epigenetic modifications and transcription factor binding sites. However, the joint impact of chromatin marks on Hi-C and Micro-C interactions has not been globally characterized, which limits our understanding of 3D genome characteristics. An emerging question is whether it is possible to deduce 3D genome characteristics and interactions by integrative analysis of multiple chromatin marks and to associate interactions with the functionality of the interacting loci. RESULTS: We propose a probabilistic method, PROBC, to decompose Hi-C and Micro-C interactions by known chromatin marks. PROBC is based on convex likelihood optimization, which can directly take into account both interaction existence and nonexistence. Through PROBC, we discover histone modifications (H3K27ac, H3K9me3, H3K4me3, H3K4me1) and CTCF as particularly predictive of Hi-C and Micro-C contacts across cell types and species. Moreover, histone modifications are more effective than transcription factor binding sites in explaining the genome's 3D shape through these interactions. PROBC can successfully predict Hi-C and Micro-C interactions in a given species even when it is trained on different cell types or species. For instance, when trained on mouse ES cells, it can predict missing nucleosome resolution Micro-C interactions in human ES cells from only these 5 chromatin marks with above 0.75 AUC. Additionally, PROBC outperforms the existing methods in predicting interactions across almost all chromosomes. CONCLUSION: Via our proposed method, we optimally decompose Hi-C interactions in terms of these chromatin marks at genome and chromosome levels. We find a subset of histone modifications and transcription factor binding sites to be predictive of both Hi-C and Micro-C interactions and TADs across human, mouse, and different cell types. Through learned models, we can predict interactions from chromatin marks alone in species for which Hi-C data may be limited.
Affiliation(s)
- Emre Sefer
- Department of Computer Science, Ozyegin University, Istanbul, Turkey.
|
26
|
InsuLock: A Weakly Supervised Learning Approach for Accurate Insulator Prediction, and Variant Impact Quantification. Genes (Basel) 2022; 13:genes13040621. [PMID: 35456427 PMCID: PMC9026820 DOI: 10.3390/genes13040621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2022] [Revised: 03/24/2022] [Accepted: 03/25/2022] [Indexed: 02/01/2023] Open
Abstract
Mapping chromatin insulator loops is crucial to investigating genome evolution, elucidating critical biological functions, and ultimately quantifying variant impact in diseases. However, chromatin conformation profiling assays are usually expensive, time-consuming, and may report fuzzy insulator annotations with low resolution. Therefore, we propose a weakly supervised deep learning method, InsuLock, to address these challenges. Specifically, InsuLock first utilizes a Siamese neural network to predict the existence of insulators within a given region (up to 2000 bp). Then, it uses an object detection module for precise insulator boundary localization via gradient-weighted class activation mapping (~40 bp resolution). Finally, it quantifies variant impacts by comparing the insulator score differences between the wild-type and mutant alleles. We applied InsuLock on various bulk and single-cell datasets for performance testing and benchmarking. We showed that it outperformed existing methods with an AUROC of ~0.96 and condensed insulator annotations to ~2.5% of their original size while still demonstrating higher conservation scores and better motif enrichments. Finally, we utilized InsuLock to make cell-type-specific variant impacts from brain scATAC-seq data and identified a schizophrenia GWAS variant disrupting an insulator loop proximal to a known risk gene, indicating a possible new mechanism of action for the disease.
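The variant-impact step described in this abstract (comparing insulator scores between the wild-type and mutant alleles) can be sketched as follows. This is a minimal illustration, not InsuLock's code: `variant_impact` and `toy_score` are hypothetical names, and the GC-fraction scorer is only a stand-in for a trained model.

```python
from typing import Callable


def variant_impact(score: Callable[[str], float], ref_seq: str, pos: int, alt: str) -> float:
    """Return score(mutant) - score(wild type) for a single-base substitution."""
    if not 0 <= pos < len(ref_seq):
        raise ValueError("pos lies outside the sequence")
    # Build the mutant allele by replacing one base of the wild-type sequence.
    mut_seq = ref_seq[:pos] + alt + ref_seq[pos + 1:]
    return score(mut_seq) - score(ref_seq)


def toy_score(seq: str) -> float:
    """Hypothetical stand-in for a trained insulator model: fraction of G/C bases."""
    return (seq.count("G") + seq.count("C")) / len(seq)


# A -> G at position 0 raises the toy score from 4/8 to 5/8.
delta = variant_impact(toy_score, "ACGTACGT", pos=0, alt="G")
```

A positive difference means the mutant allele scores higher than the wild type under the chosen scorer; with a real model the sign convention should be checked against its documentation.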
|
27
|
Shen Y, Zhong Q, Liu T, Wen Z, Shen W, Li L. CharID: a two-step model for universal prediction of interactions between chromatin accessible regions. Brief Bioinform 2022; 23:6514800. [PMID: 35077535 DOI: 10.1093/bib/bbab602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2021] [Revised: 12/23/2021] [Accepted: 12/24/2021] [Indexed: 11/14/2022] Open
Abstract
Open chromatin regions (OCRs) allow direct interaction between cis-regulatory elements and trans-acting factors. Therefore, predicting all potential OCR-mediated loops is essential for deciphering the regulation mechanism of gene expression. However, existing loop prediction tools are restricted to specific anchor types. Here, we present CharID (Chromatin Accessible Region Interaction Detector), a two-step model that combines neural network and ensemble learning to predict OCR-mediated loops. In the first step, CharID-Anchor, an attention-based hybrid CNN-BiGRU network is constructed to discriminate between the anchor and nonanchor OCRs. In the second step, CharID-Loop uses a gradient boosting decision tree with a chromosome-split strategy to predict the interactions between anchor OCRs. The performance was assessed in three human cell lines, and CharID showed superior prediction performance compared with other algorithms. In contrast to the methods designed to predict a particular type of loop, CharID can detect a variety of chromatin loops, not limited to enhancer-promoter loops or architectural protein-mediated loops. We constructed the OCR-mediated interaction network using the predicted loops and identified hub anchors, which are highlighted by their proximity to housekeeping genes. By analyzing loops containing SNPs associated with cardiovascular disease, we identified an SNP-gene loop indicating the regulation mechanism of GFOD1. Taken together, CharID universally predicts diverse chromatin loops beyond other state-of-the-art methods, which are limited by anchor types, and experimental techniques, which are limited by sensitivities drastically decaying with the genomic distance of anchors. Finally, we hosted Peaksniffer, a user-friendly web server that provides online prediction, query and visualization of OCRs and associated loops.
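The abstract does not spell out its chromosome-split strategy. One common reading, assumed here, is that whole chromosomes are held out for testing so that no chromosome contributes pairs to both training and test sets. The sketch below illustrates that idea only; the `Pair` layout and the function name are hypothetical.

```python
from typing import List, Set, Tuple

# Hypothetical record layout: (chromosome, first anchor start, second anchor start).
Pair = Tuple[str, int, int]


def chromosome_split(pairs: List[Pair], test_chroms: Set[str]) -> Tuple[List[Pair], List[Pair]]:
    """Split anchor pairs so that every held-out chromosome appears only in the test set."""
    train = [p for p in pairs if p[0] not in test_chroms]
    test = [p for p in pairs if p[0] in test_chroms]
    return train, test


pairs = [("chr1", 100, 900), ("chr1", 400, 2000), ("chr2", 50, 700), ("chr3", 10, 500)]
train, test = chromosome_split(pairs, {"chr2"})
```

Holding out whole chromosomes avoids leakage between neighbouring or overlapping anchors, which a random split over pairs would allow.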
Affiliation(s)
- Yin Shen
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Quan Zhong
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Tian Liu
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Zi Wen
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Wei Shen
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- Li Li
- Hubei Key Laboratory of Agricultural Bioinformatics, College of Informatics, Huazhong Agricultural University, Wuhan, 430070, P. R. China
- 3D Genomics Research Center, Huazhong Agricultural University, Wuhan, 430070, P. R. China
|
28
|
Kodali S, Meyer-Nava S, Landry S, Chakraborty A, Rivera-Mulia JC, Feng W. Epigenomic signatures associated with spontaneous and replication stress-induced DNA double strand breaks. Front Genet 2022; 13:907547. [PMID: 36506300 PMCID: PMC9730818 DOI: 10.3389/fgene.2022.907547] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2022] [Accepted: 11/07/2022] [Indexed: 11/25/2022] Open
Abstract
Common fragile sites (CFSs) are specific regions of every individual's genome that are predisposed to DNA double strand breaks (DSBs) and undergo subsequent rearrangements. CFS formation can be induced in vitro by mild levels of DNA replication stress, such as DNA polymerase inhibition or nucleotide pool disturbance. The mechanisms of CFS formation have been linked to DNA replication timing control, transcription activities, as well as chromatin organization. However, it is unclear what specific cis- or trans-factors regulate the interplay between replication and transcription that determines CFS formation. We recently reported genome-wide mapping of DNA DSBs under replication stress induced by aphidicolin in human lymphoblastoids for the first time. Here, we systematically compared these DSBs with regard to nearby epigenomic features mapped in the same cell line from published studies. We demonstrate that aphidicolin-induced DSBs are strongly correlated with histone 3 lysine 36 trimethylation, a marker for active transcription. We further demonstrate that this DSB signature is a composite effect of the dual treatment with aphidicolin and its solvent, dimethylsulfoxide, the latter of which potently induces transcription on its own. We also present complementary evidence for the association between DSBs and 3D chromosome architectural domains with high-density gene clusters and active transcription. Additionally, we show that while DSBs were detected at all but one of the fourteen finely mapped CFSs, they were not enriched in the CFS core sequences and rather demarcated the CFS core region. Related to this point, DSB density was not higher in large genes of greater than 300 kb, contrary to the reported enrichment of CFS sites at these large genes. Finally, replication timing analyses demonstrate that the CFS core regions contain initiation events, suggesting that altered replication dynamics are responsible for CFS formation at relatively higher levels of replication stress.
Affiliation(s)
- Sravan Kodali
- Department of Biochemistry and Molecular Biology, Upstate Medical University, Syracuse, NY, United States
- Silvia Meyer-Nava
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, United States
- Stephen Landry
- Department of Biochemistry and Molecular Biology, Upstate Medical University, Syracuse, NY, United States
- Arijita Chakraborty
- Department of Biochemistry and Molecular Biology, Upstate Medical University, Syracuse, NY, United States
- Juan Carlos Rivera-Mulia
- Department of Biochemistry, Molecular Biology and Biophysics, University of Minnesota, Minneapolis, MN, United States
- Wenyi Feng
- Department of Biochemistry and Molecular Biology, Upstate Medical University, Syracuse, NY, United States
- *Correspondence: Wenyi Feng,
|
29
|
Li R, Li L, Xu Y, Yang J. Machine learning meets omics: applications and perspectives. Brief Bioinform 2021; 23:6425809. [PMID: 34791021 DOI: 10.1093/bib/bbab460] [Citation(s) in RCA: 60] [Impact Index Per Article: 15.0] [Received: 08/23/2021] [Revised: 09/29/2021] [Accepted: 10/07/2021] [Indexed: 02/07/2023] Open
Abstract
The innovation of biotechnologies has allowed the accumulation of omics data at an alarming rate, thus introducing the era of 'big data'. Extracting inherent valuable knowledge from various omics data remains a daunting problem in bioinformatics. Better solutions often require more innovative methods for efficient handling and effective results. Recent advancements in integrated analysis and computational modeling of multi-omics data have helped address such needs in an increasingly harmonious manner. The development and application of machine learning have largely advanced our insights into biology and biomedicine and greatly promoted the development of therapeutic strategies, especially for precision medicine. Here, we present a comprehensive survey and discussion on what happened, is happening and will happen when machine learning meets omics. Specifically, we describe how artificial intelligence can be applied to omics studies and review recent advancements at the interface between machine learning and an ever-widening range of omics including genomics, transcriptomics, proteomics, metabolomics, radiomics, as well as those at the single-cell resolution. We also discuss and provide a synthesis of ideas, new insights, current challenges and perspectives of machine learning in omics.
Affiliation(s)
- Rufeng Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Lixin Li
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China
- Yungang Xu
- School of Electronics and Information, Northwestern Polytechnical University, Xi'an, 710129, China
- Juan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi'an Jiaotong University Health Science Center, Xi'an 710061, P. R. China; Key Laboratory of Environment and Genes Related to Diseases (Xi'an Jiaotong University), Ministry of Education of China, Xi'an 710061, P. R. China
|
30
|
Belokopytova P, Fishman V. Predicting Genome Architecture: Challenges and Solutions. Front Genet 2021; 11:617202. [PMID: 33552135 PMCID: PMC7862721 DOI: 10.3389/fgene.2020.617202] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 10/14/2020] [Accepted: 12/15/2020] [Indexed: 12/22/2022] Open
Abstract
Genome architecture plays a pivotal role in gene regulation. The use of high-throughput methods for chromatin profiling and 3-D interaction mapping provides rich experimental data sets describing genome organization and dynamics. These data challenge development of new models and algorithms connecting genome architecture with epigenetic marks. In this review, we describe how chromatin architecture could be reconstructed from epigenetic data using biophysical or statistical approaches. We discuss the applicability and limitations of these methods for understanding the mechanisms of chromatin organization. We also highlight the emergence of new predictive approaches for scoring effects of structural variations in human cells.
Affiliation(s)
- Polina Belokopytova
- Natural Sciences Department, Novosibirsk State University, Novosibirsk, Russia
- Institute of Cytology and Genetics Siberian Branch of Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
- Veniamin Fishman
- Natural Sciences Department, Novosibirsk State University, Novosibirsk, Russia
- Institute of Cytology and Genetics Siberian Branch of Russian Academy of Sciences (SB RAS), Novosibirsk, Russia
|
31
|
Tao H, Li H, Xu K, Hong H, Jiang S, Du G, Wang J, Sun Y, Huang X, Ding Y, Li F, Zheng X, Chen H, Bo X. Computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles. Brief Bioinform 2021; 22:6102668. [PMID: 33454752 PMCID: PMC8424394 DOI: 10.1093/bib/bbaa405] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 08/21/2020] [Revised: 11/26/2020] [Accepted: 12/10/2020] [Indexed: 12/14/2022] Open
Abstract
The exploration of three-dimensional chromatin interaction and organization provides insight into mechanisms underlying gene regulation, cell differentiation and disease development. Advances in chromosome conformation capture technologies, such as high-throughput chromosome conformation capture (Hi-C) and chromatin interaction analysis by paired-end tag (ChIA-PET), have enabled the exploration of chromatin interaction and organization. However, high-resolution Hi-C and ChIA-PET data are only available for a limited number of cell lines, and their acquisition is costly, time consuming, laborious and affected by theoretical limitations. Increasing evidence shows that DNA sequence and epigenomic features are informative predictors of regulatory interaction and chromatin architecture. Based on these features, numerous computational methods have been developed for the prediction of chromatin interaction and organization, but they are not yet extensively applied in biomedical studies. A systematic study to summarize and evaluate such methods is still needed to facilitate their application. Here, we summarize 48 computational methods for the prediction of chromatin interaction and organization using sequence and epigenomic profiles, categorize them and compare their performance. In addition, we provide a comprehensive guideline for the selection of suitable methods to predict chromatin interaction and organization based on the available data and the biological question of interest.
Affiliation(s)
- Huan Tao
- Beijing Institute of Radiation Medicine
- Hao Li
- Beijing Institute of Radiation Medicine
- Kang Xu
- Beijing Institute of Radiation Medicine
- Hao Hong
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Shuai Jiang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Guifang Du
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yu Sun
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Xin Huang
- Beijing Institute of Radiation Medicine, Department of Biotechnology
- Yang Ding
- Beijing Institute of Radiation Medicine
- Fei Li
- Chinese Academy of Sciences, Department of Computer Network Information Center
|
32
|
Kuang S, Wang L. Deep Learning of Sequence Patterns for CCCTC-Binding Factor-Mediated Chromatin Loop Formation. J Comput Biol 2020; 28:133-145. [PMID: 33232622 DOI: 10.1089/cmb.2020.0225] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 12/16/2022] Open
Abstract
The three-dimensional (3D) organization of the human genome is of crucial importance for gene regulation, and the CCCTC-binding factor (CTCF) plays an important role in chromatin interactions. However, it is still unclear what sequence patterns in addition to CTCF motif pairs determine chromatin loop formation. To discover the underlying sequence patterns, we have developed a deep learning model, called DeepCTCFLoop, to predict whether a chromatin loop can be formed between a pair of convergent or tandem CTCF motifs using only the DNA sequences of the motifs and their flanking regions. Our results suggest that DeepCTCFLoop can accurately distinguish the CTCF motif pairs forming chromatin loops from the ones not forming loops. It significantly outperforms CTCF-MP, a machine learning model based on word2vec and boosted trees, when using DNA sequences only. Furthermore, we show that DNA motifs binding to several transcription factors, including ZNF384, ZNF263, ASCL1, SP1, and ZEB1, may constitute the complex sequence patterns for CTCF-mediated chromatin loop formation. DeepCTCFLoop has also been applied to disease-associated sequence variants to identify candidates that may disrupt chromatin loop formation. Therefore, our results provide useful information for understanding the mechanism of 3D genome organization and may also help annotate and prioritize the noncoding sequence variants associated with human diseases.
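The "convergent or tandem" motif pairs mentioned in this abstract refer to the relative strand orientation of the two CTCF motifs. The sketch below labels a pair from its strands using the usual convention (upstream motif on the forward strand and downstream motif on the reverse strand is convergent). It is an illustration only; the function name is hypothetical and not taken from DeepCTCFLoop.

```python
def ctcf_pair_orientation(upstream_strand: str, downstream_strand: str) -> str:
    """Label a CTCF motif pair by strand, with motifs ordered by genomic coordinate."""
    key = (upstream_strand, downstream_strand)
    if key == ("+", "-"):
        return "convergent"  # motifs point toward each other
    if key == ("-", "+"):
        return "divergent"   # motifs point away from each other
    if key in (("+", "+"), ("-", "-")):
        return "tandem"      # motifs point in the same direction
    raise ValueError(f"unexpected strands: {key}")


labels = [ctcf_pair_orientation(a, b) for a, b in [("+", "-"), ("+", "+"), ("-", "+")]]
```

In a sequence-based classifier such a label would typically select which motif pairs are eligible for prediction, with the motif and flanking sequences supplied to the model as input.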
Affiliation(s)
- Shuzhen Kuang
- Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, USA; Department of Biological Sciences, Clemson University, Clemson, South Carolina, USA
- Liangjiang Wang
- Department of Genetics and Biochemistry, Clemson University, Clemson, South Carolina, USA
|
33
|
Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 2020; 17:1111-1117. [PMID: 33046897 PMCID: PMC8211359 DOI: 10.1038/s41592-020-0958-x] [Citation(s) in RCA: 169] [Impact Index Per Article: 33.8] [Received: 10/09/2019] [Accepted: 08/20/2020] [Indexed: 02/07/2023]
Abstract
In interphase, the human genome sequence folds in three dimensions into a rich variety of locus-specific contact patterns. Cohesin and CTCF (CCCTC-binding factor) are key regulators; perturbing the levels of either greatly disrupts genome-wide folding as assayed by chromosome conformation capture methods. Still, how a given DNA sequence encodes a particular locus-specific folding pattern remains unknown. Here we present a convolutional neural network, Akita, that accurately predicts genome folding from DNA sequence alone. Representations learned by Akita underscore the importance of an orientation-specific grammar for CTCF binding sites. Akita learns predictive nucleotide-level features of genome folding, revealing effects of nucleotides beyond the core CTCF motif. Once trained, Akita enables rapid in silico predictions. Accounting for this, we demonstrate how Akita can be used to perform in silico saturation mutagenesis, interpret eQTLs, make predictions for structural variants and probe species-specific genome folding. Collectively, these results enable decoding genome function from sequence through structure.
Affiliation(s)
- Geoff Fudenberg
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
| | | | - Katherine S Pollard
- Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA.
- Department of Epidemiology and Biostatistics, Institute for Human Genetics, Quantitative Biology Institute, and Institute for Computational Health Sciences, University of California, San Francisco, CA, USA.
- Chan Zuckerberg Biohub, San Francisco, CA, USA.
| |
|
34
|
Fudenberg G, Kelley DR, Pollard KS. Predicting 3D genome folding from DNA sequence with Akita. Nat Methods 2020; 17:1111-1117. [PMID: 33046897 DOI: 10.1101/800060] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 10/09/2019] [Accepted: 08/20/2020] [Indexed: 05/20/2023]
|
35
|
Trieu T, Martinez-Fundichely A, Khurana E. DeepMILO: a deep learning approach to predict the impact of non-coding sequence variants on 3D chromatin structure. Genome Biol 2020; 21:79. [PMID: 32216817 PMCID: PMC7098089 DOI: 10.1186/s13059-020-01987-4] [Citation(s) in RCA: 35] [Impact Index Per Article: 7.0] [Received: 11/04/2019] [Accepted: 03/06/2020] [Indexed: 12/17/2022] Open
Abstract
Non-coding variants have been shown to be related to disease by alteration of 3D genome structures. We propose a deep learning method, DeepMILO, to predict the effects of variants on CTCF/cohesin-mediated insulator loops. Application of DeepMILO on variants from whole-genome sequences of 1834 patients of twelve cancer types revealed 672 insulator loops disrupted in at least 10% of patients. Our results show mutations at loop anchors are associated with upregulation of the cancer driver genes BCL2 and MYC in malignant lymphoma thus pointing to a possible new mechanism for their dysregulation via alteration of insulator loops.
Affiliation(s)
- Tuan Trieu
- Meyer Cancer Center, Weill Cornell Medicine, New York, NY, 10065, USA.
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, 10065, USA.
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, 10021, USA.
| | - Alexander Martinez-Fundichely
- Meyer Cancer Center, Weill Cornell Medicine, New York, NY, 10065, USA
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, 10065, USA
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, 10021, USA
| | - Ekta Khurana
- Meyer Cancer Center, Weill Cornell Medicine, New York, NY, 10065, USA.
- Department of Physiology and Biophysics, Weill Cornell Medicine, New York, NY, 10065, USA.
- Institute for Computational Biomedicine, Weill Cornell Medicine, New York, NY, 10021, USA.
- Caryl and Israel Englander Institute for Precision Medicine, New York Presbyterian Hospital-Weill Cornell Medicine, New York, NY, 10065, USA.
| |
|