1
|
Smaruj PN, Xiao Y, Fudenberg G. Recipes and ingredients for deep learning models of 3D genome folding. Curr Opin Genet Dev 2025; 91:102308. [PMID: 39862604 PMCID: PMC11867851 DOI: 10.1016/j.gde.2024.102308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2024] [Revised: 12/19/2024] [Accepted: 12/31/2024] [Indexed: 01/27/2025]
Abstract
Three-dimensional genome folding plays roles in gene regulation and disease. In this review, we compare and contrast recent deep learning models for predicting genome contact maps. We survey preprocessing, architecture, training, evaluation, and interpretation methods, highlighting the capabilities and limitations of different models. In each area, we highlight challenges, opportunities, and potential future directions for genome-folding models.
Collapse
Affiliation(s)
- Paulina N Smaruj
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Yao Xiao
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA
| | - Geoffrey Fudenberg
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA, USA.
| |
Collapse
|
2
|
An Z, Jiang A, Chen J. Toward understanding the role of genomic repeat elements in neurodegenerative diseases. Neural Regen Res 2025; 20:646-659. [PMID: 38886931 PMCID: PMC11433896 DOI: 10.4103/nrr.nrr-d-23-01568] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Revised: 12/21/2023] [Accepted: 03/02/2024] [Indexed: 06/20/2024] Open
Abstract
Neurodegenerative diseases cause great medical and economic burdens for both patients and society; however, the complex molecular mechanisms thereof are not yet well understood. With the development of high-coverage sequencing technology, researchers have started to notice that genomic repeat regions, previously neglected in search of disease culprits, are active contributors to multiple neurodegenerative diseases. In this review, we describe the association between repeat element variants and multiple degenerative diseases through genome-wide association studies and targeted sequencing. We discuss the identification of disease-relevant repeat element variants, further powered by the advancement of long-read sequencing technologies and their related tools, and summarize recent findings in the molecular mechanisms of repeat element variants in brain degeneration, such as those causing transcriptional silencing or RNA-mediated gain of toxic function. Furthermore, we describe how in silico predictions using innovative computational models, such as deep learning language models, could enhance and accelerate our understanding of the functional impact of repeat element variants. Finally, we discuss future directions to advance current findings for a better understanding of neurodegenerative diseases and the clinical applications of genomic repeat elements.
Collapse
Affiliation(s)
- Zhengyu An
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Aidi Jiang
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
| | - Jingqi Chen
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China
- MOE Frontiers Center for Brain Science, Fudan University, Shanghai, China
- MOE Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence, Fudan University, Shanghai, China
- Zhangjiang Fudan International Innovation Center, Shanghai, China
| |
Collapse
|
3
|
Cheng W, Song Z, Zhang Y, Wang S, Wang D, Yang M, Li L, Ma J. DNALongBench: A Benchmark Suite for Long-Range DNA Prediction Tasks. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.01.06.631595. [PMID: 39829833 PMCID: PMC11741265 DOI: 10.1101/2025.01.06.631595] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 01/22/2025]
Abstract
Modeling long-range DNA dependencies is crucial for understanding genome structure and function across a wide range of biological contexts. However, effectively capturing these extensive dependencies, which may span millions of base pairs in tasks such as three-dimensional (3D) chromatin folding prediction, remains a significant challenge. Furthermore, a comprehensive benchmark suite for evaluating tasks that rely on long-range dependencies is notably absent. To address this gap, we introduce DNALongBench, a benchmark dataset encompassing five important genomics tasks that consider long-range dependencies up to 1 million base pairs: enhancer-target gene interaction, expression quantitative trait loci, 3D genome organization, regulatory sequence activity, and transcription initiation signals. To comprehensively assess DNALongBench, we evaluate the performance of five methods: a task-specific expert model, a convolutional neural network (CNN)-based model, and three fine-tuned DNA foundation models - HyenaDNA, Caduceus-Ph, and Caduceus-PS. We envision DNALongBench as a standardized resource with the potential to facilitate comprehensive comparisons and rigorous evaluations of emerging DNA sequence-based deep learning models that account for long-range dependencies.
Collapse
Affiliation(s)
- Wenduo Cheng
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Zhenqiao Song
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Yang Zhang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Shike Wang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Danqing Wang
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Muyu Yang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Lei Li
- Language Technologies Institute, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|
4
|
Cheng W, Shen J, Khodak M, Ma J, Talwalkar A. L2G: Repurposing Language Models for Genomics Tasks. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.12.09.627422. [PMID: 39713364 PMCID: PMC11661171 DOI: 10.1101/2024.12.09.627422] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 12/24/2024]
Abstract
Pre-trained language models have transformed the field of natural language processing (NLP), and their success has inspired efforts in genomics to develop domain-specific foundation models (FMs). However, creating high-quality genomic FMs from scratch is resource-intensive, requiring significant computational power and high-quality pre-training data. The success of large language models (LLMs) in NLP has largely been driven by industrial-scale efforts leveraging vast, diverse corpora and massive computing infrastructure. In this work, we aim to bypass the data and computational bottlenecks of creating genomic FMs from scratch and instead propose repurposing existing LLMs for genomics tasks. Inspired by the recently observed 'cross-modal transfer' phenomenon - where transformers pre-trained on natural language can generalize to other modalities - we introduce L2G, which adapts a pre-trained LLM architecture for genomics using neural architecture search (NAS) and a novel three-stage training procedure. Remarkably, without requiring extensive pre-training on DNA sequence data, L2G achieves superior performance to fine-tuned genomic FMs and task-specific models on more than half of tasks across multiple genomics benchmarks. In an enhancer activity prediction task, L2G further demonstrates its capacity to identify significant transcription factor motifs. Our work not only highlights the generalizability and efficacy of language models in out-of-domain tasks such as genomics, but also opens new avenues for more efficient and less resource-intensive methodologies in genomic research.
Collapse
Affiliation(s)
- Wenduo Cheng
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Junhong Shen
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Mikhail Khodak
- Princeton Language & Intelligence, Princeton AI Lab, Princeton University, Princeton, NJ 08544, USA
| | - Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| | - Ameet Talwalkar
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
| |
Collapse
|
5
|
Wang Y, Kong S, Zhou C, Wang Y, Zhang Y, Fang Y, Li G. A review of deep learning models for the prediction of chromatin interactions with DNA and epigenomic profiles. Brief Bioinform 2024; 26:bbae651. [PMID: 39708837 DOI: 10.1093/bib/bbae651] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2024] [Revised: 10/29/2024] [Accepted: 12/03/2024] [Indexed: 12/23/2024] Open
Abstract
Advances in three-dimensional (3D) genomics have revealed the spatial characteristics of chromatin interactions in gene expression regulation, which is crucial for understanding molecular mechanisms in biological processes. High-throughput technologies like ChIA-PET, Hi-C, and their derivatives methods have greatly enhanced our knowledge of 3D chromatin architecture. However, the chromatin interaction mechanisms remain largely unexplored. Deep learning, with its powerful feature extraction and pattern recognition capabilities, offers a promising approach for integrating multi-omics data, to build accurate predictive models of chromatin interaction matrices. This review systematically summarizes recent advances in chromatin interaction matrix prediction models. By integrating DNA sequences and epigenetic signals, we investigate the latest developments in these methods. This article details various models, focusing on how one-dimensional (1D) information transforms into the 3D structure chromatin interactions, and how the integration of different deep learning modules specifically affects model accuracy. Additionally, we discuss the critical role of DNA sequence information and epigenetic markers in shaping 3D genome interaction patterns. Finally, this review addresses the challenges in predicting chromatin interaction matrices, in order to improve the precise mapping of chromatin interaction matrices and DNA sequence, and supporting the transformation and theoretical development of 3D genomics across biological systems.
Collapse
Affiliation(s)
- Yunlong Wang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
| | - Siyuan Kong
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
| | - Cong Zhou
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
| | - Yanfang Wang
- State Key Laboratory of Animal Biotech Breeding, Institute of Animal Science, Chinese Academy of Agricultural Sciences (CAAS), No. 2 West Yuanmingyuan Rd, Haidian District, Beijing 100193, China
| | - Yubo Zhang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Key Laboratory of Livestock and Poultry Multi-omics of MARA, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, No. 97 Buxin Road, Dapeng New District, Shenzhen 518120, China
- Sequencing Facility, Frederick National Laboratory for Cancer Research, 8560 Progress Drive, Frederick, MD 21701, United States
| | - Yaping Fang
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
| | - Guoliang Li
- Agricultural Bioinformatics Key Laboratory of Hubei Province, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
- College of Informatics, Huazhong Agricultural University, No. 1 Shizishan Street, Hongshan District, Wuhan 430070, China
| |
Collapse
|
6
|
Ozcelik F, Dundar MS, Yildirim AB, Henehan G, Vicente O, Sánchez-Alcázar JA, Gokce N, Yildirim DT, Bingol NN, Karanfilska DP, Bertelli M, Pojskic L, Ercan M, Kellermayer M, Sahin IO, Greiner-Tollersrud OK, Tan B, Martin D, Marks R, Prakash S, Yakubi M, Beccari T, Lal R, Temel SG, Fournier I, Ergoren MC, Mechler A, Salzet M, Maffia M, Danalev D, Sun Q, Nei L, Matulis D, Tapaloaga D, Janecke A, Bown J, Cruz KS, Radecka I, Ozturk C, Nalbantoglu OU, Sag SO, Ko K, Arngrimsson R, Belo I, Akalin H, Dundar M. The impact and future of artificial intelligence in medical genetics and molecular medicine: an ongoing revolution. Funct Integr Genomics 2024; 24:138. [PMID: 39147901 DOI: 10.1007/s10142-024-01417-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2024] [Revised: 08/01/2024] [Accepted: 08/05/2024] [Indexed: 08/17/2024]
Abstract
Artificial intelligence (AI) platforms have emerged as pivotal tools in genetics and molecular medicine, as in many other fields. The growth in patient data, identification of new diseases and phenotypes, discovery of new intracellular pathways, availability of greater sets of omics data, and the need to continuously analyse them have led to the development of new AI platforms. AI continues to weave its way into the fabric of genetics with the potential to unlock new discoveries and enhance patient care. This technology is setting the stage for breakthroughs across various domains, including dysmorphology, rare hereditary diseases, cancers, clinical microbiomics, the investigation of zoonotic diseases, omics studies in all medical disciplines. AI's role in facilitating a deeper understanding of these areas heralds a new era of personalised medicine, where treatments and diagnoses are tailored to the individual's molecular features, offering a more precise approach to combating genetic or acquired disorders. The significance of these AI platforms is growing as they assist healthcare professionals in the diagnostic and treatment processes, marking a pivotal shift towards more informed, efficient, and effective medical practice. In this review, we will explore the range of AI tools available and show how they have become vital in various sectors of genomic research supporting clinical decisions.
Collapse
Affiliation(s)
- Firat Ozcelik
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Mehmet Sait Dundar
- Department of Electrical and Computer Engineering, Graduate School of Engineering and Sciences, Abdullah Gul University, Kayseri, Turkey
| | - A Baki Yildirim
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Gary Henehan
- School of Food Science and Environmental Health, Technological University of Dublin, Dublin, Ireland
| | - Oscar Vicente
- Institute for the Conservation and Improvement of Valencian Agrodiversity (COMAV), Universitat Politècnica de València, Valencia, Spain
| | - José A Sánchez-Alcázar
- Centro de Investigación Biomédica en Red: Enfermedades Raras, Centro Andaluz de Biología del Desarrollo (CABD-CSIC-Universidad Pablo de Olavide), Instituto de Salud Carlos III, Sevilla, Spain
| | - Nuriye Gokce
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Duygu T Yildirim
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Nurdeniz Nalbant Bingol
- Department of Translational Medicine, Institute of Health Sciences, Bursa Uludag University, Bursa, Turkey
| | - Dijana Plaseska Karanfilska
- Research Centre for Genetic Engineering and Biotechnology, Macedonian Academy of Sciences and Arts, Skopje, Macedonia
| | | | - Lejla Pojskic
- Institute for Genetic Engineering and Biotechnology, University of Sarajevo, Sarajevo, Bosnia and Herzegovina
| | - Mehmet Ercan
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Miklos Kellermayer
- Department of Biophysics and Radiation Biology, Faculty of Medicine, Semmelweis University, Budapest, Hungary
| | - Izem Olcay Sahin
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | | | - Busra Tan
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Donald Martin
- University Grenoble Alpes, CNRS, TIMC-IMAG/SyNaBi (UMR 5525), Grenoble, France
| | - Robert Marks
- Avram and Stella Goldstein-Goren Department of Biotechnology Engineering, Ben-Gurion University of the Negev, Be'er Sheva, Israel
| | - Satya Prakash
- Department of Biomedical Engineering, University of McGill, Montreal, QC, Canada
| | - Mustafa Yakubi
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey
| | - Tommaso Beccari
- Department of Pharmeceutical Sciences, University of Perugia, Perugia, Italy
| | - Ratnesh Lal
- Neuroscience Research Institute, University of California, Santa Barbara, USA
| | - Sehime G Temel
- Department of Translational Medicine, Institute of Health Sciences, Bursa Uludag University, Bursa, Turkey
- Department of Medical Genetics, Bursa Uludag University Faculty of Medicine, Bursa, Turkey
- Department of Histology and Embryology, Faculty of Medicine, Bursa Uludag University, Bursa, Turkey
| | - Isabelle Fournier
- Réponse Inflammatoire et Spectrométrie de Masse-PRISM, University of Lille, Lille, France
| | - M Cerkez Ergoren
- Department of Medical Genetics, Near East University Faculty of Medicine, Nicosia, Cyprus
| | - Adam Mechler
- Department of Chemistry, La Trobe Institute for Molecular Science, La Trobe University, Melbourne, VIC, Australia
| | - Michel Salzet
- Réponse Inflammatoire et Spectrométrie de Masse-PRISM, University of Lille, Lille, France
| | - Michele Maffia
- Department of Experimental Medicine, University of Salento, Via Lecce-Monteroni, Lecce, 73100, Italy
| | - Dancho Danalev
- University of Chemical Technology and Metallurgy, Sofia, Bulgaria
| | - Qun Sun
- Department of Food Science and Technology, Sichuan University, Chengdu, China
| | - Lembit Nei
- School of Engineering Tallinn University of Technology, Tartu College, Tartu, Estonia
| | - Daumantas Matulis
- Department of Biothermodynamics and Drug Design, Institute of Biotechnology, Life Sciences Center, Vilnius University, Vilnius, Lithuania
| | - Dana Tapaloaga
- Faculty of Veterinary Medicine, University of Agronomic Sciences and Veterinary Medicine of Bucharest, Bucharest, Romania
| | - Andres Janecke
- Department of Paediatrics I, Medical University of Innsbruck, Innsbruck, Austria
- Division of Human Genetics, Medical University of Innsbruck, Innsbruck, Austria
| | - James Bown
- School of Science, Engineering and Technology, Abertay University, Dundee, UK
| | | | - Iza Radecka
- School of Science, Faculty of Science and Engineering, University of Wolverhampton, Wolverhampton, UK
| | - Celal Ozturk
- Department of Software Engineering, Erciyes University, Kayseri, Turkey
| | - Ozkan Ufuk Nalbantoglu
- Department of Computer Engineering, Engineering Faculty, Erciyes University, Kayseri, Turkey
| | - Sebnem Ozemri Sag
- Department of Medical Genetics, Bursa Uludag University Faculty of Medicine, Bursa, Turkey
| | - Kisung Ko
- Department of Medicine, College of Medicine, Chung-Ang University, Seoul, Korea
| | - Reynir Arngrimsson
- Iceland Landspitali University Hospital, University of Iceland, Reykjavik, Iceland
| | - Isabel Belo
- Centre of Biological Engineering, University of Minho, Braga, Portugal
| | - Hilal Akalin
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey.
| | - Munis Dundar
- Department of Medical Genetics, Faculty of Medicine, Erciyes University, Kayseri, Turkey.
| |
Collapse
|
7
|
Chen V, Yang M, Cui W, Kim JS, Talwalkar A, Ma J. Applying interpretable machine learning in computational biology-pitfalls, recommendations and opportunities for new developments. Nat Methods 2024; 21:1454-1461. [PMID: 39122941 PMCID: PMC11348280 DOI: 10.1038/s41592-024-02359-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2022] [Accepted: 06/24/2024] [Indexed: 08/12/2024]
Abstract
Recent advances in machine learning have enabled the development of next-generation predictive models for complex computational biology problems, thereby spurring the use of interpretable machine learning (IML) to unveil biological insights. However, guidelines for using IML in computational biology are generally underdeveloped. We provide an overview of IML methods and evaluation techniques and discuss common pitfalls encountered when applying IML methods to computational biology problems. We also highlight open questions, especially in the era of large language models, and call for collaboration between IML and computational biology researchers.
Collapse
Affiliation(s)
- Valerie Chen
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Muyu Yang
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Wenbo Cui
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Joon Sik Kim
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Ameet Talwalkar
- Machine Learning Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
| | - Jian Ma
- Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
| |
Collapse
|
8
|
Xu J, Xu X, Huang D, Luo Y, Lin L, Bai X, Zheng Y, Yang Q, Cheng Y, Huang A, Shi J, Bo X, Gu J, Chen H. A comprehensive benchmarking with interpretation and operational guidance for the hierarchy of topologically associating domains. Nat Commun 2024; 15:4376. [PMID: 38782890 PMCID: PMC11116433 DOI: 10.1038/s41467-024-48593-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2023] [Accepted: 05/03/2024] [Indexed: 05/25/2024] Open
Abstract
Topologically associating domains (TADs), megabase-scale features of chromatin spatial architecture, are organized in a domain-within-domain TAD hierarchy. Within TADs, the inner and smaller subTADs not only manifest cell-to-cell variability, but also precisely regulate transcription and differentiation. Although over 20 TAD callers are able to detect TAD, their usability in biomedicine is confined by a disagreement of outputs and a limit in understanding TAD hierarchy. We compare 13 computational tools across various conditions and develop a metric to evaluate the similarity of TAD hierarchy. Although outputs of TAD hierarchy at each level vary among callers, data resolutions, sequencing depths, and matrices normalization, they are more consistent when they have a higher similarity of larger TADs. We present comprehensive benchmarking of TAD hierarchy callers and operational guidance to researchers of life science researchers. Moreover, by simulating the mixing of different types of cells, we confirm that TAD hierarchy is generated not simply from stacking Hi-C heatmaps of heterogeneous cells. Finally, we propose an air conditioner model to decipher the role of TAD hierarchy in transcription.
Collapse
Affiliation(s)
- Jingxuan Xu
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
| | - Xiang Xu
- Academy of Military Medical Science, Beijing, 100850, China
| | - Dandan Huang
- Department of Oncology, Peking University Shougang Hospital, Beijing, China
- Center for Precision Diagnosis and Treatment of Colorectal Cancer and Inflammatory Diseases, Peking University Health Science Center, Beijing, China
| | - Yawen Luo
- Academy of Military Medical Science, Beijing, 100850, China
| | - Lin Lin
- Academy of Military Medical Science, Beijing, 100850, China
- School of Computer Science and Information Technology& KLAS, Northeast Normal University, Changchun, China
| | - Xuemei Bai
- Academy of Military Medical Science, Beijing, 100850, China
| | - Yang Zheng
- Academy of Military Medical Science, Beijing, 100850, China
| | - Qian Yang
- Academy of Military Medical Science, Beijing, 100850, China
| | - Yu Cheng
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
| | - An Huang
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
| | - Jingyi Shi
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China
| | - Xiaochen Bo
- Academy of Military Medical Science, Beijing, 100850, China.
| | - Jin Gu
- Key Laboratory of Carcinogenesis and Translational Research (Ministry of Education/Beijing), Department of Gastrointestinal Surgery, Peking University Cancer Hospital & Institute, Beijing, 100142, China.
- Department of Oncology, Peking University Shougang Hospital, Beijing, China.
- Center for Precision Diagnosis and Treatment of Colorectal Cancer and Inflammatory Diseases, Peking University Health Science Center, Beijing, China.
- Peking-Tsinghua Center for Life Sciences, Peking University, Beijing, China.
- Peking University International Cancer Institute, Beijing, China.
| | - Hebing Chen
- Academy of Military Medical Science, Beijing, 100850, China.
| |
Collapse
|
9
|
Zhang Y, Boninsegna L, Yang M, Misteli T, Alber F, Ma J. Computational methods for analysing multiscale 3D genome organization. Nat Rev Genet 2024; 25:123-141. [PMID: 37673975 PMCID: PMC11127719 DOI: 10.1038/s41576-023-00638-1] [Citation(s) in RCA: 23] [Impact Index Per Article: 23.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 07/12/2023] [Indexed: 09/08/2023]
Abstract
Recent progress in whole-genome mapping and imaging technologies has enabled the characterization of the spatial organization and folding of the genome in the nucleus. In parallel, advanced computational methods have been developed to leverage these mapping data to reveal multiscale three-dimensional (3D) genome features and to provide a more complete view of genome structure and its connections to genome functions such as transcription. Here, we discuss how recently developed computational tools, including machine-learning-based methods and integrative structure-modelling frameworks, have led to a systematic, multiscale delineation of the connections among different scales of 3D genome organization, genomic and epigenomic features, functional nuclear components and genome function. However, approaches that more comprehensively integrate a wide variety of genomic and imaging datasets are still needed to uncover the functional role of 3D genome structure in defining cellular phenotypes in health and disease.
Collapse
Affiliation(s)
- Yang Zhang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Lorenzo Boninsegna
- Department of Microbiology, Immunology and Molecular Genetics and Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, CA, USA
| | - Muyu Yang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
| | - Tom Misteli
- Center for Cancer Research, National Cancer Institute, Bethesda, MD, USA.
| | - Frank Alber
- Department of Microbiology, Immunology and Molecular Genetics and Institute for Quantitative and Computational Biosciences, University of California Los Angeles, Los Angeles, CA, USA.
| | - Jian Ma
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA.
| |
Collapse
|
10
|
Chen LF, Long HK. Topology regulatory elements: From shaping genome architecture to gene regulation. Curr Opin Struct Biol 2023; 83:102723. [PMID: 37931379 PMCID: PMC7615376 DOI: 10.1016/j.sbi.2023.102723] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 09/15/2023] [Accepted: 09/28/2023] [Indexed: 11/08/2023]
Abstract
The importance of 3D genome topology in the control of gene expression is becoming increasingly apparent, while regulatory mechanisms remain incompletely understood. Several recent studies have identified architectural elements that influence developmental gene expression by shaping locus topology. We refer to these elements as topological regulatory elements (TopoREs) to reflect their dual roles in genome organisation and gene expression. Importantly, these elements do not harbour autonomous transcriptional activation capacity, and instead appear to facilitate enhancer-promoter interactions, contributing to robust and precise timing of transcription. We discuss examples of TopoREs from two classes that are either dependent or independent of CTCF binding. Importantly, identification and interpretation of TopoRE function may shed light on multiple aspects of gene regulation, including the relationship between enhancer-promoter proximity and transcription, and enhancer-promoter specificity. Ultimately, understanding TopoRE diversity and function will aid in the interpretation of how human sequence variation can impact transcription and contribute to disease phenotypes.
Collapse
Affiliation(s)
- Liang-Fu Chen
- Department of Chemical and Systems Biology, Stanford University School of Medicine, Stanford, CA 94305, USA
| | - Hannah Katherine Long
- MRC Human Genetics Unit, Institute of Genetics and Cancer, University of Edinburgh, Crewe Road, Edinburgh, UK.
| |
Collapse
|
11
|
Toussaint PA, Leiser F, Thiebes S, Schlesner M, Brors B, Sunyaev A. Explainable artificial intelligence for omics data: a systematic mapping study. Brief Bioinform 2023; 25:bbad453. [PMID: 38113073 PMCID: PMC10729786 DOI: 10.1093/bib/bbad453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/23/2022] [Revised: 07/28/2023] [Accepted: 11/08/2023] [Indexed: 12/21/2023] Open
Abstract
Researchers increasingly turn to explainable artificial intelligence (XAI) to analyze omics data and gain insights into the underlying biological processes. Yet, given the interdisciplinary nature of the field, many findings have only been shared in their respective research community. An overview of XAI for omics data is needed to highlight promising approaches and help detect common issues. Toward this end, we conducted a systematic mapping study. To identify relevant literature, we queried Scopus, PubMed, Web of Science, BioRxiv, MedRxiv and arXiv. Based on keywording, we developed a coding scheme with 10 facets regarding the studies' AI methods, explainability methods and omics data. Our mapping study resulted in 405 included papers published between 2010 and 2023. The inspected papers analyze DNA-based (mostly genomic), transcriptomic, proteomic or metabolomic data by means of neural networks, tree-based methods, statistical methods and further AI methods. The preferred post-hoc explainability methods are feature relevance (n = 166) and visual explanation (n = 52), while papers using interpretable approaches often resort to the use of transparent models (n = 83) or architecture modifications (n = 72). With many research gaps still apparent for XAI for omics data, we deduced eight research directions and discuss their potential for the field. We also provide exemplary research questions for each direction. Many problems with the adoption of XAI for omics data in clinical practice are yet to be resolved. This systematic mapping study outlines extant research on the topic and provides research directions for researchers and practitioners.
Collapse
Affiliation(s)
- Philipp A Toussaint
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
- HIDSS4Health – Helmholtz Information and Data Science School for Health, Karlsruhe, Heidelberg, Germany
| | - Florian Leiser
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Scott Thiebes
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
| | - Matthias Schlesner
- Biomedical Informatics, Data Mining and Data Analytics, Faculty of Applied Computer Science and Medical Faculty, University of Augsburg, Augsburg, Germany
| | - Benedikt Brors
- Division of Applied Bioinformatics, German Cancer Research Center (DKFZ), Heidelberg, Germany
- Translational Oncology, National Center for Tumor Diseases, German Cancer Research Center (DKFZ), Heidelberg, Germany
| | - Ali Sunyaev
- Department of Economics and Management, Karlsruhe Institute of Technology, Karlsruhe, Germany
| |
Collapse
|
12
|
Gunsalus LM, Keiser MJ, Pollard KS. In silico discovery of repetitive elements as key sequence determinants of 3D genome folding. CELL GENOMICS 2023; 3:100410. [PMID: 37868032 PMCID: PMC10589630 DOI: 10.1016/j.xgen.2023.100410] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 10/12/2022] [Revised: 11/08/2022] [Accepted: 08/31/2023] [Indexed: 10/24/2023]
Abstract
Natural and experimental genetic variants can modify DNA loops and insulating boundaries to tune transcription, but it is unknown how sequence perturbations affect chromatin organization genome wide. We developed a deep-learning strategy to quantify the effect of any insertion, deletion, or substitution on chromatin contacts and systematically scored millions of synthetic variants. While most genetic manipulations have little impact, regions with CTCF motifs and active transcription are highly sensitive, as expected. Our unbiased screen and subsequent targeted experiments also point to noncoding RNA genes and several families of repetitive elements as CTCF-motif-free DNA sequences with particularly large effects on nearby chromatin interactions, sometimes exceeding the effects of CTCF sites and explaining interactions that lack CTCF. We anticipate that our disruption tracks may be of broad interest and utility as a measure of 3D genome sensitivity, and our computational strategies may serve as a template for biological inquiry with deep learning.
Collapse
Affiliation(s)
- Laura M. Gunsalus
- Gladstone Institutes, San Francisco, CA, USA
- Institute for Neurodegenerative Diseases, University of California, San Francisco, San Francisco, CA, USA
| | - Michael J. Keiser
- Institute for Neurodegenerative Diseases, University of California, San Francisco, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Department of Bioengineering and Therapeutic Sciences, University of California, San Francisco, San Francisco, CA, USA
- Kavli Institute for Fundamental Neuroscience, University of California, San Francisco, San Francisco, CA, USA
- Department of Pharmaceutical Chemistry, University of California, San Francisco, San Francisco, CA, USA
| | - Katherine S. Pollard
- Gladstone Institutes, San Francisco, CA, USA
- Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, CA, USA
- Chan Zuckerberg Biohub, San Francisco, CA, USA
- Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA
| |
Collapse
|
13
|
Yang M, Ma J. UNADON: transformer-based model to predict genome-wide chromosome spatial position. Bioinformatics 2023; 39:i553-i562. [PMID: 37387176 PMCID: PMC10311299 DOI: 10.1093/bioinformatics/btad246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The spatial positioning of chromosomes relative to functional nuclear bodies is intertwined with genome functions such as transcription. However, the sequence patterns and epigenomic features that collectively influence chromatin spatial positioning in a genome-wide manner are not well understood. RESULTS Here, we develop a new transformer-based deep learning model called UNADON, which predicts the genome-wide cytological distance to a specific type of nuclear body, as measured by TSA-seq, using both sequence features and epigenomic signals. Evaluations of UNADON in four cell lines (K562, H1, HFFc6, HCT116) show high accuracy in predicting chromatin spatial positioning to nuclear bodies when trained on a single cell line. UNADON also performed well in an unseen cell type. Importantly, we reveal potential sequence and epigenomic factors that affect large-scale chromatin compartmentalization in nuclear bodies. Together, UNADON provides new insights into the principles between sequence features and large-scale chromatin spatial localization, which has important implications for understanding nuclear structure and function. AVAILABILITY AND IMPLEMENTATION The source code of UNADON can be found at https://github.com/ma-compbio/UNADON.
Collapse
Affiliation(s)
- Muyu Yang
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213, USA
| | - Jian Ma
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh PA 15213, USA
| |
Collapse
|
14
|
Shokraneh N, Arab M, Libbrecht M. Integrative chromatin domain annotation through graph embedding of Hi-C data. Bioinformatics 2022; 39:6935783. [PMID: 36534827 PMCID: PMC9848054 DOI: 10.1093/bioinformatics/btac813] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/11/2022] [Revised: 11/02/2022] [Accepted: 12/16/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The organization of the genome into domains plays a central role in gene expression and other cellular activities. Researchers identify genomic domains mainly through two views: 1D functional assays such as ChIP-seq, and chromatin conformation assays such as Hi-C. Fully understanding domains requires integrative modeling that combines these two views. However, the predominant form of integrative modeling uses segmentation and genome annotation (SAGA) along with the rigid assumption that loci in contact are more likely to share the same domain type, which is not necessarily true for epigenomic domain types and genome-wide chromatin interactions. RESULTS Here, we present an integrative approach that annotates domains using both 1D functional genomic signals and Hi-C measurements of genome-wide 3D interactions without the use of a pairwise prior. We do so by using a graph embedding to learn structural features corresponding to each genomic region, then inputting learned structural features along with functional genomic signals to a SAGA algorithm. We show that our domain types recapitulate well-known subcompartments with an additional granularity that distinguishes a combination of the spatial and functional states of the genomic regions. In particular, we identified a division of the previously identified A2 subcompartment such that the divided domain types have significantly varying expression levels. AVAILABILITY AND IMPLEMENTATION https://github.com/nedashokraneh/IChDA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Neda Shokraneh
- Computing Science Department, Simon Fraser University, Burnaby V5A 1S6, Canada
| | - Mariam Arab
- Computing Science Department, Simon Fraser University, Burnaby V5A 1S6, Canada
| | | |
Collapse
|
15
|
Editorial: Artificial Intelligence, machine learning and the changing landscape of molecular biology. J Mol Biol 2022; 434:167712. [PMID: 35777464 DOI: 10.1016/j.jmb.2022.167712] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/17/2022]
|