1
Chen SF, Steele RJ, Hocky GM, Lemeneh B, Lad SP, Oermann EK. Large-Scale Multi-omic Biosequence Transformers for Modeling Protein-Nucleic Acid Interactions. arXiv 2025:arXiv:2408.16245v3. [PMID: 40236839 PMCID: PMC11998858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/17/2025]
Abstract
The transformer architecture has revolutionized bioinformatics and driven progress in the understanding and prediction of the properties of biomolecules. Almost all research on large-scale biosequence transformers has focused on one domain at a time (single-omic), usually DNA/RNA or proteins. These models have seen incredible success in downstream tasks in each domain, and have achieved particularly noteworthy breakthroughs in sequence modeling and structural modeling. However, these single-omic models are naturally incapable of efficiently modeling multi-omic tasks, one of the most biologically critical being protein-nucleic acid interactions. We present our work training the largest open-source multi-omic foundation model to date. We show that these multi-omic models (MOMs) can learn joint representations between various single-omic distributions that are emergently consistent with the Central Dogma of molecular biology despite only being trained on unlabeled biosequences. We further demonstrate that MOMs can be fine-tuned to achieve state-of-the-art results on protein-nucleic acid interaction tasks, namely predicting the change in Gibbs free energy (ΔG) of the binding interaction between a given nucleic acid and protein. Remarkably, we show that multi-omic biosequence transformers emergently learn useful structural information without any a priori structural training, allowing us to predict which protein residues are most involved in the protein-nucleic acid binding interaction. Lastly, we provide evidence that multi-omic biosequence models are in many cases superior to foundation models trained on single-omics distributions, both in performance-per-FLOP and absolute performance, suggesting a more generalized or foundational approach to building these models for biology.
Affiliation(s)
- Sully F Chen
- Duke University School of Medicine, Durham, NC 27710, USA
- Glen M Hocky
- Department of Chemistry and Simons Center for Computational Physical Chemistry, New York University, New York, NY 10012, USA
- Shivanand P Lad
- Duke University School of Medicine, Department of Neurological Surgery, Durham, NC 27710, USA
- Eric K Oermann
- NYU Langone Health, Department of Neurological Surgery, New York, NY 10016, USA
2
Zhou J, Luo C, Liu H, Heffel MG, Straub RE, Kleinman JE, Hyde TM, Ecker JR, Weinberger DR, Han S. Deep learning imputes DNA methylation states in single cells and enhances the detection of epigenetic alterations in schizophrenia. Cell Genomics 2025; 5:100774. [PMID: 39986279 PMCID: PMC11960545 DOI: 10.1016/j.xgen.2025.100774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Revised: 11/13/2024] [Accepted: 01/24/2025] [Indexed: 02/24/2025]
Abstract
DNA methylation (DNAm) is a key epigenetic mark with essential roles in gene regulation, mammalian development, and human diseases. Single-cell technologies enable profiling DNAm at cytosines in individual cells, but they often suffer from low coverage for CpG sites. We introduce scMeFormer, a transformer-based deep learning model for imputing DNAm states at each CpG site in single cells. Comprehensive evaluations across five single-nucleus DNAm datasets from human and mouse demonstrate scMeFormer's superior performance over alternative models, achieving high-fidelity imputation even with coverage reduced to 10% of original CpG sites. Applying scMeFormer to a single-nucleus DNAm dataset from the prefrontal cortex of patients with schizophrenia and controls identified thousands of schizophrenia-associated differentially methylated regions that would have remained undetectable without imputation and added granularity to our understanding of epigenetic alterations in schizophrenia. We anticipate that scMeFormer will be a valuable tool for advancing single-cell DNAm studies.
Affiliation(s)
- Jiyun Zhou
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21287, USA
- Chongyuan Luo
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Hanqing Liu
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Matthew G Heffel
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA 90095, USA; Bioinformatics Interdepartmental Program, University of California, Los Angeles, Los Angeles, CA 90095, USA
- Richard E Straub
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21287, USA
- Joel E Kleinman
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21287, USA; Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA
- Thomas M Hyde
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21287, USA; Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA; Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Joseph R Ecker
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA; Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA 92037, USA
- Daniel R Weinberger
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21287, USA; Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA; Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA; Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA; Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Shizhong Han
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD 21287, USA; Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD 21287, USA; Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA.
3
Luo S, Peng H, Shi Y, Cai J, Zhang S, Shao N, Li J. Integration of proteomics profiling data to facilitate discovery of cancer neoantigens: a survey. Brief Bioinform 2025; 26:bbaf087. [PMID: 40052441 PMCID: PMC11886573 DOI: 10.1093/bib/bbaf087] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/17/2024] [Revised: 12/29/2024] [Accepted: 02/19/2025] [Indexed: 03/10/2025]
Abstract
Cancer neoantigens are peptides that originate from alterations in the genome, transcriptome, or proteome. These peptides can elicit cancer-specific T-cell recognition, making them potential candidates for cancer vaccines. The rapid advancement of proteomics technology holds tremendous potential for identifying these neoantigens. Here, we provided an up-to-date survey about database-based search methods and de novo peptide sequencing approaches in proteomics, and we also compared these methods to recommend reliable analytical tools for neoantigen identification. Unlike previous surveys on mass spectrometry-based neoantigen discovery, this survey summarizes the key advancements in de novo peptide sequencing approaches that utilize artificial intelligence. From a comparative study on a dataset of the HepG2 cell line and nine mixed hepatocellular carcinoma proteomics samples, we demonstrated the potential of proteomics for the identification of cancer neoantigens and conducted comparisons of the existing methods to illustrate their limits. Understanding these limits, we suggested a novel workflow for neoantigen discovery as perspectives.
Affiliation(s)
- Shifu Luo
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518107, Guangdong, China
- Faculty of Health Sciences, University of Macau, Taipa, Macao SAR 999078, China
- Hui Peng
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518107, Guangdong, China
- School of Biological Sciences, Nanyang Technological University, 60 Nanyang Drive, Singapore
- Ying Shi
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518107, Guangdong, China
- School of Computer and Information Technology, Shanxi University, Taiyuan, 030006, Shanxi, China
- Jiaxin Cai
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518107, Guangdong, China
- Songming Zhang
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518107, Guangdong, China
- Ningyi Shao
- Faculty of Health Sciences, University of Macau, Taipa, Macao SAR 999078, China
- Jinyan Li
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518107, Guangdong, China
4
Zeng R, Li Z, Li J, Zhang Q. DNA promoter task-oriented dictionary mining and prediction model based on natural language technology. Sci Rep 2025; 15:153. [PMID: 39747934 PMCID: PMC11697570 DOI: 10.1038/s41598-024-84105-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/07/2024] [Accepted: 12/19/2024] [Indexed: 01/04/2025]
Abstract
Promoters are essential DNA sequences that initiate transcription and regulate gene expression. Precisely identifying promoter sites is crucial for deciphering gene expression patterns and the roles of gene regulatory networks. Recent advancements in bioinformatics have leveraged deep learning and natural language processing (NLP) to enhance promoter prediction accuracy. Techniques such as convolutional neural networks (CNNs), long short-term memory (LSTM) networks, and BERT models have been particularly impactful. However, current approaches often rely on arbitrary DNA sequence segmentation during BERT pre-training, which may not yield optimal results. To overcome this limitation, this article introduces a novel DNA sequence segmentation method. This approach develops a more refined dictionary for DNA sequences, utilizes it for BERT pre-training, and employs an Inception neural network as the foundational model. This BERT-Inception architecture captures information across multiple granularities. Experimental results show that the model improves the performance of several downstream tasks and introduces deep learning interpretability, providing new perspectives for interpreting and understanding DNA sequence information. The detailed source code is available at https://github.com/katouMegumiH/Promoter_BERT .
Affiliation(s)
- Ruolei Zeng
- Department of Computer Science and Engineering, University of Minnesota, Minneapolis, MN, 55455, USA
- Zihan Li
- National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.
- Jialu Li
- National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China
- Qingchuan Zhang
- National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, No.11 Fucheng Road, Beijing, 100048, China.
5
Park S, To Chong K, Tayara H. CpGFuse: a holistic approach for accurate identification of methylation states of DNA CpG sites. Brief Bioinform 2024; 26:bbaf063. [PMID: 39968737 PMCID: PMC11836533 DOI: 10.1093/bib/bbaf063] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2024] [Revised: 12/27/2024] [Accepted: 02/07/2025] [Indexed: 02/20/2025]
Abstract
Anomalous DNA methylation has wide-ranging implications, spanning from neurological disorders to cancer and cardiovascular complications. Current methods for single-cell DNA methylation analysis face limitations in coverage, leading to information loss and hampering our understanding of disease associations. The primary goal of this study is the imputation of CpG site methylation states in a given cell by leveraging the CpG states of other cells of the same type. To address this, we introduce CpGFuse, a novel methodology that combines information from diverse genomic features. Leveraging two benchmark datasets, we employed a careful preprocessing approach and conducted a comprehensive ablation study to assess the individual and collective contributions of DNA sequence, intercellular, and intracellular features. Our proposed model, CpGFuse, employs a convolutional neural network with an attention mechanism, surpassing existing models across HCCs and HepG2 datasets. The results highlight the effectiveness of our approach in enhancing accuracy and providing a robust tool for CpG site prediction in genomics. CpGFuse's success underscores the importance of integrating multiple genomic features for accurate identification of methylation states of CpG site.
Affiliation(s)
- Sehi Park
- Department of Electronics and Information Engineering, Jeonbuk National University, Baekje-daero, Deokjin-gu, Jeonju 54896, South Korea
- Kil To Chong
- Department of Electronics and Information Engineering, Jeonbuk National University, Baekje-daero, Deokjin-gu, Jeonju 54896, South Korea
- Advances Electronics and Information Research Center, Jeonbuk National University, Baekje-daero, Deokjin-gu, Jeonju 54896, South Korea
- Hilal Tayara
- School of international Engineering and Science, Jeonbuk National University, Baekje-daero, Deokjin-gu, Jeonju 54896, South Korea
6
Zhou J, Luo C, Liu H, Heffel MG, Straub RE, Kleinman JE, Hyde TM, Ecker JR, Weinberger DR, Han S. scMeFormer: a transformer-based deep learning model for imputing DNA methylation states in single cells enhances the detection of epigenetic alterations in schizophrenia. bioRxiv 2024:2024.01.25.577200. [PMID: 38328094 PMCID: PMC10849713 DOI: 10.1101/2024.01.25.577200] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/09/2024]
Abstract
DNA methylation (DNAm), a crucial epigenetic mark, plays a key role in gene regulation, mammalian development, and various human diseases. Single-cell technologies enable the profiling of DNAm states at cytosines within the DNA sequence of individual cells, but they often suffer from limited coverage of CpG sites. In this study, we introduce scMeFormer, a transformer-based deep learning model designed to impute DNAm states for each CpG site in single cells. Through comprehensive evaluations, we demonstrate the superior performance of scMeFormer compared to alternative models across four single-nucleus DNAm datasets generated by distinct technologies. Remarkably, scMeFormer exhibits high-fidelity imputation, even when dealing with significantly reduced coverage, as low as 10% of the original CpG sites. Furthermore, we applied scMeFormer to a single-nucleus DNAm dataset generated from the prefrontal cortex of four schizophrenia patients and four neurotypical controls. This enabled the identification of thousands of differentially methylated regions associated with schizophrenia that would have remained undetectable without imputation and added granularity to our understanding of epigenetic alterations in schizophrenia within specific cell types. Our study highlights the power of deep learning in imputing DNAm states in single cells, and we expect scMeFormer to be a valuable tool for single-cell DNAm studies.
Affiliation(s)
- Jiyun Zhou
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Chongyuan Luo
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Hanqing Liu
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA
- Matthew G. Heffel
- Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, 90095, USA
- Bioinformatics Interdepartmental Program, University of California Los Angeles, Los Angeles, CA 90095, USA
- Richard E. Straub
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Joel E. Kleinman
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
- Thomas M. Hyde
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
- Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Joseph R. Ecker
- Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA
- Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, 92037, USA
- Daniel R. Weinberger
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
- Department of Neuroscience, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Department of Neurology, Johns Hopkins University School of Medicine, Baltimore, MD, 21205, USA
- Shizhong Han
- Lieber Institute for Brain Development, Johns Hopkins Medical Campus, Baltimore, MD, 21287, USA
- Department of Psychiatry and Behavioral Sciences, Johns Hopkins University School of Medicine, Baltimore, MD, 21287, USA
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, MD 21205, USA
7
Qu S, Gong M, Deng Y, Xiang Y, Ye D. Research progress and application of single-cell sequencing in head and neck malignant tumors. Cancer Gene Ther 2024; 31:18-27. [PMID: 37968342 PMCID: PMC10794142 DOI: 10.1038/s41417-023-00691-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/06/2023] [Revised: 10/19/2023] [Accepted: 11/03/2023] [Indexed: 11/17/2023]
Abstract
Single-cell sequencing (SCS) is a technology that separates thousands of cells from the organism and accurately analyzes the genetic material expressed in each cell using high-throughput sequencing technology. Unlike the traditional bulk sequencing approach, which can only provide the average value of a cell population and cannot obtain specific single-cell data, single-cell sequencing can identify the gene sequence and expression changes of a single cell, and it reflects the differences in genetic material and protein between cells and, ultimately, the role played by the tumor microenvironment. Single-cell sequencing can further explore the pathogenesis of head and neck malignancies at the single-cell biological level and provides a theoretical basis for the clinical diagnosis and treatment of head and neck malignancies. This article will systematically introduce the latest progress and application of single-cell sequencing in malignant head and neck tumors.
Affiliation(s)
- Siyuan Qu
- Department of Otorhinolaryngology-Head and Neck Surgery, The Affiliated Lihuili Hospital, Ningbo University, Ningbo, 315040, Zhejiang, China
- Mengdan Gong
- Department of Otorhinolaryngology-Head and Neck Surgery, The Affiliated Lihuili Hospital, Ningbo University, Ningbo, 315040, Zhejiang, China
- Yongqin Deng
- Department of Otorhinolaryngology-Head and Neck Surgery, The Affiliated Lihuili Hospital, Ningbo University, Ningbo, 315040, Zhejiang, China
- Yizhen Xiang
- Department of Otorhinolaryngology-Head and Neck Surgery, The Affiliated Lihuili Hospital, Ningbo University, Ningbo, 315040, Zhejiang, China
- Dong Ye
- Department of Otorhinolaryngology-Head and Neck Surgery, The Affiliated Lihuili Hospital, Ningbo University, Ningbo, 315040, Zhejiang, China.
8
Huang L, Song M, Shen H, Hong H, Gong P, Deng HW, Zhang C. Deep Learning Methods for Omics Data Imputation. Biology 2023; 12:1313. [PMID: 37887023 PMCID: PMC10604785 DOI: 10.3390/biology12101313] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 08/08/2023] [Revised: 09/28/2023] [Accepted: 10/02/2023] [Indexed: 10/28/2023]
Abstract
One common problem in omics data analysis is missing values, which can arise due to various reasons, such as poor tissue quality and insufficient sample volumes. Instead of discarding missing values and related data, imputation approaches offer an alternative means of handling missing data. However, the imputation of missing omics data is a non-trivial task. Difficulties mainly come from high dimensionality, non-linear or non-monotonic relationships within features, technical variations introduced by sampling methods, sample heterogeneity, and the non-random missingness mechanism. Several advanced imputation methods, including deep learning-based methods, have been proposed to address these challenges. Due to its capability of modeling complex patterns and relationships in large and high-dimensional datasets, many researchers have adopted deep learning models to impute missing omics data. This review provides a comprehensive overview of the currently available deep learning-based methods for omics imputation from the perspective of deep generative model architectures such as autoencoder, variational autoencoder, generative adversarial networks, and Transformer, with an emphasis on multi-omics data imputation. In addition, this review also discusses the opportunities that deep learning brings and the challenges that it might face in this field.
Affiliation(s)
- Lei Huang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Meng Song
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
- Hui Shen
- Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Huixiao Hong
- Division of Bioinformatics and Biostatistics, National Center for Toxicological Research, U.S. Food and Drug Administration, Jefferson, AR 72079, USA
- Ping Gong
- Environmental Laboratory, U.S. Army Engineer Research and Development Center, Vicksburg, MS 39180, USA
- Hong-Wen Deng
- Center for Biomedical Informatics and Genomics, School of Medicine, Tulane University, New Orleans, LA 70112, USA
- Chaoyang Zhang
- School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA
9
Yassi M, Chatterjee A, Parry M. Application of deep learning in cancer epigenetics through DNA methylation analysis. Brief Bioinform 2023; 24:bbad411. [PMID: 37985455 PMCID: PMC10661960 DOI: 10.1093/bib/bbad411] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/22/2023] [Revised: 10/08/2023] [Accepted: 10/25/2023] [Indexed: 11/22/2023]
Abstract
DNA methylation is a fundamental epigenetic modification involved in various biological processes and diseases. Analysis of DNA methylation data at a genome-wide and high-throughput level can provide insights into diseases influenced by epigenetics, such as cancer. Recent technological advances have led to the development of high-throughput approaches, such as genome-scale profiling, that allow for computational analysis of epigenetics. Deep learning (DL) methods are essential in facilitating computational studies in epigenetics for DNA methylation analysis. In this systematic review, we assessed the various applications of DL applied to DNA methylation data or multi-omics data to discover cancer biomarkers, perform classification, imputation and survival analysis. The review first introduces state-of-the-art DL architectures and highlights their usefulness in addressing challenges related to cancer epigenetics. Finally, the review discusses potential limitations and future research directions in this field.
Affiliation(s)
- Maryam Yassi
- Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand
- Department of Pathology, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
- Aniruddha Chatterjee
- Department of Pathology, Dunedin School of Medicine, University of Otago, Dunedin, New Zealand
- Honorary Professor, UPES University, Dehradun, India
- Matthew Parry
- Department of Mathematics and Statistics, University of Otago, Dunedin, New Zealand
- Te Pūnaha Matatini Centre of Research Excellence, University of Auckland, Auckland, New Zealand
10
Deng Y, Tang J, Zhang J, Zou J, Zhu Q, Fan S. GraphCpG: imputation of single-cell methylomes based on locus-aware neighboring subgraphs. Bioinformatics 2023; 39:btad533. [PMID: 37647650 PMCID: PMC10516632 DOI: 10.1093/bioinformatics/btad533] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2023] [Revised: 07/24/2023] [Accepted: 08/28/2023] [Indexed: 09/01/2023]
Abstract
MOTIVATION: Single-cell DNA methylation sequencing can assay DNA methylation at single-cell resolution. However, incomplete coverage compromises related downstream analyses, outlining the importance of imputation techniques. With a rising number of cell samples in recent large datasets, scalable and efficient imputation models are critical to addressing the sparsity for genome-wide analyses. RESULTS: We proposed a novel graph-based deep learning approach to impute methylation matrices based on locus-aware neighboring subgraphs with locus-aware encoding orienting on one cell type. Merely using the CpGs methylation matrix, the obtained GraphCpG outperforms previous methods on datasets containing more than hundreds of cells and achieves competitive performance on smaller datasets, with subgraphs of predicted sites visualized by retrievable bipartite graphs. Besides better imputation performance with increasing cell number, it significantly reduces computation time and demonstrates improvement in downstream analysis. AVAILABILITY AND IMPLEMENTATION: The source code is freely available at https://github.com/yuzhong-deng/graphcpg.git.
Affiliation(s)
- Yuzhong Deng
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Jianxiong Tang
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Jiyang Zhang
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Jianxiao Zou
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, Guangdong, China
- Que Zhu
- Department of Out-patient, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China
- Shicai Fan
- School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu 611731, Sichuan, China
- Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen 518110, Guangdong, China
11
Wichmann A, Buschong E, Müller A, Jünger D, Hildebrandt A, Hankeln T, Schmidt B. MetaTransformer: deep metagenomic sequencing read classification using self-attention models. NAR Genom Bioinform 2023; 5:lqad082. [PMID: 37705831 PMCID: PMC10495543 DOI: 10.1093/nargab/lqad082] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2023] [Revised: 07/14/2023] [Accepted: 08/30/2023] [Indexed: 09/15/2023]
Abstract
Deep learning has emerged as a paradigm that revolutionizes numerous domains of scientific research. Transformers have been utilized in language modeling outperforming previous approaches. Therefore, the utilization of deep learning as a tool for analyzing the genomic sequences is promising, yielding convincing results in fields such as motif identification and variant calling. DeepMicrobes, a machine learning-based classifier, has recently been introduced for taxonomic prediction at species and genus level. However, it relies on complex models based on bidirectional long short-term memory cells resulting in slow runtimes and excessive memory requirements, hampering its effective usability. We present MetaTransformer, a self-attention-based deep learning metagenomic analysis tool. Our transformer-encoder-based models enable efficient parallelization while outperforming DeepMicrobes in terms of species and genus classification abilities. Furthermore, we investigate approaches to reduce memory consumption and boost performance using different embedding schemes. As a result, we are able to achieve 2× to 5× speedup for inference compared to DeepMicrobes while keeping a significantly smaller memory footprint. MetaTransformer can be trained in 9 hours for genus and 16 hours for species prediction. Our results demonstrate performance improvements due to self-attention models and the impact of embedding schemes in deep learning on metagenomic sequencing data.
Affiliation(s)
- Alexander Wichmann
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
- Etienne Buschong
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
- André Müller
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
- Daniel Jünger
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Andreas Hildebrandt
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Thomas Hankeln
- Institute of Organic and Molecular Evolution (iomE), Johannes Gutenberg University, J.-J. Becher-Weg 30A, 55128 Mainz, Rhineland-Palatinate, Germany
| | - Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University, Staudingerweg 9, 55128 Mainz, Rhineland-Palatinate, Germany
| |
Collapse
12
Sereshki S, Lee N, Omirou M, Fasoula D, Lonardi S. On the prediction of non-CG DNA methylation using machine learning. NAR Genom Bioinform 2023; 5:lqad045. [PMID: 37206627 PMCID: PMC10189801 DOI: 10.1093/nargab/lqad045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/09/2023] [Revised: 04/06/2023] [Accepted: 05/05/2023] [Indexed: 05/21/2023] Open
Abstract
DNA methylation can be detected and measured using sequencing instruments after sodium bisulfite conversion, but experiments can be expensive for large eukaryotic genomes. Sequencing nonuniformity and mapping biases can leave parts of the genome with low or no coverage, thus hampering the ability to obtain DNA methylation levels for all cytosines. To address these limitations, several computational methods have been proposed that can predict DNA methylation from the DNA sequence around the cytosine or from the methylation level of nearby cytosines. However, most of these methods are entirely focused on CG methylation in humans and other mammals. In this work, we study, for the first time, the problem of predicting cytosine methylation for CG, CHG and CHH contexts on six plant species, either from the DNA primary sequence around the cytosine or from the methylation levels of neighboring cytosines. In this framework, we also study the cross-species prediction problem and the cross-context prediction problem (within the same species). Finally, we show that providing gene and repeat annotations allows existing classifiers to significantly improve their prediction accuracy. We introduce a new classifier called AMPS (annotation-based methylation prediction from sequence) that takes advantage of genomic annotations to achieve higher accuracy.
Affiliation(s)
- Saleh Sereshki
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Nathan Lee
  - Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA
- Michalis Omirou
  - Department of Agrobiotechnology, Agricultural Microbiology Laboratory, Agricultural Research Institute, Nicosia 1516, Cyprus
- Dionysia Fasoula
  - Department of Plant Breeding, Agricultural Research Institute, Nicosia 1516, Cyprus
- Stefano Lonardi
  - To whom correspondence should be addressed. Tel: +1 951 827 2203; Fax: +1 951 827 4643;
13
Luo X, Wang Y, Zou Q, Xu L. Recall DNA methylation levels at low coverage sites using a CNN model in WGBS. PLoS Comput Biol 2023; 19:e1011205. [PMID: 37315069 DOI: 10.1371/journal.pcbi.1011205] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/30/2022] [Accepted: 05/22/2023] [Indexed: 06/16/2023] Open
Abstract
DNA methylation is an important regulator of gene transcription. WGBS is the gold-standard approach for quantifying DNA methylation at base-pair resolution, but it requires high sequencing depth. Many CpG sites have insufficient coverage in WGBS data, resulting in inaccurate DNA methylation levels at individual sites. Many state-of-the-art computational methods have been proposed to predict the missing values. However, many of these methods require either other omics datasets or cross-sample data, and most of them predict only the state of DNA methylation. In this study, we propose RcWGBS, which can impute the missing (or low-coverage) values from the DNA methylation levels on the adjacent sides. Deep learning techniques were employed for accurate prediction. The WGBS datasets of H1-hESC and GM12878 were down-sampled. The average differences between the DNA methylation level at 12× depth predicted by RcWGBS and that at >50× depth in the H1-hESC and GM12878 cells are less than 0.03 and 0.01, respectively. RcWGBS performed better than METHimpute even though the sequencing depth was as low as 12×. Our work will help in processing methylation data of low sequencing depth, allowing researchers to save sequencing costs and improve data utilization through computational methods.
Affiliation(s)
- Ximei Luo
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Yansu Wang
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
- Quan Zou
  - Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, Sichuan, China
  - Yangtze Delta Region Institute (Quzhou), University of Electronic Science and Technology of China, Quzhou, Zhejiang, China
- Lei Xu
  - School of Electronic and Communication Engineering, Shenzhen Polytechnic, Shenzhen, Guangdong, China
14
Dodlapati S, Jiang Z, Sun J. Completing Single-Cell DNA Methylome Profiles via Transfer Learning Together With KL-Divergence. Front Genet 2022; 13:910439. [PMID: 35938031 PMCID: PMC9353187 DOI: 10.3389/fgene.2022.910439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2022] [Accepted: 05/25/2022] [Indexed: 11/13/2022] Open
Abstract
The high level of sparsity in methylome profiles obtained using whole-genome bisulfite sequencing when the amount of biological material is low limits its value in the study of systems in which large samples are difficult to assemble, such as mammalian preimplantation embryonic development. The recently developed computational methods for addressing the sparsity by imputing missing values have their limits when the required minimum data coverage or profiles of the same tissue in other modalities are not available. In this study, we explored the use of transfer learning together with Kullback-Leibler (KL) divergence to train predictive models for completing methylome profiles with very low coverage (below 2%). Transfer learning was used to leverage less sparse profiles that are typically available for different tissues for the same species, while KL divergence was employed to maximize the usage of information carried in the input data. A deep neural network was adopted to extract both DNA sequence and local methylation patterns for imputation. Our study of training models for completing methylome profiles of bovine oocytes and early embryos demonstrates the effectiveness of transfer learning and KL divergence, with individual increases of 29.98 and 29.43%, respectively, in prediction performance and a 38.70% increase when the two were used together. The drastically increased data coverage (43.80-73.6%) after imputation powers downstream analyses involving methylomes that cannot be effectively done using the very low coverage profiles (0.06-1.47%) before imputation.
Affiliation(s)
- Sanjeeva Dodlapati
  - Department of Computer Science, Old Dominion University, Norfolk, VA, United States
- Zongliang Jiang
  - School of Animal Sciences, AgCenter, Louisiana State University, Baton Rouge, LA, United States
- Jiangwen Sun
  - Department of Computer Science, Old Dominion University, Norfolk, VA, United States