1
Shinkai N, Asada K, Machino H, Takasawa K, Takahashi S, Kouno N, Komatsu M, Hamamoto R, Kaneko S. SEgene identifies links between super enhancers and gene expression across cell types. NPJ Syst Biol Appl 2025; 11:49. [PMID: 40389443 PMCID: PMC12089303 DOI: 10.1038/s41540-025-00533-x] [Received: 01/15/2025] [Accepted: 05/11/2025] [Indexed: 05/21/2025]
Abstract
Enhancers are non-coding DNA regions that facilitate gene transcription, with a specialized subset, super-enhancers, known to exert exceptionally strong transcriptional activation effects. Super-enhancers have been implicated in oncogenesis and can be identified from histone-mark chromatin immunoprecipitation followed by sequencing (ChIP-seq) data using existing analytical tools. However, conventional super-enhancer detection methodologies often do not accurately reflect actual gene expression levels, and the large volume of identified super-enhancers complicates comprehensive analysis. To address these limitations, we developed the super-enhancer to gene links (SE-to-gene Links) analysis, a platform named "SEgene", which incorporates the peak-to-gene links approach, a statistical method designed to reveal correlations between genes and peak regions (https://github.com/hamamoto-lab/SEgene). This platform enables a targeted evaluation of super-enhancer regions in relation to gene expression, facilitating the identification of super-enhancers that are functionally linked to transcriptional activity. Here, we demonstrate the application of SE-to-gene Links analysis to public datasets, confirming its efficacy in accurately detecting super-enhancers and identifying functionally associated genes. Additionally, SE-to-gene Links analysis identified ERBB2 as a significant gene of interest in the lung adenocarcinoma dataset from the National Cancer Center Japan cohort, suggesting a potential impact across multiple patient samples. Thus, the SE-to-gene Links analysis provides an analytical tool for evaluating super-enhancers as potential therapeutic targets, supporting the identification of clinically significant super-enhancer regions and their functionally associated genes.
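The peak-to-gene links idea mentioned in this abstract, correlating peak signal with gene expression across samples, can be illustrated with a minimal sketch. This is not the SEgene implementation: the function name `peak_to_gene_links`, the Pearson statistic, the threshold `min_abs_r`, and the toy matrices are all assumptions made for illustration, and the sketch omits the genomic-distance window and significance testing that a real analysis would typically apply.

```python
import numpy as np

def peak_to_gene_links(peak_signal, gene_expr, min_abs_r=0.5):
    """Pearson-correlate every peak with every gene across samples.

    peak_signal: array of shape (n_peaks, n_samples)
    gene_expr:   array of shape (n_genes, n_samples)
    Returns a list of (peak_index, gene_index, r) with |r| >= min_abs_r.
    Rows with zero variance yield NaN correlations and are never reported.
    """
    p = np.asarray(peak_signal, dtype=float)
    g = np.asarray(gene_expr, dtype=float)
    # Center each row, then scale to unit length so a dot product equals r.
    p = p - p.mean(axis=1, keepdims=True)
    g = g - g.mean(axis=1, keepdims=True)
    p = p / np.linalg.norm(p, axis=1, keepdims=True)
    g = g / np.linalg.norm(g, axis=1, keepdims=True)
    r = p @ g.T  # (n_peaks, n_genes) correlation matrix
    hits = np.argwhere(np.abs(r) >= min_abs_r)
    return [(int(i), int(j), float(r[i, j])) for i, j in hits]

# Hypothetical toy data: 2 peaks and 2 genes measured in 4 samples.
peaks = [[1, 2, 3, 4], [4, 1, 3, 2]]
genes = [[2, 4, 6, 8], [1, 0, 1, 0]]
links = peak_to_gene_links(peaks, genes)
for peak_idx, gene_idx, r in links:
    print(f"peak {peak_idx} <-> gene {gene_idx}: r = {r:.3f}")
```

In this toy input, peak 0 tracks gene 0 perfectly and peak 1 tracks gene 1 closely, so only those two pairs pass the threshold.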
Affiliation(s)
- Norio Shinkai
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
  - Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, Japan
- Ken Asada
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Hidenori Machino
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Ken Takasawa
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Satoshi Takahashi
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Nobuji Kouno
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Masaaki Komatsu
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
- Ryuji Hamamoto
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
  - Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, Tokyo, Japan
- Syuzo Kaneko
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, Tokyo, Japan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, Tokyo, Japan
2
Rana V, Peng J, Pan C, Lyu H, Cheng A, Kim M, Milenkovic O. Interpretable online network dictionary learning for inferring long-range chromatin interactions. PLoS Comput Biol 2024; 20:e1012095. [PMID: 38753877 PMCID: PMC11135774 DOI: 10.1371/journal.pcbi.1012095] [Received: 12/16/2023] [Revised: 05/29/2024] [Accepted: 04/20/2024] [Indexed: 05/18/2024]
Abstract
Dictionary learning (DL), implemented via matrix factorization (MF), is commonly used in computational biology to tackle ubiquitous clustering problems. The method is favored due to its conceptual simplicity and relatively low computational complexity. However, DL algorithms produce results that lack interpretability in terms of real biological data. Additionally, they are not optimized for graph-structured data and hence often fail to handle it in a scalable manner. In order to address these limitations, we propose a novel DL algorithm called online convex network dictionary learning (online cvxNDL). Unlike classical DL algorithms, online cvxNDL is implemented via MF and designed to handle extremely large datasets by virtue of its online nature. Importantly, it enables the interpretation of dictionary elements, which serve as cluster representatives, through convex combinations of real measurements. Moreover, the algorithm can be applied to data with a network structure by incorporating specialized subnetwork sampling techniques. To demonstrate the utility of our approach, we apply cvxNDL to 3D-genome RNAPII ChIA-Drop data with the goal of identifying important long-range interaction patterns (long-range dictionary elements). ChIA-Drop probes higher-order interactions and produces data in the form of hypergraphs whose nodes represent genomic fragments. The hyperedges represent observed physical contacts. Our hypergraph model analysis has the objective of creating an interpretable dictionary of long-range interaction patterns that accurately represent global chromatin physical contact maps. Through the use of dictionary information, one can also associate the contact maps with RNA transcripts and infer cellular functions. To accomplish the task at hand, we focus on RNAPII-enriched ChIA-Drop data from Drosophila melanogaster S2 cell lines. Our results offer two key insights. First, we demonstrate that online cvxNDL retains the accuracy of classical DL (MF) methods while simultaneously ensuring unique interpretability and scalability. Second, we identify distinct collections of proximal and distal interaction patterns involving chromatin elements shared by related processes across different chromosomes, as well as patterns unique to specific chromosomes. To associate the dictionary elements with biological properties of the corresponding chromatin regions, we employ Gene Ontology (GO) enrichment analysis and perform multiple RNA coexpression studies.
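The interpretability claim in this abstract rests on each dictionary element being a convex combination of real measurements. The sketch below illustrates only that constraint, not the online cvxNDL algorithm itself: it projects an arbitrary weight vector onto the probability simplex (using the standard sort-based projection) and forms an element as the weighted sum of samples. The function names and the toy data are assumptions made for illustration.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto {w : w >= 0, sum(w) == 1}."""
    v = np.asarray(v, dtype=float)
    u = np.sort(v)[::-1]                 # entries in descending order
    css = np.cumsum(u) - 1.0             # shifted cumulative sums
    k = np.arange(1, v.size + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    theta = css[rho] / (rho + 1)         # shift that makes the result sum to 1
    return np.maximum(v - theta, 0.0)

def dictionary_element(samples, raw_weights):
    """Build one dictionary element as a convex combination of samples.

    samples: array of shape (n_features, n_samples), one real measurement
             per column.
    raw_weights: unconstrained weights, one per sample.
    Returns (element, weights), where weights lie on the simplex.
    """
    w = project_to_simplex(raw_weights)
    return np.asarray(samples, dtype=float) @ w, w

# Hypothetical toy data: 3 measurements with 2 features each.
samples = np.array([[1.0, 0.0, 2.0],
                    [0.0, 1.0, 2.0]])
element, weights = dictionary_element(samples, [0.5, 1.5, -1.0])
print("weights:", weights)
print("element:", element)
```

Because the weights are non-negative and sum to one, the element always lies inside the convex hull of the observed samples, which is what allows it to be read back in terms of real measurements.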
Affiliation(s)
- Vishal Rana
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
- Jianhao Peng
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
- Chao Pan
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
- Hanbaek Lyu
  - Department of Mathematics, University of Wisconsin - Madison, Madison, Wisconsin, United States of America
- Albert Cheng
  - School of Biological and Health Systems Engineering, Arizona State University, Phoenix, Arizona, United States of America
- Minji Kim
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Olgica Milenkovic
  - Department of Electrical and Computer Engineering, University of Illinois, Urbana-Champaign, Illinois, United States of America
3
Jing K, Xu Y, Yang Y, Yin P, Ning D, Huang G, Deng Y, Chen G, Li G, Tian SZ, Zheng M. ScSmOP: a universal computational pipeline for single-cell single-molecule multiomics data analysis. Brief Bioinform 2023; 24:bbad343. [PMID: 37779245 DOI: 10.1093/bib/bbad343] [Received: 04/27/2023] [Revised: 06/24/2023] [Accepted: 09/10/2023] [Indexed: 10/03/2023]
Abstract
Single-cell multiomics techniques have been widely applied to detect the key signature of cells. These methods have achieved a single-molecule resolution and can even reveal spatial localization. These emerging methods provide insights elucidating the features of genomic, epigenomic and transcriptomic heterogeneity in individual cells. However, they have given rise to new computational challenges in data processing. Here, we describe Single-cell Single-molecule multiple Omics Pipeline (ScSmOP), a universal pipeline for barcode-indexed single-cell single-molecule multiomics data analysis. Essentially, the C language is utilized in ScSmOP to set up spaced-seed hash table-based algorithms for barcode identification according to ligation-based barcoding data and synthesis-based barcoding data, followed by data mapping and deconvolution. We demonstrate high reproducibility of data processing between ScSmOP and published pipelines in comprehensive analyses of single-cell omics data (scRNA-seq, scATAC-seq, scARC-seq), single-molecule chromatin interaction data (ChIA-Drop, SPRITE, RD-SPRITE), single-cell single-molecule chromatin interaction data (scSPRITE) and spatial transcriptomic data from various cell types and species. Additionally, ScSmOP shows more rapid performance and is a versatile, efficient, easy-to-use and robust pipeline for single-cell single-molecule multiomics data analysis.
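The abstract describes spaced-seed hash tables for barcode identification. As a rough illustration of that general idea (ScSmOP itself is written in C, and its actual data structures are not given here), the Python sketch below indexes a barcode whitelist under two complementary spaced masks so that any read with a single substitution still matches under at least one mask. The masks, the whitelist, and all function names are assumptions made for illustration.

```python
from collections import defaultdict

def seed_key(seq, mask):
    """Keep only the bases at positions where the mask has a '1'."""
    return "".join(base for base, keep in zip(seq, mask) if keep == "1")

def build_index(whitelist, masks):
    """One hash table per mask, mapping seed key -> barcodes with that key."""
    index = {mask: defaultdict(list) for mask in masks}
    for barcode in whitelist:
        for mask in masks:
            index[mask][seed_key(barcode, mask)].append(barcode)
    return index

def hamming(a, b):
    """Number of mismatching positions between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def identify_barcode(read, index, max_mismatch=1):
    """Return the unique whitelist barcode within max_mismatch, else None."""
    candidates = set()
    for mask, table in index.items():
        candidates.update(table.get(seed_key(read, mask), ()))
    hits = [b for b in candidates if hamming(read, b) <= max_mismatch]
    return hits[0] if len(hits) == 1 else None

# Hypothetical 8-bp whitelist and two complementary spaced masks.
masks = ["10101010", "01010101"]
whitelist = ["ACGTACGT", "TTGGCCAA", "GATTACAG"]
index = build_index(whitelist, masks)

print(identify_barcode("ACGTACGT", index))  # exact match
print(identify_barcode("ACGTACCT", index))  # one substitution at position 6
print(identify_barcode("AAAAAAAA", index))  # no candidate in the whitelist
```

The design choice here is that a single substitution can disturb the key of only one of the two complementary masks, so the lookup stays a constant number of hash probes followed by a short verification step.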
Affiliation(s)
- Kai Jing
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yewen Xu
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yang Yang
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Pengfei Yin
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Duo Ning
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Guangyu Huang
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Yuqing Deng
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Gengzhan Chen
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Guoliang Li
  - National Key Laboratory of Crop Genetic Improvement, Hubei Hongshan Laboratory, Huazhong Agricultural University, Wuhan 430070, China
  - Agricultural Bioinformatics Key Laboratory of Hubei Province, Hubei Engineering Technology Research Center of Agricultural Big Data, 3D Genomics Research Center, College of Informatics, Huazhong Agricultural University, Wuhan 430070, China
- Simon Zhongyuan Tian
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
- Meizhen Zheng
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
  - Department of Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen 518055, China
4
Hamamoto R, Takasawa K, Shinkai N, Machino H, Kouno N, Asada K, Komatsu M, Kaneko S. Analysis of super-enhancer using machine learning and its application to medical biology. Brief Bioinform 2023; 24:bbad107. [PMID: 36960780 PMCID: PMC10199775 DOI: 10.1093/bib/bbad107] [Received: 10/28/2022] [Revised: 02/11/2023] [Accepted: 03/01/2023] [Indexed: 03/25/2023]
Abstract
The analysis of super-enhancers (SEs) has recently attracted attention in elucidating the molecular mechanisms of cancer and other diseases. SEs are genomic structures that strongly induce gene expression and have been reported to contribute to the overexpression of oncogenes. Because the analysis of SEs and integrated analysis with other data are performed using large amounts of genome-wide data, artificial intelligence technology, with machine learning at its core, has recently begun to be utilized. In promoting precision medicine, it is important to consider information from SEs in addition to genomic data; therefore, machine learning technology is expected to be introduced appropriately in terms of building a robust analysis platform with a high generalization performance. In this review, we explain the history and principles of SE, and the results of SE analysis using state-of-the-art machine learning and integrated analysis with other data are presented to provide a comprehensive understanding of the current status of SE analysis in the field of medical biology. Additionally, we compared the accuracy between existing machine learning methods on the benchmark dataset and attempted to explore the kind of data preprocessing and integration work needed to make the existing algorithms work on the benchmark dataset. Furthermore, we discuss the issues and future directions of current SE analysis.
Affiliation(s)
- Ryuji Hamamoto
  - Division Chief in the Division of Medical AI Research and Development, National Cancer Center Research Institute; a Professor in the Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University and a Team Leader of the Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project
- Ken Takasawa
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff in the Medical AI Research and Development, National Cancer Center Research Institute
- Norio Shinkai
  - Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University
- Hidenori Machino
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff in the Medical AI Research and Development, National Cancer Center Research Institute
- Nobuji Kouno
  - Department of Surgery, Graduate School of Medicine, Kyoto University
- Ken Asada
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff of Medical AI Research and Development, National Cancer Center Research Institute
- Masaaki Komatsu
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project and an External Research Staff of Medical AI Research and Development, National Cancer Center Research Institute
- Syuzo Kaneko
  - Division of Medical AI Research and Development, National Cancer Center Research Institute and a Visiting Scientist in the Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project
5
Tian SZ, Yin P, Jing K, Yang Y, Xu Y, Huang G, Ning D, Fullwood MJ, Zheng M. MCI-frcnn: A deep learning method for topological micro-domain boundary detection. Front Cell Dev Biol 2022; 10:1050769. [PMID: 36531953 PMCID: PMC9749004 DOI: 10.3389/fcell.2022.1050769] [Received: 09/22/2022] [Accepted: 11/07/2022] [Indexed: 11/22/2024]
Abstract
Chromatin structural domains, or topologically associated domains (TADs), are a general organizing principle in chromatin biology. RNA polymerase II (RNAPII) mediates multiple chromatin interactive loops, tethering together as RNAPII-associated chromatin interaction domains (RAIDs) to offer a framework for gene regulation. RAID and TAD alterations have been found to be associated with diseases. They can be further dissected as micro-domains (micro-TADs and micro-RAIDs) by clustering single-molecule chromatin-interactive complexes from next-generation three-dimensional (3D) genome techniques, such as ChIA-Drop. Currently, there are few tools available for micro-domain boundary identification. In this work, we developed the MCI-frcnn deep learning method to train a Faster Region-based Convolutional Neural Network (Faster R-CNN) for micro-domain boundary detection. In the training phase of MCI-frcnn, 50 images of RAIDs from Drosophila RNAPII ChIA-Drop data, containing 261 micro-RAIDs with ground truth boundaries, were used for training over 7 days. Using this well-trained MCI-frcnn, we detected micro-RAID boundaries in new input images at high speed (5.26 fps), with high recognition accuracy (AUROC = 0.85, mAP = 0.69) and high boundary region quantification (genomic IoU = 76%). We further applied MCI-frcnn to detect human micro-TAD boundaries using human GM12878 SPRITE data and obtained a high region quantification score (mean gIoU = 85%). In all, the MCI-frcnn deep learning method developed in this work is a general tool for micro-domain boundary detection.
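The abstract reports boundary quality as a genomic intersection over union (IoU) but does not define it. A minimal sketch, assuming the usual one-dimensional IoU between a predicted and a ground-truth interval on the same chromosome, is shown below; the function name, the half-open coordinate convention, and the example coordinates are assumptions, and the paper's exact definition may differ.

```python
def genomic_iou(pred, truth):
    """IoU of two half-open genomic intervals (start, end) on one chromosome."""
    (ps, pe), (ts, te) = pred, truth
    # Overlap length is zero when the intervals do not intersect.
    inter = max(0, min(pe, te) - max(ps, ts))
    union = (pe - ps) + (te - ts) - inter
    return inter / union if union > 0 else 0.0

# Hypothetical predicted and ground-truth micro-domain coordinates (bp).
print(genomic_iou((1000, 5000), (2000, 6000)))  # overlap 3000 bp, union 5000 bp
print(genomic_iou((0, 10), (20, 30)))           # disjoint intervals
```

A mean over all matched domain pairs would then give a summary score comparable in spirit to the reported mean gIoU.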
Affiliation(s)
- Simon Zhongyuan Tian
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Pengfei Yin
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Kai Jing
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Yang Yang
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Yewen Xu
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Guangyu Huang
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Duo Ning
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
- Melissa J. Fullwood
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
  - Cancer Science Institute of Singapore, National University of Singapore, Singapore, Singapore
  - Institute of Molecular and Cell Biology, Agency for Science, Technology and Research (A*STAR), Singapore, Singapore
- Meizhen Zheng
  - Shenzhen Key Laboratory of Gene Regulation and Systems Biology, School of Life Sciences, Southern University of Science and Technology, Shenzhen, China
6
Wen N, Liu G, Zhang J, Zhang R, Fu Y, Han X. A fingerprints based molecular property prediction method using the BERT model. J Cheminform 2022; 14:71. [PMID: 36271394 PMCID: PMC9585730 DOI: 10.1186/s13321-022-00650-3] [Received: 03/16/2022] [Accepted: 10/09/2022] [Indexed: 11/10/2022]
Abstract
Molecular property prediction (MPP) is vital in drug discovery and drug repositioning. Deep learning-based MPP models capture molecular property-related features from various molecule representations. In this paper, we propose a molecule sequence embedding and prediction model for the MPP task. We pre-trained a Bidirectional Encoder Representations from Transformers (BERT) encoder in a self-supervised manner to obtain the semantic representation of compound fingerprints, called Fingerprints-BERT (FP-BERT). The molecular representation encoded by FP-BERT is then fed into a convolutional neural network (CNN) to extract higher-level abstract features, and the predicted properties of the molecule are finally obtained through a fully connected layer for distinct classification or regression MPP tasks. Comparison with the baselines shows that the proposed model achieves high prediction performance on all of the classification and regression tasks.
Affiliation(s)
- Naifeng Wen
  - School of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian, China
- Guanqun Liu
  - School of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian, China
- Jie Zhang
  - Beijing Huawei Digital Technologies Co., Ltd, Beijing, China
- Rubo Zhang
  - School of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian, China
- Yating Fu
  - School of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian, China
- Xu Han
  - School of Mechanical and Electronic Engineering, Dalian Minzu University, Dalian, China