1
Liu Y, Pei W, Chen L, Xia Y, Yan H, Hu X. scCorrect: Cross-modality label transfer from scRNA-seq to scATAC-seq using domain adaptation. Anal Biochem 2025; 702:115847. [PMID: 40154828 DOI: 10.1016/j.ab.2025.115847] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 03/10/2025] [Accepted: 03/15/2025] [Indexed: 04/01/2025]
Abstract
Cell type annotation in single-cell chromatin accessibility sequencing (scATAC-seq) is crucial for enabling researchers to identify subpopulations of cells associated with specific diseases, elucidate gene regulatory networks, and discover markers indicative of disease states. The prevailing approach for cell type annotation in single-cell research involves transferring well-delineated cell types from single-cell RNA sequencing (scRNA-seq) data to scATAC-seq data using a label propagation algorithm. However, the inherent modal discrepancies (i.e., biological interpretation) between scRNA-seq and scATAC-seq data, coupled with the intrinsic sparsity and high dimensionality of scATAC-seq data, pose significant challenges to the efficacy of this strategy. To address these challenges, we introduce a novel neural network framework, scCorrect, which operates in two distinct phases. In the first phase, scCorrect aligns the scRNA-seq and scATAC-seq datasets, generating initial annotation results. The second phase involves training a corrective network specifically designed to amend any erroneous annotations produced during the first phase. Empirical tests across multiple datasets have demonstrated that scCorrect consistently achieves superior recognition accuracy, underscoring its significant potential to enhance disease-related research in humans.
Affiliation(s)
- Yan Liu
- Department of Computer Science, Yangzhou University, Yangzhou, 225100, PR China.
- Wenyi Pei
- Geriatric Department, Shanghai Baoshan District Wusong Central Hospital, Tongtai North Road 101, Shanghai, 200940, PR China
- Li Chen
- Department of Computer Science, Yangzhou University, Yangzhou, 225100, PR China
- Yu Xia
- Department of Computer Science, Yangzhou University, Yangzhou, 225100, PR China
- He Yan
- College of Information Science and Technology, Nanjing Forestry University, Nanjing, 210037, PR China
- Xiaohua Hu
- Geriatric Department, Shanghai Baoshan District Wusong Central Hospital, Tongtai North Road 101, Shanghai, 200940, PR China; Digital Innovation Laboratory, The First Affiliated Hospital of Naval Medical University, Changhai Road 168, Shanghai, 200433, PR China.
2
Yates J, Van Allen EM. New horizons at the interface of artificial intelligence and translational cancer research. Cancer Cell 2025; 43:708-727. [PMID: 40233719 PMCID: PMC12007700 DOI: 10.1016/j.ccell.2025.03.018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2025] [Revised: 03/04/2025] [Accepted: 03/12/2025] [Indexed: 04/17/2025]
Abstract
Artificial intelligence (AI) is increasingly being utilized in cancer research as a computational strategy for analyzing multiomics datasets. Advances in single-cell and spatial profiling technologies have contributed significantly to our understanding of tumor biology, and AI methodologies are now being applied to accelerate translational efforts, including target discovery, biomarker identification, patient stratification, and therapeutic response prediction. Despite these advancements, the integration of AI into clinical workflows remains limited, presenting both challenges and opportunities. This review discusses AI applications in multiomics analysis and translational oncology, emphasizing their role in advancing biological discoveries and informing clinical decision-making. Key areas of focus include cellular heterogeneity, tumor microenvironment interactions, and AI-aided diagnostics. Challenges such as reproducibility, interpretability of AI models, and clinical integration are explored, with attention to strategies for addressing these hurdles. Together, these developments underscore the potential of AI and multiomics to enhance precision oncology and contribute to advancements in cancer care.
Affiliation(s)
- Josephine Yates
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA; Institute for Machine Learning, Department of Computer Science, ETH Zürich, Zurich, Switzerland; ETH AI Center, ETH Zurich, Zurich, Switzerland; Swiss Institute for Bioinformatics (SIB), Lausanne, Switzerland
- Eliezer M Van Allen
- Department of Medical Oncology, Dana-Farber Cancer Institute, Boston, MA, USA; Cancer Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA; Division of Medical Sciences, Harvard University, Boston, MA, USA; Parker Institute for Cancer Immunotherapy, Dana-Farber Cancer Institute, Boston, MA, USA.
3
Zhao T, Gu Y, Yang J, Usuyama N, Lee HH, Kiblawi S, Naumann T, Gao J, Crabtree A, Abel J, Moung-Wen C, Piening B, Bifulco C, Wei M, Poon H, Wang S. A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities. Nat Methods 2025; 22:166-176. [PMID: 39558098 DOI: 10.1038/s41592-024-02499-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 05/21/2024] [Accepted: 10/02/2024] [Indexed: 11/20/2024]
Abstract
Biomedical image analysis is fundamental for biomedical discovery. Holistic image analysis comprises interdependent subtasks such as segmentation, detection and recognition, which are tackled separately by traditional approaches. Here, we propose BiomedParse, a biomedical foundation model that can jointly conduct segmentation, detection and recognition across nine imaging modalities. This joint learning improves the accuracy for individual tasks and enables new applications such as segmenting all relevant objects in an image through a textual description. To train BiomedParse, we created a large dataset comprising over 6 million triples of image, segmentation mask and textual description by leveraging natural language labels or descriptions accompanying existing datasets. We showed that BiomedParse outperformed existing methods on image segmentation across nine imaging modalities, with larger improvement on objects with irregular shapes. We further showed that BiomedParse can simultaneously segment and label all objects in an image. In summary, BiomedParse is an all-in-one tool for biomedical image analysis on all major image modalities, paving the path for efficient and accurate image-based biomedical discovery.
Affiliation(s)
- Yu Gu
- Microsoft Research, Redmond, WA, USA
- Angela Crabtree
- Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA
- Brian Piening
- Providence Genomics, Portland, OR, USA
- Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA
- Carlo Bifulco
- Providence Genomics, Portland, OR, USA
- Earle A. Chiles Research Institute, Providence Cancer Institute, Portland, OR, USA
- Mu Wei
- Microsoft Research, Redmond, WA, USA.
- Sheng Wang
- Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA.
- Department of Surgery, University of Washington, Seattle, WA, USA.
4
Luo S, Zhu M, Lin L, Xie J, Lin S, Chen Y, Zhu J, Huang J. DECA: harnessing interpretable transformer model for cellular deconvolution of chromatin accessibility profile. Brief Bioinform 2024; 26:bbaf069. [PMID: 39987573 PMCID: PMC11847511 DOI: 10.1093/bib/bbaf069] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/17/2024] [Revised: 01/09/2025] [Accepted: 02/06/2025] [Indexed: 02/25/2025] Open
Abstract
The assay for transposase-accessible chromatin with sequencing (ATAC-seq) identifies chromatin accessibility across the genome, which is crucial for regulating gene expression. However, bulk ATAC-seq obscures cellular heterogeneity, while single-cell ATAC-seq suffers from issues such as sparsity and high cost. To this end, we introduce DECA, a sophisticated deep learning model based on a vision transformer to deconvolve cell type information from bulk chromatin accessibility profiles, utilizing single-cell ATAC-seq datasets as a reference for enhanced precision and resolution. Notably, patch attention generated by DECA's multi-head attention mechanism aligns with chromatin interactions detected by Hi-C. Additionally, DECA predicted lineage-specific cell composition changes due to genetic perturbation. The chromatin accessibility signatures predicted by DECA are enriched with cell-type-specific genetic variations. Ultimately, we applied DECA to pan-cancer ATAC-seq datasets and demonstrated its capability to deconvolve cell type proportions with clinical significance. Taken together, DECA deconvolves cellular proportions and predicts their chromatin accessibility profiles from bulk chromatin accessibility data, which enables exploration of the gene regulatory programs in development and diseases.
Affiliation(s)
- Shijie Luo
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- Ming Zhu
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- Liquan Lin
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- Jiajing Xie
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- Shihao Lin
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- Ying Chen
- School of Informatics, Xiamen University, No. 4221, Xiang'an South Road, Fujian 361000, China
- Jiali Zhu
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- Jialiang Huang
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221, Xiang'an South Road, Xiamen, Fujian 361102, China
5
Tang Z, Chen G, Chen S, He H, You L, Chen CYC. Knowledge-based inductive bias and domain adaptation for cell type annotation. Commun Biol 2024; 7:1440. [PMID: 39501016 PMCID: PMC11538527 DOI: 10.1038/s42003-024-07171-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Accepted: 10/30/2024] [Indexed: 11/08/2024] Open
Abstract
Measurement techniques often result in domain gaps among batches of cellular data from a specific modality. The effectiveness of cross-batch annotation methods is influenced by inductive bias, which refers to a set of assumptions that describe the behavior of model predictions. Different annotation methods possess distinct inductive biases, leading to varying degrees of generalizability and interpretability. Given that certain cell types exhibit unique functional patterns, we hypothesize that the inductive biases of cell annotation methods should align with these biological patterns to produce meaningful predictions. In this study, we propose KIDA, Knowledge-based Inductive bias and Domain Adaptation. The knowledge-based inductive bias constrains the prediction rules learned from the reference dataset, composed of multiple batches, to functional patterns relevant to biology, thereby enhancing the generalization of the model to unseen batches. Since the query dataset also contains gaps from multiple batches, KIDA's domain adaptation employs pseudo labels for self-knowledge distillation, effectively narrowing the distribution gap between model predictions and the query dataset. Benchmark experiments demonstrate that KIDA is capable of achieving accurate cross-batch cell type annotation.
Affiliation(s)
- Zhenchao Tang
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
- Guanxing Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
- Shouzhi Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
- Haohuai He
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China
- Linlin You
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China.
- Calvin Yu-Chian Chen
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China.
- State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Genomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, China.
- Department of Medical Research, China Medical University Hospital, Taichung, Taiwan.
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan.
- Guangdong L-Med Biotechnology Co., Ltd., Meizhou, China.
6
Altay A, Vingron M. scATAcat: cell-type annotation for scATAC-seq data. NAR Genom Bioinform 2024; 6:lqae135. [PMID: 39380946 PMCID: PMC11459382 DOI: 10.1093/nargab/lqae135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2024] [Revised: 09/11/2024] [Accepted: 09/23/2024] [Indexed: 10/10/2024] Open
Abstract
Cells whose accessibility landscape has been profiled with scATAC-seq cannot readily be annotated to a particular cell type. In fact, annotating cell types in scATAC-seq data is a challenging task since, unlike in scRNA-seq data, we lack knowledge of 'marker regions' which could be used for cell-type annotation. Current annotation methods typically translate accessibility to expression space and rely on gene expression patterns. We propose a novel approach, scATAcat, that leverages characterized bulk ATAC-seq data as prototypes to annotate scATAC-seq data. To mitigate the inherent sparsity of single-cell data, we aggregate cells that belong to the same cluster and create pseudobulk profiles. To demonstrate the feasibility of our approach, we collected a number of datasets with respective annotations to quantify the results and evaluate performance for scATAcat. scATAcat is available as a Python package at https://github.com/aybugealtay/scATAcat.
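The pseudobulk step mentioned in this abstract can be illustrated with a minimal sketch. It is not taken from the scATAcat package: the function name `pseudobulk`, the toy matrix, and the choice of summing counts per cluster are assumptions made for illustration only.

```python
import numpy as np

def pseudobulk(counts, cluster_labels):
    """Sum a cell-by-peak count matrix within each cluster (hypothetical helper)."""
    counts = np.asarray(counts)
    labels = np.asarray(cluster_labels)
    clusters = np.unique(labels)  # sorted unique cluster ids
    # One aggregated accessibility profile per cluster
    profiles = np.vstack([counts[labels == c].sum(axis=0) for c in clusters])
    return clusters, profiles

# Toy data: 4 cells x 4 peaks, two clusters of two cells each
counts = np.array([
    [1, 0, 0, 1],
    [0, 0, 1, 1],
    [0, 1, 0, 0],
    [1, 1, 0, 0],
])
labels = np.array([0, 0, 1, 1])

clusters, profiles = pseudobulk(counts, labels)
print(clusters)
print(profiles)
```

Each aggregated row has fewer zero entries than the individual cells that feed it, which is the sense in which aggregation mitigates sparsity. Comparing such profiles against bulk ATAC-seq prototypes is a separate step that is not sketched here.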
Affiliation(s)
- Aybuge Altay
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
- Martin Vingron
- Department of Computational Molecular Biology, Max Planck Institute for Molecular Genetics, Ihnestraße 63-73, 14195 Berlin, Germany
7
LeRoy N, Smith J, Zheng G, Rymuza J, Gharavi E, Brown D, Zhang A, Sheffield N. Fast clustering and cell-type annotation of scATAC data using pre-trained embeddings. NAR Genom Bioinform 2024; 6:lqae073. [PMID: 38974799 PMCID: PMC11224678 DOI: 10.1093/nargab/lqae073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/27/2023] [Revised: 04/29/2024] [Accepted: 06/20/2024] [Indexed: 07/09/2024] Open
Abstract
Data from the single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) are now widely available. One major computational challenge is dealing with high dimensionality and inherent sparsity, which is typically addressed by producing lower dimensional representations of single cells for downstream clustering tasks. Current approaches produce such individual cell embeddings directly through a one-step learning process. Here, we propose an alternative approach by building embedding models pre-trained on reference data. We argue that this provides a more flexible analysis workflow that also has computational performance advantages through transfer learning. We implemented our approach in scEmbed, an unsupervised machine-learning framework that learns low-dimensional embeddings of genomic regulatory regions to represent and analyze scATAC-seq data. scEmbed performs well in terms of clustering ability and has the key advantage of learning patterns of region co-occurrence that can be transferred to other, unseen datasets. Moreover, models pre-trained on reference data can be exploited to build fast and accurate cell-type annotation systems without the need for other data modalities. scEmbed is implemented in Python and it is available to download from GitHub. We also make our pre-trained models available on huggingface for public use. scEmbed is open source and available at https://github.com/databio/geniml. Pre-trained models from this work can be obtained on huggingface: https://huggingface.co/databio.
Affiliation(s)
- Nathan J LeRoy
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Jason P Smith
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Guangtao Zheng
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Julia Rymuza
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Erfaneh Gharavi
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Donald E Brown
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Systems and Information Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Aidong Zhang
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Nathan C Sheffield
- Center for Public Health Genomics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Biomedical Engineering, School of Medicine, University of Virginia, Charlottesville, VA 22904, USA
- Department of Biochemistry and Molecular Genetics, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Child Health Research Center, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
- Department of Computer Science, School of Engineering, University of Virginia, Charlottesville, VA 22908, USA
- School of Data Science, University of Virginia, Charlottesville, VA 22904, USA
- Department of Public Health Sciences, School of Medicine, University of Virginia, Charlottesville, VA 22908, USA
8
Lin Y, Pan Z, Zeng Y, Yang Y, Dai Z. Detecting novel cell type in single-cell chromatin accessibility data via open-set domain adaptation. Brief Bioinform 2024; 25:bbae370. [PMID: 39073828 DOI: 10.1093/bib/bbae370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Revised: 06/27/2024] [Accepted: 07/15/2024] [Indexed: 07/30/2024] Open
Abstract
Recent advances in single-cell technologies have enabled the rapid growth of multi-omics data. Cell type annotation is a common task in analyzing single-cell data. One challenge is that some cell types in the testing set are not present in the training set (i.e., unknown cell types). Most scATAC-seq cell type annotation methods assign each cell in the testing set to one known type in the training set but neglect unknown cell types. Here, we present OVAAnno, an automatic cell type annotation method that uses open-set domain adaptation to detect unknown cell types in scATAC-seq data. Comprehensive experiments show that OVAAnno successfully identifies known and unknown cell types. Further experiments demonstrate that OVAAnno also performs well on scRNA-seq data. Our code is available online at https://github.com/lisaber/OVAAnno/tree/master.
Affiliation(s)
- Yuefan Lin
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Zixiang Pan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Yuansong Zeng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Zhiming Dai
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
9
Cao Y, Zhao X, Tang S, Jiang Q, Li S, Li S, Chen S. scButterfly: a versatile single-cell cross-modality translation method via dual-aligned variational autoencoders. Nat Commun 2024; 15:2973. [PMID: 38582890 PMCID: PMC10998864 DOI: 10.1038/s41467-024-47418-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/23/2023] [Accepted: 03/28/2024] [Indexed: 04/08/2024] Open
Abstract
Recent advancements for simultaneously profiling multi-omics modalities within individual cells have enabled the interrogation of cellular heterogeneity and molecular hierarchy. However, technical limitations lead to highly noisy multi-modal data and substantial costs. Although computational methods have been proposed to translate single-cell data across modalities, broad applications of the methods still remain impeded by formidable challenges. Here, we propose scButterfly, a versatile single-cell cross-modality translation method based on dual-aligned variational autoencoders and data augmentation schemes. With comprehensive experiments on multiple datasets, we provide compelling evidence of scButterfly's superiority over baseline methods in preserving cellular heterogeneity while translating datasets of various contexts and in revealing cell type-specific biological insights. Besides, we demonstrate the extensive applications of scButterfly for integrative multi-omics analysis of single-modality data, data enhancement of poor-quality single-cell multi-omics, and automatic cell type annotation of scATAC-seq data. Moreover, scButterfly can be generalized to unpaired data training, perturbation-response analysis, and consecutive translation.
Affiliation(s)
- Yichuan Cao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
- Xiamiao Zhao
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
- Songming Tang
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
- Qun Jiang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division of BNRIST, Department of Automation, Tsinghua University, 100084, Beijing, China
- Sijie Li
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China
- Siyu Li
- School of Statistics and Data Science, Nankai University, Tianjin, 300071, China
- Shengquan Chen
- School of Mathematical Sciences and LPMC, Nankai University, Tianjin, 300071, China.
10
Zeng Y, Luo M, Shangguan N, Shi P, Feng J, Xu J, Chen K, Lu Y, Yu W, Yang Y. Deciphering cell types by integrating scATAC-seq data with genome sequences. Nat Comput Sci 2024; 4:285-298. [PMID: 38600256 DOI: 10.1038/s43588-024-00622-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 03/18/2024] [Indexed: 04/12/2024]
Abstract
The single-cell assay for transposase-accessible chromatin using sequencing (scATAC-seq) technology provides insight into gene regulation and epigenetic heterogeneity at single-cell resolution, but cell annotation from scATAC-seq remains challenging due to high dimensionality and extreme sparsity within the data. Existing cell annotation methods mostly focus on the cell peak matrix without fully utilizing the underlying genomic sequence. Here we propose a method, SANGO, for accurate single-cell annotation by integrating genome sequences around the accessibility peaks within scATAC data. The genome sequences of peaks are encoded into low-dimensional embeddings, and then iteratively used to reconstruct the peak statistics of cells through a fully connected network. The learned weights are considered as regulatory modes to represent cells, and utilized to align the query cells and the annotated cells in the reference data through a graph transformer network for cell annotations. SANGO was demonstrated to consistently outperform competing methods on 55 paired scATAC-seq datasets across samples, platforms and tissues. SANGO was also shown to be able to detect unknown tumor cells through attention edge weights learned by the graph transformer. Moreover, from the annotated cells, we found cell-type-specific peaks that provide functional insights/biological signals through expression enrichment analysis, cis-regulatory chromatin interaction analysis and motif enrichment analysis.
Affiliation(s)
- Yuansong Zeng
- School of Big Data and Software Engineering, Chongqing University, Chongqing, China
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Mai Luo
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Ningyuan Shangguan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Peiyu Shi
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Junxi Feng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Jin Xu
- State Key Laboratory of Biocontrol, School of Life Sciences, Sun Yat-sen University, Guangzhou, China
- Ken Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yutong Lu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Weijiang Yu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, China.
- Key Laboratory of Machine Intelligence and Advanced Computing (MOE), Guangzhou, China.
11
Lu C, Wei Y, Abbas M, Agula H, Wang E, Meng Z, Zhang R. Application of Single-Cell Assay for Transposase-Accessible Chromatin with High Throughput Sequencing in Plant Science: Advances, Technical Challenges, and Prospects. Int J Mol Sci 2024; 25:1479. [PMID: 38338756 PMCID: PMC10855595 DOI: 10.3390/ijms25031479] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2023] [Revised: 01/16/2024] [Accepted: 01/23/2024] [Indexed: 02/12/2024] Open
Abstract
The Single-cell Assay for Transposase-Accessible Chromatin with high throughput sequencing (scATAC-seq) has gained increasing popularity in recent years, allowing for chromatin accessibility to be deciphered and gene regulatory networks (GRNs) to be inferred at single-cell resolution. This cutting-edge technology now enables the genome-wide profiling of chromatin accessibility at the cellular level and the capturing of cell-type-specific cis-regulatory elements (CREs) that are masked by cellular heterogeneity in bulk assays. Additionally, it can also facilitate the identification of rare and new cell types based on differences in chromatin accessibility and the charting of cellular developmental trajectories within lineage-related cell clusters. Due to technical challenges and limitations, the data generated from scATAC-seq exhibit unique features, often characterized by high sparsity and noise, even within the same cell type. To address these challenges, various bioinformatic tools have been developed. Furthermore, the application of scATAC-seq in plant science is still in its infancy, with most research focusing on root tissues and model plant species. In this review, we provide an overview of recent progress in scATAC-seq and its application across various fields. We first conduct scATAC-seq in plant science. Next, we highlight the current challenges of scATAC-seq in plant science and major strategies for cell type annotation. Finally, we outline several future directions to exploit scATAC-seq technologies to address critical challenges in plant science, ranging from plant ENCODE(The Encyclopedia of DNA Elements) project construction to GRN inference, to deepen our understanding of the roles of CREs in plant biology.
Affiliation(s)
- Chao Lu
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
  - Key Laboratory of Herbage & Endemic Crop Biology, Ministry of Education, School of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Yunxiao Wei
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Mubashir Abbas
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Hasi Agula
  - Key Laboratory of Herbage & Endemic Crop Biology, Ministry of Education, School of Life Sciences, Inner Mongolia University, Hohhot 010070, China
- Edwin Wang
  - Cumming School of Medicine, University of Calgary, Calgary, AB T2N 4N1, Canada
- Zhigang Meng
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
- Rui Zhang
  - Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China
12
Jiang Y, Hu Z, Lynch AW, Jiang J, Zhu A, Zhang Y, Xie Y, Li R, Zhou N, Meyer CA, Cejas P, Brown M, Long HW, Qiu X. scATAnno: Automated Cell Type Annotation for single-cell ATAC Sequencing Data. bioRxiv 2024:2023.06.01.543296. [PMID: 37333088 PMCID: PMC10274707 DOI: 10.1101/2023.06.01.543296] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Indexed: 06/20/2023]
Abstract
The recent advances in single-cell epigenomic techniques have created a growing demand for scATAC-seq analysis. One key task is to determine cell types based on epigenetic profiling. We introduce scATAnno, a workflow designed to automatically annotate scATAC-seq data using large-scale scATAC-seq reference atlases. This workflow can generate scATAC-seq reference atlases from publicly available datasets, and enable accurate cell type annotation by integrating query data with reference atlases, without the aid of scRNA-seq profiling. To enhance annotation accuracy, we have incorporated KNN-based and weighted distance-based uncertainty scores to effectively detect unknown cell populations within the query data. We showcase the utility of scATAnno across multiple datasets, including peripheral blood mononuclear cell (PBMC), basal cell carcinoma (BCC) and Triple Negative Breast Cancer (TNBC), and demonstrate that scATAnno accurately annotates cell types across conditions. Overall, scATAnno is a powerful tool for cell type annotation in scATAC-seq data and can aid in the interpretation of new scATAC-seq datasets in complex biological systems.
13
Qian FC, Zhou LW, Zhu YB, Li YY, Yu ZM, Feng CC, Fang QL, Zhao Y, Cai FH, Wang QY, Tang HF, Li CQ. scATAC-Ref: a reference of scATAC-seq with known cell labels in multiple species. Nucleic Acids Res 2024; 52:D285-D292. [PMID: 37897340 PMCID: PMC10767920 DOI: 10.1093/nar/gkad924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 09/14/2023] [Accepted: 10/16/2023] [Indexed: 10/30/2023]
Abstract
Chromatin accessibility profiles at single cell resolution can reveal cell type-specific regulatory programs, help dissect highly specialized cell functions and trace cell origin and evolution. Accurate cell type assignment is critical for effectively gaining biological and pathological insights, but is difficult in scATAC-seq. Hence, by extensively reviewing the literature, we designed scATAC-Ref (https://bio.liclab.net/scATAC-Ref/), a manually curated scATAC-seq database aimed at providing a comprehensive, high-quality source of chromatin accessibility profiles with known cell labels across broad cell types. Currently, scATAC-Ref comprises 1 694 372 cells with known cell labels, across various biological conditions, >400 cell/tissue types and five species. We used uniform system environment and software parameters to perform comprehensive downstream analysis on these chromatin accessibility profiles with known labels, including gene activity score, TF enrichment score, differential chromatin accessibility regions, pathway/GO term enrichment analysis and co-accessibility interactions. The scATAC-Ref also provided a user-friendly interface to query, browse and visualize cell types of interest, thereby providing a valuable resource for exploring epigenetic regulation in different tissues and cell types.
Affiliation(s)
- Feng-Cui Qian
  - The First Affiliated Hospital & Hunan Provincial Key Laboratory of Multi-omics And Artificial Intelligence of Cardiovascular Diseases, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences & MOE Key Lab of Rare Pediatric Diseases, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - The First Affiliated Hospital, Cardiovascular Lab of Big Data and Imaging Artificial Intelligence, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
- Li-Wei Zhou
  - State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Yan-Bing Zhu
  - Beijing Clinical Research Institute, Beijing Friendship Hospital, Capital Medical University, Beijing, China
- Yan-Yu Li
  - School of Medical Informatics, Daqing Campus, Harbin Medical University, Daqing, 163319, China
- Zheng-Min Yu
  - School of Computer, University of South China, Hengyang, Hunan, 421001, China
- Chen-Chen Feng
  - School of Medical Informatics, Daqing Campus, Harbin Medical University, Daqing, 163319, China
- Qiao-Li Fang
  - School of Computer, University of South China, Hengyang, Hunan, 421001, China
- Yu Zhao
  - School of Computer, University of South China, Hengyang, Hunan, 421001, China
- Fu-Hong Cai
  - School of Computer, University of South China, Hengyang, Hunan, 421001, China
- Qiu-Yu Wang
  - The First Affiliated Hospital, Cardiovascular Lab of Big Data and Imaging Artificial Intelligence, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
- Hui-Fang Tang
  - The First Affiliated Hospital & Hunan Provincial Key Laboratory of Multi-omics And Artificial Intelligence of Cardiovascular Diseases, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - The First Affiliated Hospital, Institute of Cardiovascular Disease, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - The First Affiliated Hospital, Department of Cardiology, Hengyang Medical School, University of South China, Hengyang, China
- Chun-Quan Li
  - The First Affiliated Hospital & Hunan Provincial Key Laboratory of Multi-omics And Artificial Intelligence of Cardiovascular Diseases, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - Hunan Provincial Maternal and Child Health Care Hospital, National Health Commission Key Laboratory of Birth Defect Research and Prevention, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - Department of Biochemistry and Molecular Biology, School of Basic Medical Sciences & MOE Key Lab of Rare Pediatric Diseases, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
  - School of Computer, University of South China, Hengyang, Hunan, 421001, China
  - The First Affiliated Hospital, Cardiovascular Lab of Big Data and Imaging Artificial Intelligence, Hengyang Medical School, University of South China, Hengyang, Hunan, 421001, China
14
Tian L, Xie Y, Xie Z, Tian J, Tian W. AtacAnnoR: a reference-based annotation tool for single cell ATAC-seq data. Brief Bioinform 2023; 24:bbad268. [PMID: 37497729 DOI: 10.1093/bib/bbad268] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/14/2023] [Revised: 06/14/2023] [Accepted: 07/04/2023] [Indexed: 07/28/2023]
Abstract
Here, we present AtacAnnoR, a two-round annotation method for scATAC-seq data using well-annotated scRNA-seq data as reference. We evaluate AtacAnnoR's performance against six competing methods on 11 benchmark datasets. Our results show that AtacAnnoR achieves the highest mean accuracy and the highest mean balanced accuracy and performs particularly well when unpaired scRNA-seq data are used as the reference. Furthermore, AtacAnnoR implements a 'Combine and Discard' strategy to further improve annotation accuracy when annotations of multiple references are available. AtacAnnoR has been implemented in an R package and can be directly integrated into currently popular scATAC-seq analysis pipelines.
Affiliation(s)
- Lejin Tian
  - State Key Laboratory of Genetic Engineering, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai, China
- Yunxiao Xie
  - State Key Laboratory of Genetic Engineering, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai, China
- Zhaobin Xie
  - State Key Laboratory of Genetic Engineering, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai, China
- Weidong Tian
  - State Key Laboratory of Genetic Engineering, Department of Computational Biology, School of Life Sciences, Fudan University, Shanghai, China
  - Children's Hospital of Fudan University, Shanghai, China
  - Children's Hospital of Shandong University, Jinan, China
15
Logotheti S, Papadaki E, Zolota V, Logothetis C, Vrahatis AG, Soundararajan R, Tzelepi V. Lineage Plasticity and Stemness Phenotypes in Prostate Cancer: Harnessing the Power of Integrated "Omics" Approaches to Explore Measurable Metrics. Cancers (Basel) 2023; 15:4357. [PMID: 37686633 PMCID: PMC10486655 DOI: 10.3390/cancers15174357] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/31/2023] [Revised: 08/21/2023] [Accepted: 08/25/2023] [Indexed: 09/10/2023]
Abstract
Prostate cancer (PCa), the most frequent and second most lethal cancer type in men in developed countries, is a highly heterogeneous disease. PCa heterogeneity, therapy resistance, stemness, and lethal progression have been attributed to lineage plasticity, which refers to the ability of neoplastic cells to undergo phenotypic changes under microenvironmental pressures by switching between developmental cell states. What remains to be elucidated is how to identify measurements of lineage plasticity, how to implement them to inform preclinical and clinical research, and, further, how to classify patients and inform therapeutic strategies in the clinic. Recent research has highlighted the crucial role of next-generation sequencing technologies in identifying potential biomarkers associated with lineage plasticity. Here, we review the genomic, transcriptomic, and epigenetic events that have been described in PCa and highlight those with significance for lineage plasticity. We further focus on their relevance in PCa research and their benefits in PCa patient classification. Finally, we explore ways in which bioinformatic analyses can be used to determine lineage plasticity based on large omics analyses and algorithms that can shed light on upstream and downstream events. Most importantly, an integrated multiomics approach may soon allow for the identification of a lineage plasticity signature, which would revolutionize the molecular classification of PCa patients.
Affiliation(s)
- Souzana Logotheti
  - Department of Pathology, University of Patras, 26504 Patras, Greece
- Eugenia Papadaki
  - Department of Pathology, University of Patras, 26504 Patras, Greece
  - Department of Informatics, Ionian University, 49100 Corfu, Greece
- Vasiliki Zolota
  - Department of Pathology, University of Patras, 26504 Patras, Greece
- Christopher Logothetis
  - Department of Genitourinary Medical Oncology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Rama Soundararajan
  - Department of Translational Molecular Pathology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA
- Vasiliki Tzelepi
  - Department of Pathology, University of Patras, 26504 Patras, Greece
16
Lu J, Shen J, Xiong B, Ma W, Staab S, Yang C. HiPrompt: Few-Shot Biomedical Knowledge Fusion via Hierarchy-Oriented Prompting. International ACM SIGIR Conference on Research and Development in Information Retrieval 2023; 2023:2052-2056. [PMID: 38352127 PMCID: PMC10863609 DOI: 10.1145/3539618.3591997] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/16/2024]
Abstract
Medical decision-making processes can be enhanced by comprehensive biomedical knowledge bases, which require fusing knowledge graphs constructed from different sources via a uniform index system. The index system often organizes biomedical terms in a hierarchy to provide the aligned entities with fine-grained granularity. To address the challenge of scarce supervision in the biomedical knowledge fusion (BKF) task, researchers have proposed various unsupervised methods. However, these methods heavily rely on ad-hoc lexical and structural matching algorithms, which fail to capture the rich semantics conveyed by biomedical entities and terms. Recently, neural embedding models have proved effective in semantic-rich tasks, but they rely on sufficient labeled data to be adequately trained. To bridge the gap between the scarce-labeled BKF and neural embedding models, we propose HiPrompt, a supervision-efficient knowledge fusion framework that elicits the few-shot reasoning ability of large language models through hierarchy-oriented prompts. Empirical results on the collected KG-Hi-BKF benchmark datasets demonstrate the effectiveness of HiPrompt.
Affiliation(s)
- Bo Xiong
  - University of Stuttgart, Germany
- Steffen Staab
  - University of Stuttgart, Germany; University of Southampton, UK
|