1
Zhang Q, Wu X, Li X, Ma W, Wu T, Li L, Hu F, Xie Y, Wu X. TransAnno-Net: A Deep Learning Framework for Accurate Cell Type Annotation of Mouse Lung Tissue Using Self-supervised Pretraining. Computer Methods and Programs in Biomedicine 2025; 267:108809. [PMID: 40315689 DOI: 10.1016/j.cmpb.2025.108809] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2025] [Revised: 04/05/2025] [Accepted: 04/23/2025] [Indexed: 05/04/2025]
Abstract
BACKGROUND Single-cell RNA sequencing (scRNA-seq) has become a significant tool for addressing complex issues in biology. In scRNA-seq analysis, it is imperative to accurately determine the type of each cell. However, conventional supervised or semi-supervised methods depend on expert labels and incur substantial labeling costs. In contrast, self-supervised pre-training strategies leverage unlabeled data during the pre-training phase and use a limited amount of labeled data in the fine-tuning phase, thereby greatly reducing labor costs. Furthermore, fine-tuning does not need to learn the feature representations from scratch, which enhances the efficiency and transferability of the model. METHODS TransAnno-Net is a deep learning framework based on transfer learning and a Transformer architecture, designed for efficient and accurate cell type annotation in large-scale scRNA-seq datasets of mouse lung. Specifically, TransAnno-Net is pre-trained on scRNA-seq lung data from approximately 100,000 cells to acquire gene-gene similarities via self-supervised learning. It is then transferred to a relatively small number of datasets and fine-tuned for specific cell type annotation tasks. To address the imbalance in cell types commonly observed in scRNA-seq data, a random oversampling technique is applied to the fine-tuning dataset, mitigating the impact of distributional imbalance on the annotation outcomes. RESULTS TransAnno-Net achieves AUCs of 0.979, 0.901, and 0.982 on three mouse lung datasets, outperforming eight state-of-the-art (SOTA) methods. In addition, TransAnno-Net demonstrates robust performance on cross-organ, cross-platform datasets, and is competitive with the fully supervised learning-based method.
CONCLUSION TransAnno-Net is a highly effective cross-platform and cross-dataset single-cell type annotation method for mouse lung tissues and supports cross-organ cell type annotation. This approach is expected to enhance the efficiency of research on the biological mechanisms of complex biological systems and diseases.
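The random oversampling step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the function name `random_oversample`, the toy expression vectors, and the cell type labels are all assumptions made for illustration. The idea is simply to duplicate randomly chosen cells of the rarer types until every type is as frequent as the largest one.

```python
import random
from collections import Counter


def random_oversample(cells, labels, seed=0):
    """Duplicate randomly chosen minority-class cells until every class
    is as large as the largest class (hypothetical helper)."""
    rng = random.Random(seed)
    by_label = {}
    for cell, label in zip(cells, labels):
        by_label.setdefault(label, []).append(cell)
    target = max(len(group) for group in by_label.values())

    out_cells, out_labels = [], []
    for label, group in by_label.items():
        # Draw the missing cells with replacement from the same class.
        extra = [rng.choice(group) for _ in range(target - len(group))]
        for cell in group + extra:
            out_cells.append(cell)
            out_labels.append(label)
    return out_cells, out_labels


# Toy expression vectors and labels, invented for this example.
cells = [[5, 0, 1], [4, 1, 0], [6, 0, 2], [0, 7, 3]]
labels = ["AT2", "AT2", "AT2", "Club"]
balanced_cells, balanced_labels = random_oversample(cells, labels)
print(Counter(balanced_labels))
```

Oversampling of this kind is normally applied only to the training split, so that duplicated cells do not leak into the evaluation data.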
Affiliation(s)
- Qing Zhang
- School of Computer Science and Engineering, Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, 430205, Hubei, PR China
- Xiaoxiao Wu
- School of Computer Science and Engineering, Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, 430205, Hubei, PR China
- Xiang Li
- School of Computer Science and Engineering, Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, 430205, Hubei, PR China
- Wei Ma
- School of Computer Science and Engineering, Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, 430205, Hubei, PR China
- Tongquan Wu
- School of Mechanical Engineering, Hefei University of Technology, Hefei, PR China
- Liuyue Li
- Faculty of Applied Science and Engineering, University of Toronto, Canada
- Fan Hu
- UNISKIN Research Institute on Skin Aging, Inertia Shanghai Biotechnology Co., Ltd., Shanghai, PR China; DermaHealth Shanghai Biotechnology Co., Ltd., Shanghai, PR China
- Yicheng Xie
- Department of Dermatology, Children's Hospital, Zhejiang University School of Medicine, National Clinical Research Center for Child Health, Hangzhou, 310052, Zhejiang Province, China.
- Xinglong Wu
- School of Computer Science and Engineering, Hubei Key Laboratory of Intelligent Robot, Wuhan Institute of Technology, Wuhan, 430205, Hubei, PR China.
2
Shen WK, Zhang CY, Gu YM, Luo T, Chen SY, Yue T, Xie GY, Liao Y, Yuan Y, Lei Q, Guo AY. An automatic annotation tool and reference database for T cell subtypes and states at single-cell resolution. Sci Bull (Beijing) 2025; 70:1659-1672. [PMID: 40157887 DOI: 10.1016/j.scib.2025.02.043] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/11/2024] [Revised: 01/08/2025] [Accepted: 02/28/2025] [Indexed: 04/01/2025]
Abstract
T cells have various subtypes and states with different functions. However, a reference list and an automated annotation tool for T cell subtypes and states, both critical for analyzing and comparing T cells under various conditions, are lacking. We constructed the largest human T cell reference, containing 1,348,268 T cells from 35 conditions and 16 tissues. We classified T cells into 33 subtypes and further stratified them into 68 categories according to subtype and state. Based on this reference, we developed a tool named STCAT to automatically annotate T cells from scRNA-seq data by hierarchical models and marker correction. The accuracy of STCAT was 28% higher than that of existing tools validated on six independent datasets, including cancer and healthy samples. Using STCAT, we consistently discovered that CD4+ Th17 cells were enriched in late-stage lung cancer patients in multiple datasets, whereas MAIT cells were prevalent in milder-stage COVID-19 patients. We also confirmed a decrease in Treg cytotoxicity in post-treatment ovarian cancer. Systematic landscape analyses of CD4+ and CD8+ T cell references revealed that CD4+ Treg cells were enriched in tumor samples and that CD8+ naive-related cells were abundant in healthy individuals. Finally, we deposited all the T cell references and annotations into the TCellAtlas database (https://guolab.wchscu.cn/TCellAtlas), which allows users to browse T cell expression profiles and analyze customized scRNA-seq data with STCAT. In conclusion, this comprehensive reference of human T cell subtypes and states, the automated annotation tool, and the database will greatly facilitate research on T cell immunity and tumor immunology.
Affiliation(s)
- Wen-Kang Shen
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China; Hubei Bioinformatics & Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Chu-Yu Zhang
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China; Hubei Bioinformatics & Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Yi-Min Gu
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China
- Tao Luo
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China; College of Life Sciences, University of Chinese Academy of Sciences, Beijing 101408, China
- Si-Yi Chen
- Department of Rheumatology & Immunology Union Hospital, Tongji Medical College, Huazhong University of Science and Technology, Wuhan 430074, China
- Tao Yue
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China; Hubei Bioinformatics & Molecular Imaging Key Laboratory, College of Life Science and Technology, Huazhong University of Science and Technology, Wuhan 430074, China
- Gui-Yan Xie
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China
- Yu Liao
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China
- Yong Yuan
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China.
- Qian Lei
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China.
- An-Yuan Guo
- Department of Thoracic Surgery, West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu 610041, China.
3
Suo Z, Pan B, Shi H, Ma L, Zheng Y, Xu W, Lin L, Zhang E, Wang L, Zhang M, Qu Y, Zheng H, Gao X, Ni C. HL-BscPF: Hybrid learning facilitates brain cell auto-identification in multiple pathologies. Life Sci 2025:123751. [PMID: 40414555 DOI: 10.1016/j.lfs.2025.123751] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2025] [Revised: 05/04/2025] [Accepted: 05/20/2025] [Indexed: 05/27/2025]
Abstract
AIMS The rapidly growing scale and complexity of single-cell transcriptomic data in brain research make it increasingly difficult for traditional methods to extract meaningful insights efficiently, highlighting the need for artificial intelligence. MATERIALS AND METHODS We presented the Hybrid Learning-based Brain single-cell Prediction Framework (HL-BscPF), designed to automate cell type classification and reveal disease-related pathways in the brain. HL-BscPF integrates ItClust and TOSICA models, combining autoencoder-based dimensionality reduction with transformer architecture to enhance predictive accuracy. HL-BscPF was evaluated using brain scRNA-seq datasets representing various neuropathological states, and its predictive performance was benchmarked against ground-truth annotations. KEY FINDINGS Applied to four brain-specific single-cell datasets, including aging, Alzheimer's disease, postoperative cognitive dysfunction, and stroke, HL-BscPF accurately classified cell types and uncovered key functional alterations in neuronal and glial populations. TOSICA showed higher accuracy in large-scale datasets due to its multi-head self-attention capabilities, whereas ItClust performed optimally in cases with lower cell diversity, demonstrating their complementary strengths. By providing precise cell identification and novel insights into brain-specific pathway dysregulation, HL-BscPF offers a powerful tool for extracting meaningful insights from vast single-cell datasets, enabling a deeper understanding of the complex neuropathologies. SIGNIFICANCE HL-BscPF demonstrates exceptional accuracy and interpretability in cell type annotation and functional analysis, uncovering critical disease-related mechanisms. This framework offers a powerful tool for advancing single-cell research in brain pathologies.
Affiliation(s)
- Zizheng Suo
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Bocheng Pan
- Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, PR China
- Hailong Shi
- Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, PR China
- Linhui Ma
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Yuxiang Zheng
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Wenjie Xu
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Lina Lin
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Enze Zhang
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Lijuan Wang
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Mingzhu Zhang
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Yinyin Qu
- Department of Anesthesiology, Peking University Third Hospital, Beijing 100191, PR China
- Hui Zheng
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China
- Xingyu Gao
- Institute of Microelectronics, Chinese Academy of Sciences, Beijing 100029, PR China.
- Cheng Ni
- Department of anesthesiology, National Cancer Center / National Clinical Research Center for Cancer / Cancer hospital, Chinese Academy of Medical Sciences and Peking union medical college, Beijing 100021, PR China.
4
Li H, Zhang MJ, Zhang B, Lin WP, Li SJ, Xiong D, Wang Q, Wang WD, Yang QC, Huang CF, Deng WW, Sun ZJ. Mature tertiary lymphoid structures evoke intra-tumoral T and B cell responses via progenitor exhausted CD4+ T cells in head and neck cancer. Nat Commun 2025; 16:4228. [PMID: 40335494 PMCID: PMC12059173 DOI: 10.1038/s41467-025-59341-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2024] [Accepted: 04/18/2025] [Indexed: 05/09/2025]
Abstract
Tumor tertiary lymphoid structures (TLS), especially mature TLS (mTLS), have been associated with better prognosis and improved responses to immune checkpoint blockade (ICB), but the underlying mechanisms remain incompletely understood. Here, by performing single-cell RNA and antigen receptor sequencing and spatial transcriptomics on tumor tissue from head and neck squamous cell carcinoma (HNSCC) patients with different statuses of TLS, we observe that mTLS are enriched with stem-like T cells and B cells at various maturation stages. Notably, progenitor exhausted CD4+ T cells, with features resembling follicular helper T cells, support these responses by activating B cells to produce plasma cells in the germinal center and by interacting with DC-LAMP+ dendritic cells to support CD8+ T cell activation. Conversely, non-mTLS tumors, which are rich in immunosuppressive cells or lack stem-like B and T cells, do not promote local anti-tumor immunity. Furthermore, patients with mTLS manifest improved overall survival and response to ICB compared to those with non-mTLS. Overall, our study provides insights into mechanisms underlying mTLS-mediated intra-tumoral immunity events against cancer.
Affiliation(s)
- Hao Li
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Department of Oral Maxillofacial-Head Neck Oncology, School & Hospital of Stomatology, Wuhan University, Wuhan, China
- Meng-Jie Zhang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Boxin Zhang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Wen-Ping Lin
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Shu-Jin Li
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Dian Xiong
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Qing Wang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Wen-Da Wang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Qi-Chao Yang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Cong-Fa Huang
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China
- Wei-Wei Deng
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China.
- Department of Oral Maxillofacial-Head Neck Oncology, School & Hospital of Stomatology, Wuhan University, Wuhan, China.
- Zhi-Jun Sun
- State Key Laboratory of Oral & Maxillofacial Reconstruction and Regeneration, Key Laboratory of Oral Biomedicine Ministry of Education, Hubei Key Laboratory of Stomatology, School & Hospital of Stomatology, Frontier Science Center for Immunology and Metabolism, Taikang Center for Life and Medical Sciences, Wuhan University, Wuhan, China.
- Department of Oral Maxillofacial-Head Neck Oncology, School & Hospital of Stomatology, Wuhan University, Wuhan, China.
5
Wang J, Ye F, Chai H, Jiang Y, Wang T, Ran X, Xia Q, Xu Z, Fu Y, Zhang G, Wu H, Guo G, Guo H, Ruan Y, Wang Y, Xing D, Xu X, Zhang Z. Advances and applications in single-cell and spatial genomics. Science China Life Sciences 2025; 68:1226-1282. [PMID: 39792333 DOI: 10.1007/s11427-024-2770-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2024] [Accepted: 10/10/2024] [Indexed: 01/12/2025]
Abstract
The applications of single-cell and spatial technologies in recent times have revolutionized the present understanding of cellular states and the cellular heterogeneity inherent in complex biological systems. These advancements offer unprecedented resolution in the examination of the functional genomics of individual cells and their spatial context within tissues. In this review, we have comprehensively discussed the historical development and recent progress in the field of single-cell and spatial genomics. We have reviewed the breakthroughs in single-cell multi-omics technologies, spatial genomics methods, and the computational strategies employed toward the analyses of single-cell atlas data. Furthermore, we have highlighted the advances made in constructing cellular atlases and their clinical applications, particularly in the context of disease. Finally, we have discussed the emerging trends, challenges, and opportunities in this rapidly evolving field.
Affiliation(s)
- Jingjing Wang
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Fang Ye
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Haoxi Chai
- Life Sciences Institute and The Second Affiliated Hospital, Zhejiang University, Hangzhou, 310058, China
- Yujia Jiang
- BGI Research, Shenzhen, 518083, China
- BGI Research, Hangzhou, 310030, China
- Teng Wang
- Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, 100871, China
- Xia Ran
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Institute of Hematology, Zhejiang University, Hangzhou, 310000, China
- Qimin Xia
- Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China
- Ziye Xu
- Department of Laboratory Medicine of The First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Yuting Fu
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Guodong Zhang
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Hanyu Wu
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China
- Guoji Guo
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China.
- Center for Stem Cell and Regenerative Medicine, Zhejiang University School of Medicine, Hangzhou, 310058, China.
- Zhejiang Provincial Key Lab for Tissue Engineering and Regenerative Medicine, Dr. Li Dak Sum & Yip Yio Chin Center for Stem Cell and Regenerative Medicine, Hangzhou, 310058, China.
- Institute of Hematology, Zhejiang University, Hangzhou, 310000, China.
- Hongshan Guo
- Bone Marrow Transplantation Center of the First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China.
- Institute of Hematology, Zhejiang University, Hangzhou, 310000, China.
- Yijun Ruan
- Life Sciences Institute and The Second Affiliated Hospital, Zhejiang University, Hangzhou, 310058, China.
- Yongcheng Wang
- Department of Laboratory Medicine of The First Affiliated Hospital & Liangzhu Laboratory, Zhejiang University School of Medicine, Hangzhou, 310058, China.
- Dong Xing
- Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China.
- Beijing Advanced Innovation Center for Genomics (ICG), Peking University, Beijing, 100871, China.
- Xun Xu
- BGI Research, Shenzhen, 518083, China.
- BGI Research, Hangzhou, 310030, China.
- Guangdong Provincial Key Laboratory of Genome Read and Write, BGI Research, Shenzhen, 518083, China.
- Zemin Zhang
- Biomedical Pioneering Innovation Center (BIOPIC) and School of Life Sciences, Peking University, Beijing, 100871, China.
6
Dai Q, Liu W, Yu X, Duan X, Liu Z. Self-Supervised Graph Representation Learning for Single-Cell Classification. Interdiscip Sci 2025:10.1007/s12539-025-00700-y. [PMID: 40180773 DOI: 10.1007/s12539-025-00700-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/24/2024] [Revised: 03/02/2025] [Accepted: 03/04/2025] [Indexed: 04/05/2025]
Abstract
Accurately identifying cell types in single-cell RNA sequencing data is critical for understanding cellular differentiation and pathological mechanisms in downstream analysis. As traditional biological approaches are laborious and time-intensive, it is imperative to develop computational biology methods for cell classification. However, it remains a challenge for existing methods to adequately utilize the potential gene expression information within the vast amount of unlabeled cell data, which limits their classification and generalization performance. Therefore, we propose a novel self-supervised graph representation learning framework for single-cell classification, named scSSGC. Specifically, in the pre-training stage of self-supervised learning, multiple K-means clustering tasks conducted on unlabeled cell data are jointly employed for model training, thereby mitigating the issue of limited labeled data. To effectively capture the potential interactions among cells, we introduce a locally augmented graph neural network to enhance the information aggregation capability for nodes with fewer neighbors in the cell graph. A range of benchmark experiments demonstrates that scSSGC outperforms existing state-of-the-art cell classification methods. More importantly, scSSGC provides stable performance in cross-dataset settings, indicating better generalization ability.
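As a rough illustration of how several K-means tasks could supply pseudo-labels for pre-training on unlabeled cells, the sketch below runs plain Lloyd's K-means at more than one cluster count and keeps one label vector per run. It is an assumption-laden toy, not scSSGC itself: the abstract does not say how the tasks differ, so varying `k` is a guess, and the names `kmeans_labels` and `pseudo_label_tasks` and the 2-D points are invented for the example.

```python
import random


def kmeans_labels(points, k, iters=20, seed=0):
    """Plain Lloyd's K-means; returns one cluster index per point."""
    rng = random.Random(seed)
    centers = [list(p) for p in rng.sample(points, k)]
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest center by squared Euclidean distance.
        labels = [
            min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])),
            )
            for p in points
        ]
        # Update step: move each center to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return labels


def pseudo_label_tasks(points, ks=(2, 3)):
    """One pseudo-label vector per cluster count (assumed task design);
    each vector could supervise its own classification head."""
    return {k: kmeans_labels(points, k) for k in ks}


# Toy 2-D embeddings standing in for unlabeled cells.
points = [[0.0, 0.1], [0.2, 0.0], [5.0, 5.1], [5.2, 4.9], [9.9, 0.0], [10.1, 0.2]]
tasks = pseudo_label_tasks(points)
for k, labs in tasks.items():
    print(k, labs)
```

In a joint setup of this kind, the losses of the per-task heads would typically be summed so that the shared encoder has to satisfy all clusterings at once.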
Affiliation(s)
- Qiguo Dai
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China.
- SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China.
- Wuhao Liu
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China
- SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
- Xianhai Yu
- School of Computer Science and Engineering, Dalian Minzu University, Dalian, 116650, China
- SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
- Xiaodong Duan
- SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
- Ziqiang Liu
- SEAC Key Laboratory of Big Data Applied Technology, Dalian Minzu University, Dalian, 116650, China
- Hangzhou Institute of Medicine, Chinese Academy of Sciences, Hangzhou, 310018, China
7
Traversa D, Chiara M. Mapping Cell Identity from scRNA-seq: A primer on computational methods. Comput Struct Biotechnol J 2025; 27:1559-1569. [PMID: 40270709 PMCID: PMC12017876 DOI: 10.1016/j.csbj.2025.03.051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2024] [Revised: 03/29/2025] [Accepted: 03/31/2025] [Indexed: 04/25/2025]
Abstract
Single cell (sc) technologies mark a conceptual and methodological breakthrough in the way we study cells, the basic units of life. Thanks to these technological developments, large-scale initiatives aimed at mapping all the cell types in the human body are currently ongoing, with the ambitious aim of gaining a cell-level resolution of physiological development and disease. Owing to its broad applicability and ease of interpretation, scRNA-seq is probably the most common sc-based application. This assay uses high-throughput RNA sequencing to capture gene expression profiles at the sc-level. Subsequently, under the assumption that differences in transcriptional programs correspond to distinct cellular identities, ad-hoc computational methods are used to infer cell types from gene expression patterns. A wide array of computational methods has been developed for this task. However, depending on the underlying algorithmic approach and associated computational requirements, each method might have a specific range of application, with implications that are not always clear to the end user. Here we provide a concise overview of state-of-the-art computational methods for cell identity annotation in scRNA-seq, tailored for new users and non-computational scientists. To this end, we classify existing tools into five main categories and discuss their key strengths, limitations and range of application.
Affiliation(s)
- Daniele Traversa
- Department of Biosciences, Università degli Studi di Milano, via Celoria 26, Milan 20133, Italy
- Matteo Chiara
- Department of Biosciences, Università degli Studi di Milano, via Celoria 26, Milan 20133, Italy
8
Sujana STA, Shahjaman M, Singha AC. Application of bioinformatic tools in cell type classification for single-cell RNA-seq data. Comput Biol Chem 2025; 115:108332. [PMID: 39793515 DOI: 10.1016/j.compbiolchem.2024.108332] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/03/2024] [Revised: 12/06/2024] [Accepted: 12/24/2024] [Indexed: 01/13/2025]
Abstract
The advancements in single-cell RNA sequencing (scRNA-seq) technology have significantly transformed genomics research, enabling the handling of thousands of cells in each experiment. As of now, 32,068 research studies have been cataloged in the PubMed database. The primary aims of scRNA-seq investigations are to identify cell types, understand the antitumor immune response, and identify new and uncommon cell types. Traditional techniques for identifying cell types include microscopy, histology, and pathological characteristics. However, the complexity of instruments and the need for precise experimental design make it difficult to fully capture the overall heterogeneity. Unsupervised clustering and supervised classification methods have been used to solve this task. Supervised cell type classification methods have gained popularity as large-scale, high-quality, well-annotated data have become available and because they give more robust results than clustering methods. A recent study showed that the support vector machine (SVM) gives high-quality classification performance in different scenarios. In this article, we compare and evaluate the performance of four different SVM kernels (sigmoid, linear, radial, polynomial). The results of the experiments on three standard scRNA-seq datasets indicate that SVM with the linear kernel and SVM with the sigmoid kernel classify the cells more accurately (approx. 99%), and that the linear kernel has a remarkably fast computation time. We also evaluate the results using several evaluation metrics: F1 score, MCC, and AUC. Additionally, the article sheds light on the potential of SVM kernels to reveal the underlying information in single-cell RNA-seq data more effectively.
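Two of the evaluation metrics named in the abstract, the F1 score and the Matthews correlation coefficient (MCC), follow directly from the confusion matrix. The sketch below shows the standard binary definitions on made-up labels; the function name `binary_f1_and_mcc` and the label vectors are illustrative assumptions, and the article's own evaluation (which is multi-class and also reports AUC) may be set up differently.

```python
import math


def binary_f1_and_mcc(y_true, y_pred, positive=1):
    """Return (F1, MCC) for binary labels from confusion-matrix counts."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    # F1 is the harmonic mean of precision and recall.
    f1_denom = 2 * tp + fp + fn
    f1 = 2 * tp / f1_denom if f1_denom else 0.0

    # MCC uses all four cells of the confusion matrix.
    mcc_denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / mcc_denom if mcc_denom else 0.0
    return f1, mcc


# Made-up labels: 1 marks the cell type of interest, 0 everything else.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 1, 0, 0, 0, 0, 1, 1]
f1, mcc = binary_f1_and_mcc(y_true, y_pred)
print(f1, mcc)  # 0.75 0.5
```

MCC is often preferred alongside F1 for imbalanced cell type distributions because it also accounts for true negatives.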
Affiliation(s)
- Shah Tania Akter Sujana
  - Bioinformatics Lab, Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh
- Md Shahjaman
  - Bioinformatics Lab, Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh
- Atul Chandra Singha
  - Bioinformatics Lab, Department of Statistics, Begum Rokeya University, Rangpur 5404, Bangladesh
|
9
|
Hu H, Guo Y, Ge F, Yin H, Zhang H, Zhou Z, Yan F, Ye Q, Wu J, Cao J, Hsieh C, Yang B. UniMap: Type-Level Integration Enhances Biological Preservation and Interpretability in Single-Cell Annotation. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2025; 12:e2410790. [PMID: 40013940 PMCID: PMC12021081 DOI: 10.1002/advs.202410790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2024] [Revised: 01/22/2025] [Indexed: 02/28/2025]
Abstract
Integrating single-cell datasets from multiple studies provides a cost-effective way to build comprehensive cell atlases, granting deeper insights into cellular characteristics across diverse biological systems. However, current data integration methods struggle with interference in partially overlapping datasets and varying annotation granularities. Here, we introduce a multiselective adversarial network for the first time and present UniMap, which functions as a "discerner" to identify and exclude interfering cells from various data sources during dataset integration. Compared to other state-of-the-art methods, UniMap emphasizes type-level integration and proves to be the best model for preserving biological variability, achieving noticeably higher accuracy in single-cell automated annotation under various circumstances. Additionally, it enhances interpretability by revealing shared and domain-specific cell types and providing prediction confidence. The efficacy of UniMap is demonstrated in identifying new cell types, creating high-resolution cell atlases, annotating cells along developmental trajectories, and performing cross-species analysis, underscoring its potential as a robust tool for single-cell research.
Affiliation(s)
- Haitao Hu
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - Polytechnic Institute of Zhejiang University, Zhejiang University, Hangzhou 310015, China
- Yue Guo
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Fujing Ge
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Hao Yin
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - Polytechnic Institute of Zhejiang University, Zhejiang University, Hangzhou 310015, China
- Hao Zhang
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - Polytechnic Institute of Zhejiang University, Zhejiang University, Hangzhou 310015, China
- Zhesheng Zhou
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Fangjie Yan
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
- Qing Ye
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- Jialu Wu
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
- Ji Cao
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - The Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou 310018, China
  - Engineering Research Center of Innovative Anticancer Drugs, Ministry of Education, Hangzhou 310000, China
  - Center for Medical Research and Innovation in Digestive System Tumors, Ministry of Education, Hangzhou 310020, China
- Chang-Yu Hsieh
  - College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, Zhejiang 310058, P. R. China
  - The Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou 310018, China
- Bo Yang
  - Institute of Pharmacology and Toxicology, Zhejiang Province Key Laboratory of Anti-Cancer Drug Research, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou 310058, China
  - The Innovation Institute for Artificial Intelligence in Medicine, Zhejiang University, Hangzhou 310018, China
  - Engineering Research Center of Innovative Anticancer Drugs, Ministry of Education, Hangzhou 310000, China
  - School of Medicine, Hangzhou City University, Hangzhou 310015, China
|
10
|
Guo F, Guan R, Li Y, Liu Q, Wang X, Yang C, Wang J. Foundation models in bioinformatics. Natl Sci Rev 2025; 12:nwaf028. [PMID: 40078374 PMCID: PMC11900445 DOI: 10.1093/nsr/nwaf028] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 07/30/2024] [Revised: 12/17/2024] [Accepted: 01/08/2025] [Indexed: 03/14/2025] Open
Abstract
With the adoption of foundation models (FMs), artificial intelligence (AI) has become increasingly significant in bioinformatics and has successfully addressed many historical challenges, such as pre-training frameworks, model evaluation and interpretability. FMs demonstrate notable proficiency in managing large-scale, unlabeled datasets, because experimental procedures are costly and labor intensive. In various downstream tasks, FMs have consistently achieved noteworthy results, demonstrating high levels of accuracy in representing biological entities. A new era in computational biology has been ushered in by the application of FMs, focusing on both general and specific biological issues. In this review, we introduce recent advancements in bioinformatics FMs employed in a variety of downstream tasks, including genomics, transcriptomics, proteomics, drug discovery and single-cell analysis. Our aim is to assist scientists in selecting appropriate FMs in bioinformatics, according to four model types: language FMs, vision FMs, graph FMs and multimodal FMs. In addition to understanding molecular landscapes, AI technology can establish the theoretical and practical foundation for continued innovation in molecular biology.
Affiliation(s)
- Fei Guo
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Xiangjiang Laboratory, Changsha 410083, China
- Renchu Guan
  - Key Laboratory for Symbol Computation and Knowledge Engineering of the Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun 130012, China
- Yaohang Li
  - Department of Computer Science, Old Dominion University, Norfolk 23529, USA
- Qi Liu
  - School of Life Sciences and Technology, Tongji University, Shanghai 200092, China
- Xiaowo Wang
  - Department of Automation, Tsinghua University, Beijing 100084, China
- Can Yang
  - Department of Mathematics, State Key Laboratory of Molecular Neuroscience, and Big Data Bio-Intelligence Lab, The Hong Kong University of Science and Technology, Hong Kong, China
- Jianxin Wang
  - Hunan Provincial Key Lab on Bioinformatics, School of Computer Science and Engineering, Central South University, Changsha 410083, China
  - Xiangjiang Laboratory, Changsha 410083, China
|
11
|
Chen Y, Zou J. Simple and effective embedding model for single-cell biology built from ChatGPT. Nat Biomed Eng 2025; 9:483-493. [PMID: 39643729 DOI: 10.1038/s41551-024-01284-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/29/2023] [Accepted: 10/16/2024] [Indexed: 12/09/2024]
Abstract
Large-scale gene-expression data are being leveraged to pretrain models that implicitly learn gene and cellular functions. However, such models require extensive data curation and training. Here we explore a much simpler alternative: leveraging ChatGPT embeddings of genes based on the literature. We used GPT-3.5 to generate gene embeddings from text descriptions of individual genes and to then generate single-cell embeddings by averaging the gene embeddings weighted by each gene's expression level. We also created a sentence embedding for each cell by using only the gene names ordered by their expression level. On many downstream tasks used to evaluate pretrained single-cell embedding models (particularly tasks of gene-property and cell-type classification), our model, which we named GenePT, achieved comparable or better performance than models pretrained from gene-expression profiles of millions of cells. GenePT shows that large-language-model embeddings of the literature provide a simple and effective path to encoding single-cell biological knowledge.
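The two cell representations described in this abstract can be sketched in a few lines of plain Python. This is a minimal sketch, not the GenePT code: the function names, the two-dimensional toy embeddings, and the expression values are hypothetical, and ordering gene names from highest to lowest expression is an assumption, since the abstract does not state the direction.

```python
def cell_embedding(expression, gene_embeddings):
    """Average gene embeddings, weighted by each gene's expression level.

    expression: dict mapping gene name -> expression level for one cell.
    gene_embeddings: dict mapping gene name -> embedding (list of floats).
    """
    genes = [g for g in expression if g in gene_embeddings and expression[g] > 0]
    total = sum(expression[g] for g in genes)
    if total == 0:
        raise ValueError("no expressed gene has an embedding")
    dim = len(next(iter(gene_embeddings.values())))
    out = [0.0] * dim
    for g in genes:
        weight = expression[g] / total  # weights sum to 1 across expressed genes
        for i, value in enumerate(gene_embeddings[g]):
            out[i] += weight * value
    return out


def cell_sentence(expression):
    """Join names of expressed genes, highest expression first (assumed order)."""
    ranked = sorted(expression.items(), key=lambda kv: -kv[1])
    return " ".join(gene for gene, level in ranked if level > 0)


# Hypothetical toy inputs; real text-derived embeddings have far more dimensions.
gene_embeddings = {"CD3E": [1.0, 0.0], "MS4A1": [0.0, 1.0], "ACTB": [1.0, 1.0]}
expression = {"CD3E": 3.0, "MS4A1": 0.0, "ACTB": 1.0}

print(cell_embedding(expression, gene_embeddings))
print(cell_sentence(expression))
```

In the approach the abstract describes, the sentence would then be passed to a text embedding model; that step is left out here because it depends on an external service.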
Affiliation(s)
- Yiqun Chen
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- James Zou
  - Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
  - Department of Electrical Engineering, Stanford University, Stanford, CA, USA
  - Department of Computer Science, Stanford University, Stanford, CA, USA
|
12
|
Zou Z, Liu Y, Bai Y, Luo J, Zhang Z. scTrans: Sparse attention powers fast and accurate cell type annotation in single-cell RNA-seq data. PLoS Comput Biol 2025; 21:e1012904. [PMID: 40184563 PMCID: PMC11970913 DOI: 10.1371/journal.pcbi.1012904] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/02/2024] [Accepted: 02/24/2025] [Indexed: 04/06/2025] Open
Abstract
Cell type annotation is crucial in single-cell RNA sequencing data analysis because it enables significant biological discoveries and deepens our understanding of tissue biology. Given the high-dimensional and highly sparse nature of single-cell RNA sequencing data, most existing annotation tools focus on highly variable genes to reduce dimensionality and computational load. However, this approach inevitably results in information loss, potentially weakening the model's generalization performance and adaptability to novel datasets. To mitigate this issue, we developed scTrans, a single-cell Transformer-based model, which employs sparse attention to utilize all non-zero genes, thereby effectively reducing the input data dimensionality while minimizing information loss. We validated the speed and accuracy of scTrans by performing cell type annotation on 31 different tissues within the Mouse Cell Atlas. Remarkably, even with datasets nearing a million cells, scTrans efficiently performs cell type annotation with limited computational resources. Furthermore, scTrans demonstrates strong generalization capabilities, accurately annotating cells in novel datasets and generating high-quality latent representations, which are essential for precise clustering and trajectory analysis.
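The idea of feeding a model only the non-zero genes of each cell can be illustrated with a small preprocessing sketch. This is not the scTrans implementation: the helper names, the padding index of -1, and the toy expression rows are assumptions, and the attention layers themselves are omitted.

```python
def nonzero_tokens(expression_row):
    """Return (gene_index, value) pairs for the non-zero genes of one cell."""
    return [(i, v) for i, v in enumerate(expression_row) if v != 0]


def pad_batch(rows, pad_index=-1):
    """Pad per-cell non-zero gene lists to a common length.

    Returns gene ids, expression values, and a mask that is True for real
    genes and False for padding, so padded slots can be ignored downstream.
    """
    tokens = [nonzero_tokens(row) for row in rows]
    max_len = max(len(t) for t in tokens)
    gene_ids, values, mask = [], [], []
    for t in tokens:
        pad = max_len - len(t)
        gene_ids.append([i for i, _ in t] + [pad_index] * pad)
        values.append([v for _, v in t] + [0.0] * pad)
        mask.append([True] * len(t) + [False] * pad)
    return gene_ids, values, mask


# Hypothetical toy batch: two cells, six genes, mostly zeros.
rows = [
    [0, 2.0, 0, 0, 1.0, 0],
    [3.0, 0, 0, 0, 0, 0],
]
gene_ids, values, mask = pad_batch(rows)
print(gene_ids)
print(values)
print(mask)
```

With this layout the sequence length per cell is bounded by the largest number of expressed genes in the batch rather than by the full gene count, which is where the reduction in input dimensionality comes from.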
Affiliation(s)
- Zhiyi Zou
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Ying Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Yuting Bai
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Jiawei Luo
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, China
- Zhaolei Zhang
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, Ontario, Canada
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Department of Molecular Genetics, University of Toronto, Toronto, Ontario, Canada
|
13
|
Liu M, Zheng S, Li H, Budowle B, Wang L, Lou Z, Ge J. High resolution tissue and cell type identification via single cell transcriptomic profiling. PLoS One 2025; 20:e0318151. [PMID: 40138334 PMCID: PMC11940611 DOI: 10.1371/journal.pone.0318151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2024] [Accepted: 01/11/2025] [Indexed: 03/29/2025] Open
Abstract
Tissue identification can be instrumental in reconstructing a crime scene but remains a challenging task in forensic investigations. Conventionally, identifying the presence of a particular tissue in a tissue mixture using predefined cell type markers in bulk fashion is challenging due to limitations in sensitivity and accuracy. In contrast, single-cell RNA sequencing (scRNA-Seq) is a promising technology that has the potential to enhance or even revolutionize tissue and cell type identification. In this study, we developed a highly sensitive, general-purpose single-cell annotation pipeline, scTissueID, to accurately evaluate single-cell profile quality and precisely determine cell and tissue types based on scRNA profiles. By incorporating a crucial and unique reference cell quality differentiation phase that targets only high-confidence cells as reference, scTissueID achieved better and more consistent performance in determining cell and tissue types than 8 state-of-the-art single-cell annotation pipelines and 6 widely adopted machine learning algorithms, as demonstrated through a large-scale and comprehensive comparison study using both forensic-relevant and Human Cell Atlas (HCA) data. We highlighted the significance of cell quality differentiation, a previously undervalued factor. Thus, this study offers a tool capable of accurately and efficiently identifying cell and tissue types, with broad applicability to forensic investigations and other biomedical research endeavors.
Affiliation(s)
- Muyi Liu
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, Texas, United States of America
  - Department of Cell Biology, Harvard Medical School, Boston, Massachusetts, United States of America
- Suilan Zheng
  - Department of Chemistry, Purdue University, West Lafayette, Indiana, United States of America
- Hongmin Li
  - Department of Computer Science, California State University, East Bay, Hayward, California, United States of America
- Bruce Budowle
  - Department of Forensic Medicine, University of Helsinki, Finland
- Le Wang
  - Department of Electronic and Information Engineering, North China University of Technology, Beijing, China
- Zhaohuan Lou
  - School of Pharmaceutical Sciences, Zhejiang Chinese Medical University, Hangzhou, China
- Jianye Ge
  - Center for Human Identification, University of North Texas Health Science Center, Fort Worth, Texas, United States of America
|
14
|
Huang K, Tian J, Sun L, Hu H, Huang X, Zhou S, Deng A, Zhou Z, Jiang M, Li G, Xie P, Wang Y, Jiang X. TransGeneSelector: using a transformer approach to mine key genes from small transcriptomic datasets in plant responses to various environments. BMC Genomics 2025; 26:259. [PMID: 40098114 PMCID: PMC11912617 DOI: 10.1186/s12864-025-11434-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/18/2024] [Accepted: 03/04/2025] [Indexed: 03/19/2025] Open
Abstract
Gene mining is crucial for understanding the regulatory mechanisms underlying complex biological processes, particularly in plants responding to environmental conditions. Traditional machine learning methods, while useful, often overlook important gene relationships due to their reliance on manual feature selection and limited ability to capture complex inter-gene regulatory dynamics. Deep learning approaches, while powerful, are often unsuitable for small sample sizes. This study introduces TransGeneSelector, the first deep learning framework specifically designed for mining key genes from small transcriptomic datasets. By integrating a Wasserstein Generative Adversarial Network with Gradient Penalty (WGAN-GP) for sample generation and a Transformer-based network for classification, TransGeneSelector efficiently addresses the challenges of small-sample transcriptomic data, capturing both global gene regulatory interactions and specific biological processes. Evaluated in Arabidopsis thaliana, the model achieved high classification accuracy in predicting seed germination and heat stress conditions, outperforming traditional methods like Random Forest and Support Vector Machines (SVM). Moreover, Shapley Additive Explanations (SHAP) analysis and gene regulatory network construction revealed that TransGeneSelector effectively identified genes that appear to have upstream regulatory functions based on our analyses, enriching them in multiple key pathways which are critical for seed germination and heat stress response. RT-qPCR validation further confirmed the model's gene selection accuracy, demonstrating consistent expression patterns across varying germination conditions. The findings underscore the potential of TransGeneSelector as a robust tool for gene mining, offering deeper insights into gene regulation and organism adaptation under diverse environmental conditions. 
This work provides a framework that leverages deep learning for key gene identification in small transcriptomic datasets.
Affiliation(s)
- Kerui Huang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Jianhong Tian
  - College of Life Sciences, Hunan Normal University, Changsha, 410081, China
- Lei Sun
  - Key Laboratory of Research and Utilization of Ethnomedicinal Plant Resources of Hunan Province, College of Biological and Food Engineering, Huaihua University, Huaihua, 418000, China
- Haoliang Hu
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Xuebin Huang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Shiqi Zhou
  - Rice Research Institute of Jiangxi Academy of Agricultural Sciences, Nanchang, 330000, China
- Aihua Deng
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Zhibo Zhou
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Ming Jiang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Guiwu Li
  - College of Life Sciences, Hunan Normal University, Changsha, 410081, China
- Peng Xie
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Yun Wang
  - Key Laboratory of Agricultural Products Processing and Food Safety in Hunan Higher Education, Hunan University of Arts and Science, Changde, 415000, China
- Xiaocheng Jiang
  - College of Life Sciences, Hunan Normal University, Changsha, 410081, China
|
15
|
Aggarwal M, Cogan NG, Periwal V. Sensitivity based model agnostic scalable explanations of deep learning. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2025.02.21.639516. [PMID: 40093081 PMCID: PMC11908179 DOI: 10.1101/2025.02.21.639516] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/19/2025]
Abstract
Deep neural networks (DNNs) are powerful tools for data-driven predictive machine learning, but their complex architecture obscures mechanistic relations that they have learned from data. This information is critical to the scientific method of hypotheses development, experiment design, and model validation, especially when DNNs are used for biological and clinical predictions that affect human health. We design SensX, a model agnostic explainable AI (XAI) framework that outperformed current state-of-the-art XAI in accuracy (up to 52% higher) and computation time (up to 158 times faster), with higher consistency in all cases. It also determines an optimal subset of important input features, reducing dimensionality of further analyses. SensX scaled to explain vision transformer (ViT) models with more than 150,000 features, which is computationally infeasible for current state-of-the-art XAI. SensX validated that ViT models learned justifiable features as important for different facial attributes of different human faces. SensX revealed biases inherent to the ViT architecture, an observation possible only when importance of each feature is explained. We trained DNNs to annotate biological cell types using single-cell RNA-seq data and SensX determined the sets of genes that the DNNs learned to be important to different cell types.
Affiliation(s)
- N G Cogan
  - Department of Mathematics, Florida State University, Tallahassee, FL
|
16
|
Wang YR, Du PF. WCSGNet: a graph neural network approach using weighted cell-specific networks for cell-type annotation in scRNA-seq. Front Genet 2025; 16:1553352. [PMID: 40034748 PMCID: PMC11872911 DOI: 10.3389/fgene.2025.1553352] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/30/2024] [Accepted: 01/27/2025] [Indexed: 03/05/2025] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for understanding cellular heterogeneity, providing unprecedented resolution in molecular regulation analysis. Existing supervised learning approaches for cell type annotation primarily utilize gene expression profiles from scRNA-seq data. Although some methods incorporate gene interaction network information, they fail to use cell-specific gene association networks. This limitation overlooks the unique gene interaction patterns within individual cells, potentially compromising the accuracy of cell type classification. We introduce WCSGNet, a graph neural network-based algorithm for automatic cell-type annotation that leverages Weighted Cell-Specific Networks (WCSNs). These networks are constructed based on highly variable genes and inherently capture both gene expression patterns and gene association network structure features. Extensive experimental validation demonstrates that WCSGNet consistently achieves superior cell type classification performance, ranking among the top-performing methods while maintaining robust stability across diverse datasets. Notably, WCSGNet exhibits a distinct advantage in handling imbalanced datasets, outperforming existing methods in these challenging scenarios. All datasets and code for reproducing this work are deposited in a GitHub repository (https://github.com/Yi-ellen/WCSGNet).
Affiliation(s)
- Pu-Feng Du
  - College of Intelligence and Computing, Tianjin University, Tianjin, China
|
17
|
Zhao B, Song K, Wei DQ, Xiong Y, Ding J. scCobra allows contrastive cell embedding learning with domain adaptation for single cell data integration and harmonization. Commun Biol 2025; 8:233. [PMID: 39948393 PMCID: PMC11825689 DOI: 10.1038/s42003-025-07692-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/12/2024] [Accepted: 02/06/2025] [Indexed: 02/16/2025] Open
Abstract
The rapid advancement of single-cell technologies has created an urgent need for effective methods to integrate and harmonize single-cell data. Technical and biological variations across studies complicate data integration, while conventional tools often struggle with reliance on gene expression distribution assumptions and over-correction. Here, we present scCobra, a deep generative neural network designed to overcome these challenges through contrastive learning with domain adaptation. scCobra effectively mitigates batch effects, minimizes over-correction, and ensures biologically meaningful data integration without assuming specific gene expression distributions. It enables online label transfer across datasets with batch effects, allowing continuous integration of new data without retraining. Additionally, scCobra supports batch effect simulation, advanced multi-omic integration, and scalable processing of large datasets. By integrating and harmonizing datasets from similar studies, scCobra expands the available data for investigating specific biological problems, improving cross-study comparability, and revealing insights that may be obscured in isolated datasets.
Affiliation(s)
- Bowen Zhao
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
  - Division of Experimental Medicine, Department of Medicine, McGill University, Montreal, QC, Canada
- Kailu Song
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
  - Quantitative Life Sciences, McGill University, Montreal, QC, Canada
- Dong-Qing Wei
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Yi Xiong
  - State Key Laboratory of Microbial Metabolism, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, China
- Jun Ding
  - Meakins-Christie Laboratories, Department of Medicine, McGill University Health Centre, Montreal, QC, Canada
  - Division of Experimental Medicine, Department of Medicine, McGill University, Montreal, QC, Canada
  - Quantitative Life Sciences, McGill University, Montreal, QC, Canada
  - School of Computer Science, McGill University, Montreal, QC, Canada
  - Mila-Quebec AI Institute, Montreal, QC, Canada
|
18
|
Wen Y, He H, Ma Y, Bao D, Cai LC, Wang H, Li Y, Zhao B, Cai Z. Computing hematopoiesis plasticity in response to genetic mutations and environmental stimulations. Life Sci Alliance 2025; 8:e202402971. [PMID: 39537342 PMCID: PMC11561260 DOI: 10.26508/lsa.202402971] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2024] [Revised: 11/06/2024] [Accepted: 11/06/2024] [Indexed: 11/16/2024] Open
Abstract
Cell plasticity (CP), describing a dynamic cell state, plays a crucial role in maintaining homeostasis during organ morphogenesis, regeneration, and trauma-to-repair biological processes. Single-cell omics datasets provide an unprecedented resource to empower CP analysis. Hematopoiesis offers fertile opportunities to develop quantitative methods for understanding CP. In this study, we generated high-quality lineage-negative single-cell RNA-sequencing datasets under various conditions and introduced a working pipeline named scPlasticity to interrogate naïve and disturbed plasticity of hematopoietic stem and progenitor cells under mutational or environmental challenges. Using the embedding methods UMAP or FA, a continuum of hematopoietic development is visually observed in wild type, where the pipeline confirms a low proportion of hybrid cells (P_hc, with bias range: 0.4 to 0.6) on a transition trajectory. Upon Tet2 mutation, a driver of leukemia, or treatment with DSS, an inducer of colitis, P_hc is increased and the plasticity of hematopoietic stem and progenitor cells is enhanced. We prioritized several transcription factors and signaling pathways that are responsible for P_hc alterations. In silico perturbation suggests that knocking out EGR regulons or the IL-1R1 and β-adrenoreceptor pathways partially reverses the P_hc increase promoted by Tet2 mutation and inflammation.
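One way to read the hybrid-cell proportion in this abstract is as the fraction of cells whose lineage bias falls inside the stated 0.4 to 0.6 window. The sketch below follows that reading only; it is not the scPlasticity pipeline. How the bias score is computed is not given in the abstract, so the per-cell scores here are hypothetical numbers in [0, 1], and treating the window as inclusive is an assumption.

```python
def hybrid_cell_proportion(bias_scores, low=0.4, high=0.6):
    """Fraction of cells whose bias score lies in [low, high].

    bias_scores: per-cell scores in [0, 1]; values near 0 or 1 are taken to
    mean commitment to one of two fates, values near 0.5 an undecided cell.
    """
    if not bias_scores:
        raise ValueError("bias_scores must not be empty")
    hybrid = sum(1 for b in bias_scores if low <= b <= high)
    return hybrid / len(bias_scores)


# Hypothetical toy scores for ten cells per condition.
wild_type = [0.05, 0.10, 0.92, 0.50, 0.97, 0.03, 0.88, 0.15, 0.95, 0.08]
perturbed = [0.45, 0.10, 0.55, 0.50, 0.97, 0.42, 0.88, 0.58, 0.95, 0.47]

print(hybrid_cell_proportion(wild_type))
print(hybrid_cell_proportion(perturbed))
```

Under this reading, a shift of cells toward the middle of the bias range raises the proportion, which matches the direction of change the abstract reports after Tet2 mutation or DSS treatment.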
Collapse
Affiliation(s)
- Yuchen Wen
- National Key Laboratory of Experimental Hematology, Tianjin, China
- Tianjin Key Laboratory of Inflammatory Biology, Department of Pharmacology, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
| | - Hang He
- National Key Laboratory of Experimental Hematology, Tianjin, China
- Tianjin Key Laboratory of Inflammatory Biology, Department of Pharmacology, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
| | - Yunxi Ma
- National Key Laboratory of Experimental Hematology, Tianjin, China
- Tianjin Key Laboratory of Inflammatory Biology, Department of Pharmacology, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
| | - Dengyi Bao
- National Key Laboratory of Experimental Hematology, Tianjin, China
- Tianjin Key Laboratory of Inflammatory Biology, Department of Pharmacology, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
| | - Lorie Chen Cai
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
| | - Huaquan Wang
- Department of Hematology, Tianjin Medical University Tianjin General Hospital, Tianjin, China
| | - Yanmei Li
- Department of Rheumatology and Immunology, Tianjin Medical University Tianjin General Hospital, Tianjin, China
| | - Baobing Zhao
- Department of Pharmacology, School of Pharmaceutical Sciences, Cheeloo College of Medicine, Shandong University, Jinan, China
| | - Zhigang Cai
- National Key Laboratory of Experimental Hematology, Tianjin, China
- Tianjin Key Laboratory of Inflammatory Biology, Department of Pharmacology, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
- The Province and Ministry Co-sponsored Collaborative Innovation Center for Medical Epigenetics, School of Basic Medical Science, Tianjin Medical University, Tianjin, China
- Department of Hematology, Tianjin Medical University Tianjin General Hospital, Tianjin, China
- Department of Rheumatology and Immunology, Tianjin Medical University Tianjin General Hospital, Tianjin, China
| |
|
19
|
Heimberg G, Kuo T, DePianto DJ, Salem O, Heigl T, Diamant N, Scalia G, Biancalani T, Turley SJ, Rock JR, Corrada Bravo H, Kaminker J, Vander Heiden JA, Regev A. A cell atlas foundation model for scalable search of similar human cells. Nature 2025; 638:1085-1094. [PMID: 39566551 PMCID: PMC11864978 DOI: 10.1038/s41586-024-08411-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2023] [Accepted: 11/14/2024] [Indexed: 11/22/2024]
Abstract
Single-cell RNA sequencing has profiled hundreds of millions of human cells across organs, diseases, development and perturbations to date. Mining these growing atlases could reveal cell-disease associations, identify cell states in unexpected tissue contexts and relate in vivo biology to in vitro models. These require a common measure of cell similarity across the body and an efficient way to search. Here we develop SCimilarity, a metric-learning framework to learn a unified and interpretable representation that enables rapid queries of tens of millions of cell profiles from diverse studies for cells that are transcriptionally similar to an input cell profile or state. We use SCimilarity to query a 23.4-million-cell atlas of 412 single-cell RNA-sequencing studies for macrophage and fibroblast profiles from interstitial lung disease (ref. 1) and reveal similar cell profiles across other fibrotic diseases and tissues. The top scoring in vitro hit for the macrophage query was a 3D hydrogel system (ref. 2), which we experimentally demonstrated reproduces this cell state. SCimilarity serves as a foundation model for single-cell profiles that enables researchers to query for similar cellular states across the human body, providing a powerful tool for generating biological insights from the Human Cell Atlas.
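The query step described in this abstract, retrieving profiles that lie close to an input cell in a learned embedding space, can be illustrated with a minimal sketch. It assumes the cells have already been embedded as fixed-length vectors and uses brute-force cosine similarity. The function names `cosine` and `top_k_similar` and the toy vectors are hypothetical, chosen for illustration only; they are not SCimilarity's API, model, or index structure.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def top_k_similar(query, atlas, k=2):
    """Return the k (cell_id, similarity) pairs closest to the query, best first."""
    scored = [(cell_id, cosine(query, emb)) for cell_id, emb in atlas.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Toy embeddings (hypothetical values, not real expression data).
atlas = {
    "macrophage_a": [0.9, 0.1, 0.0],
    "macrophage_b": [0.8, 0.2, 0.1],
    "fibroblast_a": [0.0, 0.1, 0.9],
}
hits = top_k_similar([1.0, 0.0, 0.0], atlas, k=2)
```

A search over tens of millions of profiles would need an approximate nearest-neighbour index rather than this linear scan; the sketch only shows the similarity ranking itself.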
Affiliation(s)
- Graham Heimberg
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
- Department of Immunology Discovery, Genentech, San Francisco, CA, USA
| | - Tony Kuo
- Roche Informatics, F. Hoffmann-La Roche, Mississauga, Ontario, Canada
| | - Daryle J DePianto
- Department of Immunology Discovery, Genentech, San Francisco, CA, USA
| | - Omar Salem
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
| | - Tobias Heigl
- Department of Immunology Discovery, Genentech, San Francisco, CA, USA
| | - Nathaniel Diamant
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
| | - Gabriele Scalia
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
| | - Tommaso Biancalani
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
| | - Shannon J Turley
- Department of Immunology Discovery, Genentech, San Francisco, CA, USA
- Department of Regenerative Medicine, Genentech, San Francisco, CA, USA
| | - Jason R Rock
- Department of Immunology Discovery, Genentech, San Francisco, CA, USA
- Department of Regenerative Medicine, Genentech, San Francisco, CA, USA
| | - Héctor Corrada Bravo
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
| | - Josh Kaminker
- OMNI Bioinformatics, gRED Computational Sciences, Genentech, San Francisco, CA, USA
| | - Jason A Vander Heiden
- Biology Research, AI Development, gRED Computational Sciences, Genentech, San Francisco, CA, USA
- Department of Immunology Discovery, Genentech, San Francisco, CA, USA
| | - Aviv Regev
- Research and Early Development, Genentech, San Francisco, CA, USA
| |
|
20
|
Liu J, Yang M, Yu Y, Xu H, Wang T, Li K, Zhou X. Advancing bioinformatics with large language models: components, applications and perspectives. arXiv 2025:arXiv:2401.04155v2. [PMID: 38259343 PMCID: PMC10802675] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/24/2024]
Abstract
Large language models (LLMs) are a class of artificial intelligence models based on deep learning that achieve strong performance across various tasks, especially in natural language processing (NLP). LLMs typically consist of artificial neural networks with numerous parameters, trained on large amounts of unlabeled input using self-supervised or semi-supervised learning. However, their potential for solving bioinformatics problems may even exceed their proficiency in modeling human language. In this review, we will provide a comprehensive overview of the essential components of LLMs in bioinformatics, spanning genomics, transcriptomics, proteomics, drug discovery, and single-cell analysis. Key aspects covered include tokenization methods for diverse data types, the architecture of transformer models, the core attention mechanism, and the pre-training processes underlying these models. Additionally, we will introduce currently available foundation models and highlight their downstream applications across various bioinformatics domains. Finally, drawing from our experience, we will offer practical guidance for both LLM users and developers, emphasizing strategies to optimize their use and foster further innovation in the field.
Affiliation(s)
- Jiajia Liu
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
| | - Mengyuan Yang
- Department of Cell Biology and Genetics, School of Basic Medical Sciences, Xi’an Jiaotong University Health Science Center, Xi’an, China
| | - Yankai Yu
- School of Computing and Artificial Intelligence, Southwest Jiaotong University, Chengdu, Sichuan 611756, China
| | - Haixia Xu
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
| | - Tiangang Wang
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
| | - Kang Li
- West China Biomedical Big Data Center, West China Hospital, Sichuan University, Chengdu, Sichuan 610041, China
| | - Xiaobo Zhou
- Center for Computational Systems Medicine, McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, 77030, USA
- McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
- School of Dentistry, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| |
|
21
|
Wu Y, Fan Y, Miao Y, Li Y, Du G, Chen Z, Diao J, Chen YA, Ye M, You R, Chen A, Chen Y, Li W, Guo W, Dong J, Zhang X, Wang Y, Gu J. uniLIVER: a human liver cell atlas for data-driven cellular state mapping. J Genet Genomics 2025:S1673-8527(25)00032-3. [PMID: 39892777 DOI: 10.1016/j.jgg.2025.01.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2025] [Accepted: 01/22/2025] [Indexed: 02/04/2025]
Abstract
The liver performs several vital functions such as metabolism, toxin removal, and glucose storage through the coordination of various cell types. With the recent breakthroughs in single-cell/single-nucleus RNA-seq (sc/snRNA-seq) techniques, there is a great opportunity to establish a reference cell map of the liver at single-cell resolution with transcriptome-wide features. In this study, we build a unified liver cell atlas, uniLIVER (http://lifeome.net/database/uniliver), by integrative analysis of a large-scale sc/snRNA-seq data collection of normal human liver with 331,125 cells and 79 samples from 6 datasets. Moreover, we introduce LiverCT, a novel machine-learning-based method for mapping any query dataset to the liver reference map, which defines "variant" cellular states by analogy to sequence variants in genomic analysis. Applying LiverCT to liver cancer datasets, we find that the "deviated" states of T cells are highly correlated with stress pathway activities in hepatocellular carcinoma, and that enrichment of tumor cells in hepatocyte-cholangiocyte "intermediate" states significantly indicates poor prognosis. We also find that the tumor cells of different patients have different zonation tendencies and that this zonation tendency is significantly associated with prognosis. This reference atlas mapping framework can also be extended to other tissues.
Affiliation(s)
- Yanhong Wu
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yuhan Fan
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yuxin Miao
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Yuman Li
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Guifang Du
- Hepato-Pancreato-Biliary Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China; Clinical Translational Science Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China
| | - Zeyu Chen
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Jinmei Diao
- Hepato-Pancreato-Biliary Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China; Clinical Translational Science Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China
| | - Yu-Ann Chen
- Hepato-Pancreato-Biliary Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China; Clinical Translational Science Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China
| | - Mingli Ye
- Fuzhou Institute of Data Technology, Fuzhou, Fujian 350207, China
| | - Renke You
- Fuzhou Institute of Data Technology, Fuzhou, Fujian 350207, China
| | - Amin Chen
- Fuzhou Institute of Data Technology, Fuzhou, Fujian 350207, China
| | - Yixin Chen
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Wenrui Li
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Wenbo Guo
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| | - Jiahong Dong
- Hepato-Pancreato-Biliary Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China; Clinical Translational Science Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China
| | - Xuegong Zhang
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China; Center for Synthetic and Systems Biology, School of Life Sciences and School of Medicine, Tsinghua University, Beijing 100084, China
| | - Yunfang Wang
- Hepato-Pancreato-Biliary Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China; Clinical Translational Science Center, Beijing Tsinghua Changgung Hospital, Tsinghua University, Beijing 102218, China
| | - Jin Gu
- MOE Key Lab of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
| |
|
22
|
Liu X, Chapple RH, Bennett D, Wright WC, Sanjali A, Culp E, Zhang Y, Pan M, Geeleher P. CSI-GEP: A GPU-based unsupervised machine learning approach for recovering gene expression programs in atlas-scale single-cell RNA-seq data. Cell Genomics 2025; 5:100739. [PMID: 39788105 PMCID: PMC11770216 DOI: 10.1016/j.xgen.2024.100739] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/13/2024] [Revised: 11/06/2024] [Accepted: 12/13/2024] [Indexed: 01/12/2025]
Abstract
Exploratory analysis of single-cell RNA sequencing (scRNA-seq) typically relies on hard clustering over two-dimensional projections like uniform manifold approximation and projection (UMAP). However, such methods can severely distort the data and have many arbitrary parameter choices. Methods that can model scRNA-seq data as non-discrete "gene expression programs" (GEPs) can better preserve the data's structure, but currently, they are often not scalable, not consistent across repeated runs, and lack an established method for choosing key parameters. Here, we developed a GPU-based unsupervised learning approach, "consensus and scalable inference of gene expression programs" (CSI-GEP). We show that CSI-GEP can recover ground truth GEPs in real and simulated atlas-scale scRNA-seq datasets, significantly outperforming cutting-edge methods, including GPT-based neural networks. We applied CSI-GEP to a whole mouse brain atlas of 2.2 million cells, disentangling endothelial cell types missed by other methods, and to an integrated scRNA-seq atlas of human tumors and cell lines, discovering mesenchymal-like GEPs unique to cancer cells growing in culture.
Affiliation(s)
- Xueying Liu
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Richard H Chapple
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Declan Bennett
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - William C Wright
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Ankita Sanjali
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Erielle Culp
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA; Department of Genetics, Genomics, and Informatics, University of Tennessee Health Science Center, Memphis, TN 38163, USA
| | - Yinwen Zhang
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Min Pan
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| | - Paul Geeleher
- Department of Computational Biology, St. Jude Children's Research Hospital, Memphis, TN 38105, USA
| |
|
23
|
Hozumi Y, Wei GW. Analyzing scRNA-seq data by CCP-assisted UMAP and tSNE. PLoS One 2024; 19:e0311791. [PMID: 39671349 PMCID: PMC11642954 DOI: 10.1371/journal.pone.0311791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/20/2024] [Accepted: 09/24/2024] [Indexed: 12/15/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq) is widely used to reveal heterogeneity in cells, which has given us insights into cell-cell communication, cell differentiation, and differential gene expression. However, analyzing scRNA-seq data is a challenge due to sparsity and the large number of genes involved. Therefore, dimensionality reduction and feature selection are important for removing spurious signals and enhancing downstream analysis. Correlated clustering and projection (CCP) was recently introduced as an effective method for preprocessing scRNA-seq data. CCP utilizes gene-gene correlations to partition the genes and, based on the partition, employs cell-cell interactions to obtain super-genes. Because CCP is a data-domain approach that does not require matrix diagonalization, it can be used in many downstream machine learning tasks. In this work, we utilize CCP as an initialization tool for uniform manifold approximation and projection (UMAP) and t-distributed stochastic neighbor embedding (tSNE). Using 21 publicly available datasets, we have found that CCP significantly improves UMAP and tSNE visualization and dramatically improves their accuracy. More specifically, CCP improves UMAP by 22% in ARI, 14% in NMI and 15% in ECM, and improves tSNE by 11% in ARI, 9% in NMI and 8% in ECM.
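The ARI figures quoted in this abstract refer to the adjusted Rand index, which scores the agreement between a predicted clustering and reference labels while correcting for chance. The following is a minimal sketch of the standard pair-counting formula using only the Python standard library. The function name `adjusted_rand_index` and the toy labels are hypothetical and are not taken from the paper's code.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand index between two labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("label sequences must have the same length")
    n = len(labels_true)
    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # zero or one item: the labelings trivially agree

    # Contingency counts: joint cells and the two marginals.
    joint = Counter(zip(labels_true, labels_pred))
    sum_joint = sum(comb(c, 2) for c in joint.values())
    sum_true = sum(comb(c, 2) for c in Counter(labels_true).values())
    sum_pred = sum(comb(c, 2) for c in Counter(labels_pred).values())

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = 0.5 * (sum_true + sum_pred)
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both labelings are one cluster
    return (sum_joint - expected) / (max_index - expected)


# Identical partitions with permuted label names score 1.0.
perfect = adjusted_rand_index([0, 0, 1, 1], [1, 1, 0, 0])
# A partition that splits every true cluster scores below zero.
crossed = adjusted_rand_index([0, 0, 1, 1], [0, 1, 0, 1])
```

Because the index depends only on which items are grouped together, renaming cluster labels leaves the score unchanged, which is why it suits comparisons between clusterings and annotated cell types.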
Affiliation(s)
- Yuta Hozumi
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
| | - Guo-Wei Wei
- Department of Mathematics, Michigan State University, East Lansing, Michigan, United States of America
- Department of Electrical and Computer Engineering, Michigan State University, East Lansing, Michigan, United States of America
- Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, Michigan, United States of America
| |
|
24
|
Liu T, Li K, Wang Y, Li H, Zhao H. Evaluating the Utilities of Foundation Models in Single-cell Data Analysis. bioRxiv 2024:2023.09.08.555192. [PMID: 38464157 PMCID: PMC10925156 DOI: 10.1101/2023.09.08.555192] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Indexed: 03/12/2024]
Abstract
Foundation Models (FMs) have made significant strides in both industrial and scientific domains. In this paper, we evaluate the performance of FMs for single-cell sequencing data analysis through comprehensive experiments across eight downstream tasks pertinent to single-cell data. Overall, considering model performance and user accessibility, the top FMs among the ten single-cell FMs evaluated include scGPT, Geneformer, and CellPLM. However, by comparing these FMs with task-specific methods, we found that single-cell FMs may not consistently outperform task-specific methods in all tasks, which challenges the necessity of developing foundation models for single-cell analysis. In addition, we evaluated the effects of hyper-parameters, initial settings, and stability for training single-cell FMs based on a proposed scEval framework, and provide guidelines for pre-training and fine-tuning to enhance the performance of single-cell FMs. Our work summarizes the current state of single-cell FMs, points to their constraints and avenues for future development, and offers a freely available evaluation pipeline to benchmark new models and improve method development.
|
25
|
Luo Y, Zhao C, Chen F. Multiomics Research: Principles and Challenges in Integrated Analysis. Biodesign Research 2024; 6:0059. [PMID: 39990095 PMCID: PMC11844812 DOI: 10.34133/bdr.0059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2024] [Revised: 10/24/2024] [Accepted: 10/28/2024] [Indexed: 02/25/2025]
Abstract
Multiomics research is a transformative approach in the biological sciences that integrates data from genomics, transcriptomics, proteomics, metabolomics, and other omics technologies to provide a comprehensive understanding of biological systems. This review elucidates the fundamental principles of multiomics, emphasizing the necessity of data integration to uncover the complex interactions and regulatory mechanisms underlying various biological processes. We explore the latest advances in computational methodologies, including deep learning, graph neural networks (GNNs), and generative adversarial networks (GANs), which facilitate the effective synthesis and interpretation of multiomics data. Additionally, this review addresses the critical challenges in this field, such as data heterogeneity, scalability, and the need for robust, interpretable models. We highlight the potential of large language models to enhance multiomics analysis through automated feature extraction, natural language generation, and knowledge integration. Despite the considerable promise of multiomics, the review acknowledges the substantial computational resources required and the complexity of model tuning, underscoring the need for ongoing innovation and collaboration in the field. This comprehensive analysis aims to guide researchers in navigating the principles and challenges of multiomics research to foster advances in integrative biological analysis.
Affiliation(s)
- Yunqing Luo
- National Key Laboratory for Tropical Crop Breeding, College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
- College of Tropical Agriculture and Forestry, Hainan University, Danzhou 571700, China
| | - Chengjun Zhao
- National Key Laboratory for Tropical Crop Breeding, College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
- College of Tropical Agriculture and Forestry, Hainan University, Danzhou 571700, China
| | - Fei Chen
- National Key Laboratory for Tropical Crop Breeding, College of Breeding and Multiplication, Sanya Institute of Breeding and Multiplication, Hainan University, Sanya 572025, China
- College of Tropical Agriculture and Forestry, Hainan University, Danzhou 571700, China
| |
|
26
|
Wang R, Liu Q, You W, Chen Y. A multi-task deep learning model based on comprehensive feature integration and self-attention mechanism for predicting response to anti-PD1/PD-L1. Int Immunopharmacol 2024; 142:113099. [PMID: 39265355 DOI: 10.1016/j.intimp.2024.113099] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 07/26/2024] [Accepted: 09/03/2024] [Indexed: 09/14/2024]
Abstract
BACKGROUND Immune checkpoint inhibitors (ICIs) have been widely used in the treatment of advanced cancers, but predicting their efficacy remains challenging. Traditional biomarkers are numerous but exhibit heterogeneity within populations. To make comprehensive use of ICI-related biomarkers, we aimed to perform multidimensional feature selection and construct a deep learning model. METHODS We used statistical and machine learning methods to map features of different levels to next-generation sequencing gene expression. We integrated genes from different sources into the feature input of a deep learning model by means of a self-attention mechanism. RESULTS We performed feature selection at the single-cell sequencing level, PD-L1 (CD274) analysis level, tumor mutational burden (TMB)/mismatch repair (MMR) level, and somatic copy number alteration (SCNA) level, obtaining 96 feature genes. Based on the pan-cancer dataset, we trained a multi-task deep learning model. We tested the model in the bladder urothelial carcinoma testing set 1 (AUC = 0.62, n = 298), bladder urothelial carcinoma testing set 2 (AUC = 0.66, n = 89), non-small cell lung cancer testing set (AUC = 0.85, n = 27), and skin cutaneous melanoma testing set (AUC = 0.71, n = 27). CONCLUSION Our study demonstrates the potential of the deep learning model for integrating multidimensional features in predicting the outcome of ICI treatment. Our study also provides a potential methodological case for medical scenarios requiring the integration of multiple levels of features.
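The AUC values reported for each testing set measure how often a responder receives a higher predicted score than a non-responder. A minimal sketch of that pairwise definition follows, using only the Python standard library. The function name `roc_auc` and the toy labels and scores are hypothetical and do not come from the study's data or code.

```python
def roc_auc(labels, scores):
    """Area under the ROC curve via the pairwise (Mann-Whitney) definition.

    labels: iterable of 0/1 outcomes (1 = positive, e.g. responder).
    scores: predicted scores, higher meaning more likely positive.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative")

    wins = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                wins += 1.0  # positive correctly ranked above negative
            elif p == q:
                wins += 0.5  # ties count as half
    return wins / (len(positives) * len(negatives))


# Toy example: three of the four positive/negative pairs are ranked correctly.
auc = roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With small testing sets such as n = 27, the number of positive/negative pairs is limited, so each misranked pair shifts the AUC noticeably.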
Affiliation(s)
- Ren Wang
- The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi People's Hospital, Wuxi Medical Center, Department of Immunology, School of Basic Medical Sciences, Nanjing Medical University, Nanjing, China; The Affiliated Huai'an No. 1 People's Hospital, Nanjing Medical University, Huai'an, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
| | - Qiumei Liu
- The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi People's Hospital, Wuxi Medical Center, Department of Immunology, School of Basic Medical Sciences, Nanjing Medical University, Nanjing, China; The Affiliated Huai'an No. 1 People's Hospital, Nanjing Medical University, Huai'an, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
| | - Wenhua You
- The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi People's Hospital, Wuxi Medical Center, Department of Immunology, School of Basic Medical Sciences, Nanjing Medical University, Nanjing, China; The Affiliated Huai'an No. 1 People's Hospital, Nanjing Medical University, Huai'an, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
| | - Yun Chen
- The Affiliated Wuxi People's Hospital of Nanjing Medical University, Wuxi People's Hospital, Wuxi Medical Center, Department of Immunology, School of Basic Medical Sciences, Nanjing Medical University, Nanjing, China; The Affiliated Huai'an No. 1 People's Hospital, Nanjing Medical University, Huai'an, China; Jiangsu Key Lab of Cancer Biomarkers, Prevention and Treatment, Collaborative Innovation Center for Cancer Personalized Medicine, Nanjing Medical University, Nanjing, China
| |
|
27
|
Yang X, Liu G, Feng G, Bu D, Wang P, Jiang J, Chen S, Yang Q, Miao H, Zhang Y, Man Z, Liang Z, Wang Z, Li Y, Li Z, Liu Y, Tian Y, Liu W, Li C, Li A, Dong J, Hu Z, Fang C, Cui L, Deng Z, Jiang H, Cui W, Zhang J, Yang Z, Li H, He X, Zhong L, Zhou J, Wang Z, Long Q, Xu P, Wang H, Meng Z, Wang X, Wang Y, Wang Y, Zhang S, Guo J, Zhao Y, Zhou Y, Li F, Liu J, Chen Y, Yang G, Li X. GeneCompass: deciphering universal gene regulatory mechanisms with a knowledge-informed cross-species foundation model. Cell Res 2024; 34:830-845. [PMID: 39375485 PMCID: PMC11615217 DOI: 10.1038/s41422-024-01034-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/27/2024] [Accepted: 09/13/2024] [Indexed: 10/09/2024]
Abstract
Deciphering universal gene regulatory mechanisms in diverse organisms holds great potential for advancing our knowledge of fundamental life processes and facilitating clinical applications. However, the traditional research paradigm primarily focuses on individual model organisms and does not integrate various cell types across species. Recent breakthroughs in single-cell sequencing and deep learning techniques present an unprecedented opportunity to address this challenge. In this study, we built an extensive dataset of over 120 million human and mouse single-cell transcriptomes. After data preprocessing, we obtained 101,768,420 single-cell transcriptomes and developed a knowledge-informed cross-species foundation model, named GeneCompass. During pre-training, GeneCompass effectively integrated four types of prior biological knowledge to enhance our understanding of gene regulatory mechanisms in a self-supervised manner. By fine-tuning for multiple downstream tasks, GeneCompass outperformed state-of-the-art models in diverse applications for a single species and unlocked new realms of cross-species biological investigations. We also employed GeneCompass to search for key factors associated with cell fate transition and showed that the predicted candidate genes could successfully induce the differentiation of human embryonic stem cells into the gonadal fate. Overall, GeneCompass demonstrates the advantages of using artificial intelligence technology to decipher universal gene regulatory mechanisms and shows tremendous potential for accelerating the discovery of critical cell fate regulators and candidate drug targets.
Affiliation(s)
- Xiaodong Yang
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Guole Liu
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Guihai Feng
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Institute for Stem Cell and Regenerative Medicine, Chinese Academy of Sciences, Beijing, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing, China
| | - Dechao Bu
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
| | - Pengfei Wang
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Jie Jiang
- Institute of Automation, Chinese Academy of Sciences, Beijing, China
| | - Shubai Chen
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Qinmeng Yang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Hefan Miao
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Yiyang Zhang
- University of Chinese Academy of Sciences, Beijing, China
- CEMS, NCMIS, HCMS, MDIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Zhenpeng Man
- University of Chinese Academy of Sciences, Beijing, China
- CEMS, NCMIS, HCMS, MDIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Zhongming Liang
- University of Chinese Academy of Sciences, Beijing, China
- CEMS, NCMIS, HCMS, MDIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Zichen Wang
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Yaning Li
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zheng Li
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Yana Liu
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
| | - Yao Tian
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Wenhao Liu
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
| | - Cong Li
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Ao Li
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Jingxi Dong
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
| | - Zhilong Hu
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Chen Fang
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Lina Cui
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Zixu Deng
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Haiping Jiang
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
| | - Wentao Cui
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Jiahao Zhang
- University of Chinese Academy of Sciences, Beijing, China
- CEMS, NCMIS, HCMS, MDIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Zhaohui Yang
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China
| | - Handong Li
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
- Institute of Automation, Chinese Academy of Sciences, Beijing, China
| | - Xingjian He
- Institute of Automation, Chinese Academy of Sciences, Beijing, China
| | - Liqun Zhong
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Jiaheng Zhou
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China
| | - Zijian Wang
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Qingqing Long
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Ping Xu
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Hongmei Wang
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- Institute for Stem Cell and Regenerative Medicine, Chinese Academy of Sciences, Beijing, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing, China
| | - Zhen Meng
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Xuezhi Wang
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Yangang Wang
- University of Chinese Academy of Sciences, Beijing, China
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China
| | - Yong Wang
- University of Chinese Academy of Sciences, Beijing, China
- CEMS, NCMIS, HCMS, MDIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Shihua Zhang
- University of Chinese Academy of Sciences, Beijing, China
- CEMS, NCMIS, HCMS, MDIS, RCSDS, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing, China
| | - Jingtao Guo
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China
- University of Chinese Academy of Sciences, Beijing, China
- Institute for Stem Cell and Regenerative Medicine, Chinese Academy of Sciences, Beijing, China
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing, China
| | - Yi Zhao
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
- Research Center for Ubiquitous Computing Systems, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
| | - Yuanchun Zhou
- University of Chinese Academy of Sciences, Beijing, China.
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China.
| | - Fei Li
- University of Chinese Academy of Sciences, Beijing, China.
- Computer Network Information Center, Chinese Academy of Sciences, Beijing, China.
| | - Jing Liu
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China.
- Institute of Automation, Chinese Academy of Sciences, Beijing, China.
| | - Yiqiang Chen
- Beijing Key Laboratory of Mobile Computing and Pervasive Device, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
| | - Ge Yang
- State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China.
- School of Artificial Intelligence, University of Chinese Academy of Sciences, Beijing, China.
| | - Xin Li
- State Key Laboratory of Stem Cell and Reproductive Biology, Institute of Zoology, Chinese Academy of Sciences, Beijing, China.
- University of Chinese Academy of Sciences, Beijing, China.
- Institute for Stem Cell and Regenerative Medicine, Chinese Academy of Sciences, Beijing, China.
- Beijing Institute for Stem Cell and Regenerative Medicine, Beijing, China.
| |
|
28
|
Chang CJ, Hsu CY, Liu Q, Shyr Y. VICTOR: Validation and inspection of cell type annotation through optimal regression. Comput Struct Biotechnol J 2024; 23:3270-3280. [PMID: 39296808 PMCID: PMC11408377 DOI: 10.1016/j.csbj.2024.08.028] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2024] [Revised: 08/30/2024] [Accepted: 08/31/2024] [Indexed: 09/21/2024] Open
Abstract
Single-cell RNA sequencing provides unprecedented opportunities to explore the heterogeneity and dynamics inherent in cellular biology. An essential step in the data analysis involves the automatic annotation of cells. Despite the development of numerous tools for automated cell annotation, assessing the reliability of predicted annotations remains challenging, particularly for rare and unknown cell types. Here, we introduce VICTOR: Validation and inspection of cell type annotation through optimal regression. VICTOR aims to gauge the confidence of cell annotations by an elastic-net regularized regression with optimal thresholds. We demonstrated that VICTOR performed well in identifying inaccurate annotations, surpassing existing methods in diagnostic ability across various single-cell datasets, including within-platform, cross-platform, cross-studies, and cross-omics settings.
Affiliation(s)
- Chia-Jung Chang
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Department of Biomedical Engineering, National Cheng Kung University, Tainan, Taiwan
| | - Chih-Yuan Hsu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, USA
| | - Qi Liu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, USA
| | - Yu Shyr
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, USA
| |
|
29
|
Lan W, Ling T, Chen Q, Zheng R, Li M, Pan Y. scMoMtF: An interpretable multitask learning framework for single-cell multi-omics data analysis. PLoS Comput Biol 2024; 20:e1012679. [PMID: 39693287 DOI: 10.1371/journal.pcbi.1012679] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/13/2024] [Accepted: 11/26/2024] [Indexed: 12/20/2024] Open
Abstract
With the rapid development of biotechnology, it is now possible to obtain single-cell multi-omics data from the same cell. However, how to integrate and analyze these single-cell multi-omics data remains a great challenge. Herein, we introduce an interpretable multitask framework (scMoMtF) for comprehensively analyzing single-cell multi-omics data. scMoMtF can simultaneously solve multiple key tasks in single-cell multi-omics data analysis, including dimension reduction, cell classification and data simulation. The experimental results show that scMoMtF outperforms current state-of-the-art algorithms on these tasks. In addition, scMoMtF is interpretable, allowing researchers to gain a reliable understanding of potential biological features and mechanisms in single-cell multi-omics data.
Affiliation(s)
- Wei Lan
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, China
| | - Tongsheng Ling
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, China
| | - Qingfeng Chen
- Guangxi Key Laboratory of Multimedia Communications and Network Technology, School of Computer, Electronic and Information, Guangxi University, Nanning, Guangxi, China
| | - Ruiqing Zheng
- School of computer and engineering, Central South University, Changsha, Hunan, China
| | - Min Li
- School of computer and engineering, Central South University, Changsha, Hunan, China
| | - Yi Pan
- School of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, Guangdong, China
| |
|
30
|
Chau TN, Wang X, McDowell JM, Li S. Advancing plant single-cell genomics with foundation models. CURRENT OPINION IN PLANT BIOLOGY 2024; 82:102666. [PMID: 39579415 DOI: 10.1016/j.pbi.2024.102666] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 10/07/2024] [Accepted: 10/28/2024] [Indexed: 11/25/2024]
Abstract
Single-cell genomics, combined with advanced AI models, holds transformative potential for understanding complex biological processes in plants. This article reviews deep-learning approaches in single-cell genomics, focusing on foundation models, a type of large-scale, pretrained, multi-purpose generative AI model. We explore how these models, such as Generative Pre-trained Transformers (GPT), Bidirectional Encoder Representations from Transformers (BERT), and other Transformer-based architectures, are applied to extract meaningful biological insights from diverse single-cell datasets. These models address challenges in plant single-cell genomics, including improved cell-type annotation, gene network modeling, and multi-omics integration. Moreover, we assess the use of Generative Adversarial Networks (GANs) and diffusion models, focusing on their capacity to generate high-fidelity synthetic single-cell data, mitigate dropout events, and handle data sparsity and imbalance. Together, these AI-driven approaches hold immense potential to enhance research in plant genomics, facilitating discoveries in crop resilience, productivity, and stress adaptation.
Affiliation(s)
- Tran N Chau
- Genetics, Bioinformatics, and Computational Biology, Virginia Tech, USA; School of Plant and Environmental Sciences, Virginia Tech, USA
| | - Xuan Wang
- Department of Computer Science, Virginia Tech, USA
| | - John M McDowell
- School of Plant and Environmental Sciences, Virginia Tech, USA
| | - Song Li
- Genetics, Bioinformatics, and Computational Biology, Virginia Tech, USA; School of Plant and Environmental Sciences, Virginia Tech, USA; Department of Computer Science, Virginia Tech, USA.
| |
|
31
|
Wu Y, Xu P, Wang L, Liu S, Hou Y, Lu H, Hu P, Li X, Yu X. scGO: interpretable deep neural network for cell status annotation and disease diagnosis. Brief Bioinform 2024; 26:bbaf018. [PMID: 39820437 PMCID: PMC11737892 DOI: 10.1093/bib/bbaf018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2024] [Revised: 12/16/2024] [Accepted: 01/10/2025] [Indexed: 01/19/2025] Open
Abstract
Machine learning has emerged as a transformative tool for elucidating cellular heterogeneity in single-cell RNA sequencing. However, a significant challenge lies in the "black box" nature of deep learning models, which obscures the decision-making process and limits interpretability in cell status annotation. In this study, we introduced scGO, a Gene Ontology (GO)-inspired deep learning framework designed to provide interpretable cell status annotation for scRNA-seq data. scGO employs sparse neural networks to leverage the intrinsic biological relationships among genes, transcription factors, and GO terms, significantly augmenting interpretability and reducing computational cost. scGO outperforms state-of-the-art methods in the precise characterization of cell subtypes across diverse datasets. Our extensive experimentation across a spectrum of scRNA-seq datasets underscored the remarkable efficacy of scGO in disease diagnosis, prediction of developmental stages, and evaluation of disease severity and cellular senescence status. Furthermore, we incorporated in silico individual gene manipulations into the scGO model, introducing an additional layer for discovering therapeutic targets. Our results provide an interpretable model for accurately annotating cell status, capturing latent biological knowledge, and informing clinical practice.
Affiliation(s)
- You Wu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| | - Pengfei Xu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| | - Liyuan Wang
- School of Agriculture and Biology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| | - Shuai Liu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| | - Yingnan Hou
- School of Agriculture and Biology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| | - Hui Lu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| | - Peng Hu
- Ministry of Education, Shanghai Ocean University, No. 999, Huchenghuan Road, Shanghai 201306, China
| | - Xiaofei Li
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
- Shanghai Pudong New Area People’s Hospital, No. 490, Chuanhuan South Road, Shanghai 201299, China
| | - Xiang Yu
- School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, No. 800 Dong Chuan Road, Shanghai 200240, China
| |
|
32
|
Lu Q, Ding J, Li L, Chang Y. Graph contrastive learning of subcellular-resolution spatial transcriptomics improves cell type annotation and reveals critical molecular pathways. Brief Bioinform 2024; 26:bbaf020. [PMID: 39883515 PMCID: PMC11781232 DOI: 10.1093/bib/bbaf020] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2024] [Revised: 12/12/2024] [Accepted: 01/10/2025] [Indexed: 01/31/2025] Open
Abstract
Imaging-based spatial transcriptomics (iST) platforms, such as MERFISH, CosMx SMI, and Xenium, quantify gene expression levels across cells in space, but more importantly, they directly reveal the subcellular distribution of RNA transcripts at single-molecule resolution. The subcellular localization of RNA molecules plays a crucial role in the compartmentalization-dependent regulation of genes within individual cells. Understanding the intracellular spatial distribution of RNA for a particular cell type thus not only improves the characterization of cell identity but is also of paramount importance in elucidating unique subcellular regulatory mechanisms specific to the cell type. However, current cell type annotation approaches for iST primarily utilize gene expression information while neglecting the spatial distribution of RNAs within cells. In this work, we introduce a semi-supervised graph contrastive learning method called Focus, the first method, to the best of our knowledge, that explicitly models RNA's subcellular distribution and community to improve cell type annotation. Focus demonstrates significant improvements over state-of-the-art algorithms across a range of spatial transcriptomics platforms, achieving improvements of up to 27.8% in accuracy and 51.9% in F1-score for cell type annotation. Furthermore, Focus benefits from intricate cell type-specific subcellular spatial gene patterns and provides interpretable subcellular gene analysis, such as defining a gene importance score. Importantly, with the importance score, Focus identifies genes harboring strong relevance to cell type-specific pathways, indicating its potential in uncovering novel regulatory programs across numerous biological systems.
Affiliation(s)
- Qiaolin Lu
- School of Artificial Intelligence, Jilin University, Qianjin Street 2699, 130010 Changchun, China
| | - Jiayuan Ding
- Department of Computer Science and Engineering, Michigan State University, 220 Trowbridge Rd, East Lansing, MI 48824, United States
| | - Lingxiao Li
- Department, Boston University, Commonwealth Ave, Boston, MA 02215, United States
| | - Yi Chang
- School of Artificial Intelligence, Jilin University, Qianjin Street 2699, 130010 Changchun, China
- International Center of Future Science, Jilin University, Qianjin Street 2699, 130010 Changchun, China
- Engineering Research Center of Knowledge-Driven Human-Machine Intelligence, Jilin University, Qianjin Street 2699, 130010 Changchun, China
| |
|
33
|
Yuan L, Sun S, Jiang Y, Zhang Q, Ye L, Zheng CH, Huang DS. scRGCL: a cell type annotation method for single-cell RNA-seq data using residual graph convolutional neural network with contrastive learning. Brief Bioinform 2024; 26:bbae662. [PMID: 39708840 DOI: 10.1093/bib/bbae662] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2024] [Revised: 11/13/2024] [Accepted: 12/04/2024] [Indexed: 12/23/2024] Open
Abstract
Cell type annotation is a critical step in analyzing single-cell RNA sequencing (scRNA-seq) data. A large number of deep learning (DL)-based methods have been proposed to annotate cell types of scRNA-seq data and have achieved impressive results. However, there are several limitations to these methods. First, they do not fully exploit cell-to-cell differential features. Second, they are developed based on shallow features and lack flexibility in integrating high-order features in the data. Finally, the low-dimensional gene features may lead to overfitting in neural networks. To overcome these limitations, we propose a novel DL-based model for cell type annotation of scRNA-seq data, scRGCL, based on a residual graph convolutional neural network with contrastive learning. scRGCL mainly consists of a residual graph convolutional neural network, contrastive learning, and weight freezing. A residual graph convolutional neural network is utilized to extract complex high-order features from data. Contrastive learning can help the model learn meaningful cell-to-cell differential features. Weight freezing can avoid overfitting and help the model discover the impact of specific gene expression on cell type annotation. To verify the effectiveness of scRGCL, we compared its performance with six methods (three shallow learning algorithms and three state-of-the-art DL-based methods) on eight single-cell benchmark datasets from two species (seven in human and one in mouse). Experimental results not only show that scRGCL outperforms competing methods but also demonstrate the generalizability of scRGCL for cell type annotation. scRGCL is available at https://github.com/nathanyl/scRGCL.
Affiliation(s)
- Lin Yuan
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, 250353, Shandong, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, 250353, Shandong, China
- Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Fundamental Research Center for Computer Science, 3501 Daxue Road, 250353, Shandong, China
| | - Shengguo Sun
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, 250353, Shandong, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, 250353, Shandong, China
- Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Fundamental Research Center for Computer Science, 3501 Daxue Road, 250353, Shandong, China
| | - Yufeng Jiang
- Key Laboratory of Computing Power Network and Information Security, Ministry of Education, Shandong Computer Science Center, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, 250353, Shandong, China
- Shandong Engineering Research Center of Big Data Applied Technology, Faculty of Computer Science and Technology, Qilu University of Technology (Shandong Academy of Sciences), 3501 Daxue Road, 250353, Shandong, China
- Shandong Provincial Key Laboratory of Industrial Network and Information System Security, Shandong Fundamental Research Center for Computer Science, 3501 Daxue Road, 250353, Shandong, China
| | - Qinhu Zhang
- Ningbo Institute of Digital Twin, Eastern Institute of Technology, 568 Tongxin Road, 315201, Zhejiang, China
| | - Lan Ye
- Cancer Center, The Second Hospital of Shandong University, 247 Beiyuan Street, 250033, Shandong, China
| | - Chun-Hou Zheng
- Key Lab of Intelligent Computing and Signal Processing of Ministry of Education, School of Artificial Intelligence, Anhui University, 111 Jiulong Road, 230601, Anhui, China
| | - De-Shuang Huang
- Ningbo Institute of Digital Twin, Eastern Institute of Technology, 568 Tongxin Road, 315201, Zhejiang, China
| |
|
34
|
Kim J, Ionita M, Lee M, McKeague ML, Pattekar A, Painter MM, Wagenaar J, Truong V, Norton DT, Mathew D, Nam Y, Apostolidis SA, Clendenin C, Orzechowski P, Jung SH, Woerner J, Ittner CAG, Turner AP, Esperanza M, Dunn TG, Mangalmurti NS, Reilly JP, Meyer NJ, Calfee CS, Liu KD, Matthy MA, Swigart LB, Burnham EL, McKeehan J, Gandotra S, Russel DW, Gibbs KW, Thomas KW, Barot H, Greenplate AR, Wherry EJ, Kim D. Cytometry masked autoencoder: An accurate and interpretable automated immunophenotyper. Cell Rep Med 2024; 5:101808. [PMID: 39515318 PMCID: PMC11604491 DOI: 10.1016/j.xcrm.2024.101808] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/25/2024] [Revised: 08/09/2024] [Accepted: 10/08/2024] [Indexed: 11/16/2024]
Abstract
Single-cell cytometry data are crucial for understanding the role of the immune system in diseases and responses to treatment. However, traditional methods for annotating cytometry data face challenges in scalability, robustness, and accuracy. We propose a cytometry masked autoencoder (cyMAE), which automates immunophenotyping tasks including cell type annotation. The model upholds user-defined cell type definitions, facilitating interpretability and cross-study comparisons. The training of cyMAE has a self-supervised phase, which leverages large amounts of unlabeled data, followed by fine-tuning on specialized tasks using smaller amounts of annotated data. The cost of training a new model is amortized over repeated inferences on new datasets using the same panel. Through validation across multiple studies using the same panel, we demonstrate that cyMAE delivers accurate and interpretable cellular immunophenotyping and improves the prediction of subject-level metadata. This proof of concept marks a significant step forward for large-scale immunology studies.
Affiliation(s)
- Jaesik Kim
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA; Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Matei Ionita
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Matthew Lee
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Michelle L McKeague
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Ajinkya Pattekar
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Mark M Painter
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Joost Wagenaar
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Van Truong
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Dylan T Norton
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Divij Mathew
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Yonghyun Nam
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Sokratis A Apostolidis
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Division of Rheumatology, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Cynthia Clendenin
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Patryk Orzechowski
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Automatics and Robotics, AGH University of Science and Technology, al. Mickiewicza 30, 30-059 Kraków, Poland
| | - Sang-Hyuk Jung
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Jakob Woerner
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Caroline A G Ittner
- Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Alexandra P Turner
- Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Mika Esperanza
- Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Thomas G Dunn
- Division of Hematology/Oncology, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Nilam S Mangalmurti
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - John P Reilly
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Nuala J Meyer
- Division of Pulmonary and Critical Care Medicine, Department of Medicine, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
| | - Carolyn S Calfee
- Department of Anesthesia and Perioperative Care, University of California, San Francisco, School of Medicine, San Francisco, CA 94143, USA; Division of Pulmonary, Critical Care, Allergy, and Sleep Medicine, University of California, San Francisco, School of Medicine, San Francisco, CA 94143, USA; Cardiovascular Research Institute, Department of Medicine, University of California, San Francisco, School of Medicine, San Francisco, CA 94158, USA
| | - Kathleen D Liu
- Division of Nephrology and Critical Care Medicine, University of California, San Francisco, School of Medicine, San Francisco, CA 94143, USA
| | - Michael A Matthy
- Cardiovascular Research Institute, Department of Medicine, University of California, San Francisco, School of Medicine, San Francisco, CA 94158, USA
| | - Lamorna Brown Swigart
- Department of Laboratory Medicine, University of California, San Francisco, School of Medicine, San Francisco, CA 94143, USA
| | - Ellen L Burnham
- Division of Pulmonary Sciences and Critical Care Medicine, Department of Medicine, University of Colorado School of Medicine, Aurora, CO 80045, USA
| | - Jeffrey McKeehan
- Division of Pulmonary Sciences and Critical Care Medicine, Department of Medicine, University of Colorado School of Medicine, Aurora, CO 80045, USA
| | - Sheetal Gandotra
- Division of Pulmonary, Allergy and Critical Care Medicine, Department of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA
| | - Derek W Russel
- Division of Pulmonary, Allergy and Critical Care Medicine, Department of Medicine, University of Alabama at Birmingham, Birmingham, AL 35294, USA; Pulmonary Section, Birmingham Veteran's Affairs Medical Center, Birmingham, AL 35233, USA
| | - Kevin W Gibbs
- Section on Pulmonary and Critical Care, Allergy, and Immunology, Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Karl W Thomas
- Section on Pulmonary and Critical Care, Allergy, and Immunology, Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Harsh Barot
- Section on Hospital Medicine, Department of Internal Medicine, Wake Forest School of Medicine, Winston-Salem, NC 27157, USA
| | - Allison R Greenplate
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | - E John Wherry
- Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Parker Institute for Cancer Immunotherapy, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
| | - Dokyoon Kim
- Department of Bioengineering, University of Pennsylvania, Philadelphia, PA, USA; Institute for Immunology & Immune Health (I3H), Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA; Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA.
|
35
|
Xiong X, Liu Y, Pu D, Yang Z, Bi Z, Tian L, Li X. DeSide: A unified deep learning approach for cellular deconvolution of tumor microenvironment. Proc Natl Acad Sci U S A 2024; 121:e2407096121. [PMID: 39514318 PMCID: PMC11573681 DOI: 10.1073/pnas.2407096121] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/09/2024] [Accepted: 09/23/2024] [Indexed: 11/16/2024] Open
Abstract
Cellular deconvolution via bulk RNA sequencing (RNA-seq) presents a cost-effective and efficient alternative to experimental methods such as flow cytometry and single-cell RNA-seq (scRNA-seq) for analyzing the complex cellular composition of tumor microenvironments. Despite challenges due to heterogeneity within and among tumors, our innovative deep learning-based approach, DeSide, shows exceptional accuracy in estimating the proportions of 16 distinct cell types and subtypes within solid tumors. DeSide integrates biological pathways and assesses noncancerous cell types first, effectively sidestepping the issue of highly variable gene expression profiles (GEPs) associated with cancer cells. By leveraging scRNA-seq data from six cancer types and 185 cancer cell lines across 22 cancer types as references, our method introduces distinctive sampling and filtering techniques to generate a high-quality training set that closely replicates real tumor GEPs, based on The Cancer Genome Atlas (TCGA) bulk RNA-seq data. With this model and high-quality training set, DeSide outperforms existing methods in estimating tumor purity and the proportions of noncancerous cells within solid tumors. Our model precisely predicts cellular compositions across 19 cancer types from TCGA and proves its effectiveness with multiple additional external datasets. Crucially, DeSide enables the identification and analysis of combinatorial cell type pairs, facilitating the stratification of cancer patients into prognostically significant groups. This approach not only provides deeper insights into the dynamics of tumor biology but also highlights potential therapeutic targets by underscoring the importance of specific cell type or subtype interactions.
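The idea of building a training set of simulated bulk profiles from single-cell references can be illustrated with a minimal sketch. It is not DeSide's actual sampling or filtering procedure, which the abstract does not describe. The cell-type names, the three-gene toy profiles, and the helper names `sample_proportions` and `simulate_bulk` are all hypothetical.

```python
import random

def sample_proportions(n_types, rng):
    """Draw random cell-type fractions that sum to 1.

    Normalizing independent exponential draws gives a flat Dirichlet sample.
    """
    draws = [rng.expovariate(1.0) for _ in range(n_types)]
    total = sum(draws)
    return [d / total for d in draws]

def simulate_bulk(profiles, proportions):
    """Mix per-cell-type expression profiles into one bulk profile.

    Each gene's bulk value is the proportion-weighted sum over cell types.
    """
    n_genes = len(next(iter(profiles.values())))
    bulk = [0.0] * n_genes
    for profile, frac in zip(profiles.values(), proportions):
        for g, value in enumerate(profile):
            bulk[g] += frac * value
    return bulk

# Toy reference profiles (hypothetical values, three genes per cell type).
profiles = {
    "T cell": [10.0, 0.0, 2.0],
    "B cell": [0.0, 8.0, 2.0],
    "Fibroblast": [1.0, 1.0, 9.0],
}

rng = random.Random(0)
training_set = []
for _ in range(5):
    fractions = sample_proportions(len(profiles), rng)
    # Each training example pairs a simulated bulk profile with its true fractions.
    training_set.append((simulate_bulk(profiles, fractions), fractions))
```

A deconvolution model would then be trained to recover `fractions` from the simulated bulk profile. Making the simulated profiles resemble real tumor data is the part the abstract attributes to DeSide's own sampling and filtering steps.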
Affiliation(s)
- Xin Xiong
- Department of Physics, Hong Kong Baptist University, Hong Kong, China
| | - Yerong Liu
- Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Dandan Pu
- Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
| | - Zhu Yang
- State Key Laboratory of Environmental and Biological Analysis, Hong Kong Baptist University, Hong Kong, China
| | - Zedong Bi
- Lingang Laboratory, Shanghai 200031, China
| | - Liang Tian
- Department of Physics, Hong Kong Baptist University, Hong Kong, China
- State Key Laboratory of Environmental and Biological Analysis, Hong Kong Baptist University, Hong Kong, China
- Institute of Computational and Theoretical Studies, Hong Kong Baptist University, Hong Kong, China
- Institute of Systems Medicine and Health Sciences, Hong Kong Baptist University, Hong Kong, China
| | - Xuefei Li
- Key Laboratory of Quantitative Synthetic Biology, Shenzhen Institute of Synthetic Biology, Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen 518055, China
|
36
|
Fan X, Liu J, Yang Y, Gu C, Han Y, Wu B, Jiang Y, Chen G, Heng PA. scGraphformer: unveiling cellular heterogeneity and interactions in scRNA-seq data using a scalable graph transformer network. Commun Biol 2024; 7:1463. [PMID: 39511415 PMCID: PMC11543810 DOI: 10.1038/s42003-024-07154-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2024] [Accepted: 10/28/2024] [Indexed: 11/15/2024] Open
Abstract
The precise classification of cell types from single-cell RNA sequencing (scRNA-seq) data is pivotal for dissecting cellular heterogeneity in biological research. Traditional graph neural network (GNN) models are constrained by reliance on predefined graphs, limiting the exploration of complex cell-to-cell relationships. We introduce scGraphformer, a transformer-based GNN that transcends these limitations by learning an all-encompassing cell-cell relational network directly from scRNA-seq data. Through an iterative refinement process, scGraphformer constructs a dense graph structure that captures the full spectrum of cellular interactions. This comprehensive approach enables the identification of subtle and previously obscured cellular patterns and relationships. Evaluated on multiple datasets, scGraphformer demonstrates superior performance in cell type identification compared to existing methods and showcases its scalability with large-scale datasets. Our method not only provides enhanced cell type classification ability but also reveals the underlying cell interactions, offering deeper insights into functional cellular relationships. The scGraphformer thus holds the potential to significantly advance the field of single-cell analysis and contribute to a more nuanced understanding of cellular behavior.
Affiliation(s)
- Xingyu Fan
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
| | - Jiacheng Liu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China.
| | - Yaodong Yang
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
| | - Chunbin Gu
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
| | - Yuqiang Han
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
| | - Bian Wu
- Zhejiang Lab, Hangzhou, China
| | - Yirong Jiang
- Department of Chemistry, Zhejiang University, Hangzhou, China
| | | | - Pheng-Ann Heng
- Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong, China
|
37
|
Tang Z, Chen G, Chen S, He H, You L, Chen CYC. Knowledge-based inductive bias and domain adaptation for cell type annotation. Commun Biol 2024; 7:1440. [PMID: 39501016 PMCID: PMC11538527 DOI: 10.1038/s42003-024-07171-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/22/2024] [Accepted: 10/30/2024] [Indexed: 11/08/2024] Open
Abstract
Measurement techniques often result in domain gaps among batches of cellular data from a specific modality. The effectiveness of cross-batch annotation methods is influenced by inductive bias, which refers to a set of assumptions that describe the behavior of model predictions. Different annotation methods possess distinct inductive biases, leading to varying degrees of generalizability and interpretability. Given that certain cell types exhibit unique functional patterns, we hypothesize that the inductive biases of cell annotation methods should align with these biological patterns to produce meaningful predictions. In this study, we propose KIDA, Knowledge-based Inductive bias and Domain Adaptation. The knowledge-based inductive bias constrains the prediction rules learned from the reference dataset, composed of multiple batches, to functional patterns relevant to biology, thereby enhancing the generalization of the model to unseen batches. Since the query dataset also contains gaps from multiple batches, KIDA's domain adaptation employs pseudo labels for self-knowledge distillation, effectively narrowing the distribution gap between model predictions and the query dataset. Benchmark experiments demonstrate that KIDA is capable of achieving accurate cross-batch cell type annotation.
Affiliation(s)
- Zhenchao Tang
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
| | - Guanxing Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
| | - Shouzhi Chen
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China
| | - Haohuai He
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China
- Department of Data Science and Artificial Intelligence, The Hong Kong Polytechnic University, Hong Kong SAR, China
| | - Linlin You
- Artificial Intelligence Medical Research Center, School of Intelligent Systems Engineering, Shenzhen Campus of Sun Yat-sen University, Shenzhen, China.
| | - Calvin Yu-Chian Chen
- AI for Science (AI4S)-Preferred Program, School of Electronic and Computer Engineering, Peking University Shenzhen Graduate School, Shenzhen, China.
- State Key Laboratory of Chemical Oncogenomics, Key Laboratory of Chemical Genomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, China.
- Department of Medical Research, China Medical University Hospital, Taichung, Taiwan.
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan.
- Guangdong L-Med Biotechnology Co., Ltd., Meizhou, China.
|
38
|
Hu Z, Li Y, Han C. Transfer learning enabled transformer-based generative adversarial networks for modeling and generating terahertz channels. COMMUNICATIONS ENGINEERING 2024; 3:153. [PMID: 39488675 PMCID: PMC11531481 DOI: 10.1038/s44172-024-00309-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/05/2024] [Accepted: 10/24/2024] [Indexed: 11/04/2024]
Abstract
Terahertz communications are envisioned as a promising technology for sixth-generation and beyond wireless systems, which can support wireless links with Terabits-per-second (Tbps) data rates. As the foundation of designing terahertz communications, channel modeling and characterization are crucial to scrutinize the potential of this spectrum. However, current channel modeling in the terahertz band relies heavily on time-consuming and costly measurements. Here, we propose a transfer learning enabled transformer-based generative adversarial network to mitigate this problem in terahertz channel modeling. Specifically, as a fundamental building block, a generative adversarial network is exploited to generate channel parameters. To improve the accuracy, a transformer structure with a self-attention mechanism is incorporated in the generative adversarial network. Because the resulting network still incurs errors compared with ground-truth measurements, transfer learning is designed to resolve the mismatch between the formulated network and the measurements. The proposed method can achieve high accuracy in channel modeling while requiring only a rather limited amount of measurement, making it a promising complement to current channel modeling techniques.
Affiliation(s)
- Zhengdong Hu
- Terahertz Wireless Communications (TWC) Laboratory, Shanghai Jiao Tong University, 200240, Shanghai, China
| | - Yuanbo Li
- Terahertz Wireless Communications (TWC) Laboratory, Shanghai Jiao Tong University, 200240, Shanghai, China
| | - Chong Han
- Terahertz Wireless Communications (TWC) Laboratory, Shanghai Jiao Tong University, 200240, Shanghai, China.
|
39
|
Chen H, Lu Y, Rao Y. A self-training interpretable cell type annotation framework using specific marker gene. Bioinformatics 2024; 40:btae569. [PMID: 39312689 PMCID: PMC11488977 DOI: 10.1093/bioinformatics/btae569] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2024] [Revised: 09/03/2024] [Accepted: 09/19/2024] [Indexed: 09/25/2024] Open
Abstract
MOTIVATION Recent advances in sequencing technology provide opportunities to study biological processes at a higher resolution. Cell type annotation is an important step in scRNA-seq analysis, which often relies on established marker genes. However, most previous methods divide the identification of cell types into two stages, clustering and assignment, whose performance is sensitive to the clustering algorithm, and the marker information cannot effectively guide the clustering process. Furthermore, their linear heuristic-based cell assignment process is often insufficient to capture potential dependencies between cells and types. RESULTS Here, we present Interpretable Cell Type Annotation based on self-training (sICTA), a marker-based cell type annotation method that combines the self-training strategy with pseudo-labeling and the nonlinear association capturing capability of the Transformer. In addition, we incorporate prior biological knowledge of genes and pathways into the classifier through an attention mechanism to enhance the transparency of the model. A benchmark analysis on 11 publicly available single-cell datasets demonstrates the superiority of sICTA compared to state-of-the-art methods. The robustness of our method is further validated by evaluating the prediction accuracy of the model on different cell types for each single-cell dataset. Moreover, ablation studies show that self-training and the ability to capture potential dependencies between cells and cell types, which are mutually reinforcing, work together to improve model performance. Finally, we apply sICTA to the pancreatic dataset, exemplifying the interpretable attention matrix captured by sICTA. AVAILABILITY AND IMPLEMENTATION The source code of sICTA is publicly available at https://github.com/nbnbhwyy/sICTA. The processed datasets can be found at https://drive.google.com/drive/folders/1jbqSxacL_IDIZ4uPjq220C9Kv024m9eL. The final version of the model will be permanently available at https://doi.org/10.5281/zenodo.13474010.
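Self-training with pseudo-labeling, the strategy named in the abstract, can be sketched generically as follows. This is an illustration under stated assumptions, not sICTA's implementation: a nearest-centroid classifier stands in for the Transformer, and the confidence score, the `threshold` value, and all function names are hypothetical.

```python
import math

def centroids(points, labels):
    """Compute the mean vector of the points assigned to each label."""
    sums, counts = {}, {}
    for p, y in zip(points, labels):
        acc = sums.setdefault(y, [0.0] * len(p))
        for i, v in enumerate(p):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [v / counts[y] for v in acc] for y, acc in sums.items()}

def predict(cents, p):
    """Return (label, confidence); confidence is a softmax over negative distances."""
    weights = {y: math.exp(-math.dist(c, p)) for y, c in cents.items()}
    total = sum(weights.values())
    best = max(weights, key=weights.get)
    return best, weights[best] / total

def self_train(labeled, labels, unlabeled, threshold=0.8, rounds=5):
    """Iteratively promote confident predictions on unlabeled points to pseudo-labels."""
    points, ys = list(labeled), list(labels)
    pool = list(unlabeled)
    for _ in range(rounds):
        cents = centroids(points, ys)
        remaining = []
        for p in pool:
            y, conf = predict(cents, p)
            if conf >= threshold:
                points.append(p)   # accept the pseudo-label
                ys.append(y)
            else:
                remaining.append(p)  # too uncertain, keep for a later round
        if len(remaining) == len(pool):
            break  # no new pseudo-labels, so further rounds would change nothing
        pool = remaining
    return centroids(points, ys), pool

# Toy data: two labeled "cells" and four unlabeled ones (hypothetical 2-D features).
model, leftover = self_train(
    labeled=[(0.0, 0.0), (10.0, 10.0)],
    labels=["A", "B"],
    unlabeled=[(1.0, 1.0), (9.0, 9.0), (0.5, 0.0), (5.0, 5.0)],
)
```

The threshold controls the usual trade-off in self-training: a lower value adds more pseudo-labels per round but risks reinforcing early mistakes.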
Affiliation(s)
- Hegang Chen
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
| | - Yuyin Lu
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
| | - Yanghui Rao
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China
|
40
|
Kong T, Yu T, Zhao J, Hu Z, Xiong N, Wan J, Dong X, Pan Y, Zheng H, Zhang L. scGAA: a general gated axial-attention model for accurate cell-type annotation of single-cell RNA-seq data. Sci Rep 2024; 14:22308. [PMID: 39333739 PMCID: PMC11436728 DOI: 10.1038/s41598-024-73356-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/06/2024] [Accepted: 09/17/2024] [Indexed: 09/30/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a key technology for investigating cell development and analysing cell diversity across various diseases. However, the high dimensionality and extreme sparsity of scRNA-seq data pose great challenges for accurate cell type annotation. To address this, we developed a new cell-type annotation model called scGAA (general gated axial-attention model for accurate cell-type annotation of scRNA-seq). Based on the transformer framework, the model decomposes the traditional self-attention mechanism into horizontal and vertical attention, considerably improving computational efficiency. This axial attention mechanism can process high-dimensional data more efficiently while maintaining reasonable model complexity. Additionally, the gated unit was integrated into the model to enhance the capture of relationships between genes, which is crucial for achieving an accurate cell type annotation. The results revealed that our improved transformer model is a promising tool for practical applications. This theoretical innovation increased the model performance and provided new insights into analytical tools for scRNA-seq data.
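The efficiency argument for splitting self-attention into horizontal and vertical passes can be made concrete with a small sketch. It assumes the tokens are arranged on an H x W grid and omits learned projections and the gated unit, so it illustrates axial attention in general rather than scGAA's architecture. All function names are hypothetical.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def attend(seq):
    """Scaled dot-product self-attention over a list of vectors.

    Queries, keys and values are the inputs themselves (no learned weights).
    """
    dim = len(seq[0])
    out = []
    for q in seq:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in seq]
        w = softmax(scores)
        out.append([sum(wi * v[d] for wi, v in zip(w, seq)) for d in range(dim)])
    return out

def axial_attention(grid):
    """grid[r][c] is a feature vector; attend along rows, then along columns."""
    rows = [attend(row) for row in grid]  # horizontal pass
    n_rows, n_cols = len(rows), len(rows[0])
    cols = [attend([rows[r][c] for r in range(n_rows)]) for c in range(n_cols)]  # vertical pass
    # cols is indexed [column][row]; transpose back to [row][column].
    return [[cols[c][r] for c in range(n_cols)] for r in range(n_rows)]

def full_pairs(h, w):
    """Token pairs scored by ordinary self-attention over all h*w tokens."""
    return (h * w) ** 2

def axial_pairs(h, w):
    """Token pairs scored when each token attends only within its row, then its column."""
    return h * w * (h + w)

grid = [[[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
        [[0.5, 0.5], [2.0, 0.0], [0.0, 2.0]]]
mixed = axial_attention(grid)
```

For a 100 x 100 grid, `full_pairs` gives 100,000,000 scored pairs and `axial_pairs` gives 2,000,000, which is the kind of saving the abstract refers to.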
Affiliation(s)
- Tianci Kong
- College of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China
| | - Tiancheng Yu
- School of Sciences, Zhejiang University of Science and Technology, Hangzhou, 310023, China
| | - Jiaxin Zhao
- Department of Hepatobiliary and Pancreatic Surgery, Department of Surgery, Fourth Affiliated Hospital, School of Medicine, Zhejiang University, Yiwu, 322000, China
| | - Zhenhua Hu
- Department of Hepatobiliary and Pancreatic Surgery, Department of Surgery, Fourth Affiliated Hospital, School of Medicine, Zhejiang University, Yiwu, 322000, China
| | - Neal Xiong
- Department of Computer Science and Mathematics, Sul Ross State University, Alpine, USA
| | - Jian Wan
- College of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China
| | - Xiaoliang Dong
- College of Information Science and Engineering, Shandong Agricultural University, Taian, 271018, China
| | - Yi Pan
- Faculty of Computer Science and Control Engineering, Shenzhen University of Advanced Technology, Shenzhen, 518118, China
| | - Huilin Zheng
- College of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China.
| | - Lei Zhang
- College of Biological and Chemical Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China.
- College of Information and Electronic Engineering, Zhejiang University of Science and Technology, Hangzhou, 310023, China.
|
41
|
Xie J, Song Y, Zheng H, Luo S, Chen Y, Zhang C, Yu R, Tong M. PathMethy: an interpretable AI framework for cancer origin tracing based on DNA methylation. Brief Bioinform 2024; 25:bbae497. [PMID: 39391931 PMCID: PMC11467402 DOI: 10.1093/bib/bbae497] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Revised: 09/09/2024] [Accepted: 10/02/2024] [Indexed: 10/12/2024] Open
Abstract
Despite advanced diagnostics, 3%-5% of cases remain classified as cancer of unknown primary (CUP). DNA methylation, an important epigenetic feature, is essential for determining the origin of metastatic tumors. We presented PathMethy, a novel Transformer model integrated with functional categories and crosstalk of pathways, to accurately trace the origin of tumors in CUP samples based on DNA methylation. PathMethy outperformed seven competing methods in F1-score across nine cancer datasets and accurately predicted the molecular subtypes within nine primary tumor types. It not only excelled at tracing the origins of both primary and metastatic tumors but also demonstrated a high degree of agreement with previously diagnosed sites in cases of CUP. PathMethy provided biological insights by highlighting key pathways, functional categories, and their interactions. Using functional categories of pathways, we gained a global understanding of biological processes. For broader access, a user-friendly web server for researchers and clinicians is available at https://cup.pathmethy.com.
Affiliation(s)
- Jiajing Xie
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361102, China
| | - Yuhang Song
- School of Informatics, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361005, China
| | - Hailong Zheng
- Department of Bioinformatics, School of Basic Medical Sciences, Southern Medical University, No. 1023, South Shatai Road, Baiyun District, Guangzhou, Guangdong, 510515, China
| | - Shijie Luo
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361102, China
| | - Ying Chen
- School of Informatics, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361005, China
| | - Chen Zhang
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361102, China
| | - Rongshan Yu
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361102, China
- School of Informatics, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361005, China
| | - Mengsha Tong
- National Institute for Data Science in Health and Medicine, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361102, China
- State Key Laboratory of Cellular Stress Biology, School of Life Sciences, Faculty of Medicine and Life Sciences, Xiamen University, No. 4221-121 South Xiang'an Road, Xiamen, Fujian 361102, China
|
42
|
Cheng J, Pan X, Fang Y, Yang K, Xue Y, Yan Q, Yuan Y. GexMolGen: cross-modal generation of hit-like molecules via large language model encoding of gene expression signatures. Brief Bioinform 2024; 25:bbae525. [PMID: 39470305 PMCID: PMC11514063 DOI: 10.1093/bib/bbae525] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 09/03/2024] [Accepted: 10/22/2024] [Indexed: 10/30/2024] Open
Abstract
Designing de novo molecules with specific biological activity is an essential task since it holds the potential to bypass the exploration of target genes, which is an initial step in the modern drug discovery paradigm. However, traditional methods mainly screen molecules by comparing the desired molecular effects within the documented experimental results. The data set limits this process, and it is hard to conduct direct cross-modal comparisons. Therefore, we propose a solution based on cross-modal generation called GexMolGen (Gene Expression-based Molecule Generator), which generates hit-like molecules using gene expression signatures alone. These signatures are calculated by inputting control and desired gene expression states. Our model GexMolGen adopts a "first-align-then-generate" strategy, aligning the gene expression signatures and molecules within a mapping space, ensuring a smooth cross-modal transition. The transformed molecular embeddings are then decoded into molecular graphs. In addition, we employ an advanced single-cell large language model for input flexibility and pre-train a scaffold-based molecular model to ensure that all generated molecules are 100% valid. Empirical results show that our model can produce molecules highly similar to known references, whether it is fed in-domain or out-of-domain transcriptome data. Furthermore, it can also serve as a reliable tool for cross-modal screening.
Affiliation(s)
- Jiabei Cheng
- Department of Automation, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
| | - Xiaoyong Pan
- Department of Automation, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
| | - Yi Fang
- Department of Automation, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
| | - Kaiyuan Yang
- Department of Automation, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
| | - Yiming Xue
- Department of Automation, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
| | - Qingran Yan
- Department of Rheumatology, Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, No. 1630 East Road, Pudong New Area, Shanghai 200127, China
| | - Ye Yuan
- Department of Automation, School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University, 800 Dongchuan Road, Minhang District, Shanghai 200240, China
- Key Laboratory of Biopharmaceutical Preparation and Delivery, Chinese Academy of Sciences, 1 North 2nd Street, Zhongguancun, Haidian District, Beijing 100190, PR China
- State Key Laboratory of Biochemical Engineering, Institute of Process Engineering, Chinese Academy of Sciences, 1 North 2nd Street, Zhongguancun, Haidian District, Beijing 100190, PR China
|
43
|
Park S, Lee H. Robust self-supervised learning strategy to tackle the inherent sparsity in single-cell RNA-seq data. Brief Bioinform 2024; 25:bbae586. [PMID: 39550222 PMCID: PMC11568879 DOI: 10.1093/bib/bbae586] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2024] [Revised: 09/26/2024] [Accepted: 10/31/2024] [Indexed: 11/18/2024] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) is a powerful tool for elucidating cellular heterogeneity and tissue function in various biological contexts. However, the sparsity in scRNA-seq data limits the accuracy of cell type annotation and transcriptomic analysis due to information loss. To address this limitation, we present scRobust, a robust self-supervised learning strategy to tackle the inherent sparsity of scRNA-seq data. Built upon the Transformer architecture, scRobust employs a novel self-supervised learning strategy comprising contrastive learning and gene expression prediction tasks. We demonstrated the effectiveness of scRobust using nine benchmarks, additional dropout scenarios, and combined datasets. scRobust outperformed recent methods in cell-type annotation tasks and generated cell embeddings that capture multi-faceted clustering information (e.g. cell types and HbA1c levels). In addition, cell embeddings of scRobust were useful for detecting specific marker genes related to drug tolerance stages. Furthermore, when we applied scRobust to scATAC-seq data, high-quality cell embedding vectors were generated. These results demonstrate the representational power of scRobust.
Affiliation(s)
- Sejin Park
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 61005, Gwangju, South Korea
- Hyunju Lee
- School of Electrical Engineering and Computer Science, Gwangju Institute of Science and Technology, 61005, Gwangju, South Korea
- Artificial Intelligence Graduate School, Gwangju Institute of Science and Technology, 61005, Gwangju, South Korea
44
Chu X, Li X, Zhang Y, Dang G, Miao Y, Xu W, Wang J, Zhang Z, Cheng S. Integrative single-cell analysis of human colorectal cancer reveals patient stratification with distinct immune evasion mechanisms. Nat Cancer 2024; 5:1409-1426. [PMID: 39147986 DOI: 10.1038/s43018-024-00807-z] [Received: 06/17/2023] [Accepted: 07/16/2024] [Indexed: 08/17/2024]
Abstract
The tumor microenvironment (TME) considerably influences colorectal cancer (CRC) progression, therapeutic response and clinical outcome, but studies of interindividual heterogeneities of the TME in CRC are lacking. Here, by integrating human colorectal single-cell transcriptomic data from approximately 200 donors, we comprehensively characterized transcriptional remodeling in the TME compared to noncancer tissues and identified a rare tumor-specific subset of endothelial cells with T cell recruitment potential. The large sample size enabled us to stratify patients based on their TME heterogeneity, revealing divergent TME subtypes in which cancer cells exploit different immune evasion mechanisms. Additionally, by associating single-cell transcriptional profiling with risk genes identified by genome-wide association studies, we determined that stromal cells are major effector cell types in CRC genetic susceptibility. In summary, our results provide valuable insights into CRC pathogenesis and might help with the development of personalized immune therapies.
Affiliation(s)
- Yu Zhang
- Changping Laboratory, Beijing, China
- Guohui Dang
- Changping Laboratory, Beijing, China
- Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
- Wenbin Xu
- Changping Laboratory, Beijing, China
- Zemin Zhang
- BIOPIC, Beijing Advanced Innovation Center for Genomics, School of Life Sciences, Peking University, Beijing, China
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Peking University, Beijing, China
45
Cao X, Huang YA, You ZH, Shang X, Hu L, Hu PW, Huang ZA. scPriorGraph: constructing biosemantic cell-cell graphs with prior gene set selection for cell type identification from scRNA-seq data. Genome Biol 2024; 25:207. [PMID: 39103856 DOI: 10.1186/s13059-024-03357-w] [Received: 10/26/2023] [Accepted: 07/29/2024] [Indexed: 08/07/2024]
Abstract
Cell type identification is an indispensable analytical step in single-cell data analyses. To address the high noise stemming from gene expression data, existing computational methods often overlook the biologically meaningful relationships between genes, opting to reduce all genes to a unified data space. We assume that such relationships can aid in characterizing cell type features and improving cell type recognition accuracy. To this end, we introduce scPriorGraph, a dual-channel graph neural network that integrates multi-level gene biosemantics. Experimental results demonstrate that scPriorGraph effectively aggregates feature values of similar cells using high-quality graphs, achieving state-of-the-art performance in cell type identification.
Affiliation(s)
- Xiyue Cao
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yu-An Huang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Zhu-Hong You
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Xuequn Shang
- School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Lun Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Peng-Wei Hu
- Xinjiang Technical Institute of Physics and Chemistry, Chinese Academy of Sciences, Ürümqi, China
- Zhi-An Huang
- Research Office, City University of Hong Kong (Dongguan), Dongguan, 523000, China
46
Szałata A, Hrovatin K, Becker S, Tejada-Lapuerta A, Cui H, Wang B, Theis FJ. Transformers in single-cell omics: a review and new perspectives. Nat Methods 2024; 21:1430-1443. [PMID: 39122952 DOI: 10.1038/s41592-024-02353-z] [Received: 12/19/2023] [Accepted: 06/07/2024] [Indexed: 08/12/2024]
Abstract
Recent efforts to construct reference maps of cellular phenotypes have expanded the volume and diversity of single-cell omics data, providing an unprecedented resource for studying cell properties. Despite the availability of rich datasets and their continued growth, current single-cell models are unable to fully capitalize on the information they contain. Transformers have become the architecture of choice for foundation models in other domains owing to their ability to generalize to heterogeneous, large-scale datasets. Thus, the question arises of whether transformers could set off a similar shift in the field of single-cell modeling. Here we first describe the transformer architecture and its single-cell adaptations and then present a comprehensive review of the existing applications of transformers in single-cell analysis and critically discuss their future potential for single-cell biology. By studying limitations and technical challenges, we aim to provide a structured outlook for future research directions at the intersection of machine learning and single-cell biology.
Affiliation(s)
- Artur Szałata
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
- Karin Hrovatin
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
- Sören Becker
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
- Munich Center of Machine Learning, Munich, Germany
- Alejandro Tejada-Lapuerta
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
- Haotian Cui
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Peter Munk Cardiac Center, University Health Network, Toronto, Ontario, Canada
- Bo Wang
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute for Artificial Intelligence, Toronto, Ontario, Canada
- Peter Munk Cardiac Center, University Health Network, Toronto, Ontario, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- AI Hub, University Health Network, Toronto, Ontario, Canada
- Fabian J Theis
- Institute of Computational Biology, Helmholtz Center Munich, Munich, Germany
- School of Computing, Information and Technology, Technical University of Munich, Munich, Germany
- TUM School of Life Sciences Weihenstephan, Technical University of Munich, Munich, Germany
47
Xia Y, Liu Y, Li T, He S, Chang H, Wang Y, Zhang Y, Ge W. Assessing parameter efficient methods for pre-trained language model in annotating scRNA-seq data. Methods 2024; 228:12-21. [PMID: 38759908 DOI: 10.1016/j.ymeth.2024.05.007] [Received: 04/12/2024] [Revised: 04/28/2024] [Accepted: 05/10/2024] [Indexed: 05/19/2024]
Abstract
Annotating cell types of single-cell RNA sequencing (scRNA-seq) data is crucial for studying cellular heterogeneity in the tumor microenvironment. Recently, large-scale pre-trained language models (PLMs) have achieved significant progress in cell-type annotation of scRNA-seq data. This approach effectively addresses previous methods' shortcomings in performance and generalization. However, fine-tuning PLMs for different downstream tasks demands considerable computational resources, rendering it impractical. Hence, a new research branch introduces parameter-efficient fine-tuning (PEFT). This involves optimizing a few parameters while leaving the majority unchanged, leading to substantial reductions in computational expenses. Here, we utilize scBERT, a large-scale pre-trained model, to explore the capabilities of three PEFT methods in scRNA-seq cell type annotation. Extensive benchmark studies across several datasets demonstrate the superior applicability of PEFT methods. Furthermore, downstream analysis using models obtained through PEFT showcases their utility in novel cell type discovery and model interpretability for potential marker genes. Our findings underscore the considerable potential of PEFT in PLM-based cell type annotation, presenting novel perspectives for the analysis of scRNA-seq data.
Affiliation(s)
- Yucheng Xia
- Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu, 610209, China
- Yuhang Liu
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Tianhao Li
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Sihan He
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Hong Chang
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yaqing Wang
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yongqing Zhang
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Wenyi Ge
- School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
48
Hao M, Gong J, Zeng X, Liu C, Guo Y, Cheng X, Wang T, Ma J, Zhang X, Song L. Large-scale foundation model on single-cell transcriptomics. Nat Methods 2024; 21:1481-1491. [PMID: 38844628 DOI: 10.1038/s41592-024-02305-7] [Received: 06/02/2023] [Accepted: 05/10/2024] [Indexed: 08/10/2024]
Abstract
Large pretrained models have become foundation models leading to breakthroughs in natural language processing and related fields. Developing foundation models for deciphering the 'languages' of cells and facilitating biomedical research is promising yet challenging. Here we developed a large pretrained model scFoundation, also named 'xTrimoscFoundationα', with 100 million parameters covering about 20,000 genes, pretrained on over 50 million human single-cell transcriptomic profiles. scFoundation is a large-scale model in terms of the size of trainable parameters, dimensionality of genes and volume of training data. Its asymmetric transformer-like architecture and pretraining task design empower effectively capturing complex context relations among genes in a variety of cell types and states. Experiments showed its merit as a foundation model that achieved state-of-the-art performances in a diverse array of single-cell analysis tasks such as gene expression enhancement, tissue drug response prediction, single-cell drug response classification, single-cell perturbation prediction, cell type annotation and gene module inference.
Affiliation(s)
- Minsheng Hao
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing, China
- BioMap, Beijing, China
- Jianzhu Ma
- Department of Electrical Engineering, Tsinghua University, Beijing, China
- Institute for AI Industry Research, Tsinghua University, Beijing, China
- Xuegong Zhang
- MOE Key Laboratory of Bioinformatics and Bioinformatics Division, BNRIST, Department of Automation, Tsinghua University, Beijing, China
- School of Life Sciences and School of Medicine, Center for Synthetic and Systems Biology, Tsinghua University, Beijing, China
- Le Song
- BioMap, Beijing, China
- Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
49
Cui H, Wang C, Maan H, Pang K, Luo F, Duan N, Wang B. scGPT: toward building a foundation model for single-cell multi-omics using generative AI. Nat Methods 2024; 21:1470-1480. [PMID: 38409223 DOI: 10.1038/s41592-024-02201-0] [Received: 07/12/2023] [Accepted: 01/30/2024] [Indexed: 02/28/2024]
Abstract
Generative pretrained models have achieved remarkable success in various domains such as language and computer vision. Specifically, the combination of large-scale diverse datasets and pretrained transformers has emerged as a promising approach for developing foundation models. Drawing parallels between language and cellular biology (in which texts comprise words; similarly, cells are defined by genes), our study probes the applicability of foundation models to advance cellular biology and genetic research. Using burgeoning single-cell sequencing data, we have constructed a foundation model for single-cell biology, scGPT, based on a generative pretrained transformer across a repository of over 33 million cells. Our findings illustrate that scGPT effectively distills critical biological insights concerning genes and cells. Through further adaptation of transfer learning, scGPT can be optimized to achieve superior performance across diverse downstream applications. This includes tasks such as cell type annotation, multi-batch integration, multi-omic integration, perturbation response prediction and gene network inference.
Affiliation(s)
- Haotian Cui
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Chloe Wang
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Hassaan Maan
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Kuan Pang
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Fengning Luo
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Nan Duan
- Microsoft Research, Redmond, WA, USA
- Bo Wang
- Peter Munk Cardiac Centre, University Health Network, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Vector Institute, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Laboratory Medicine and Pathobiology, University of Toronto, Toronto, Ontario, Canada
- AI Hub, University Health Network, Toronto, Ontario, Canada
50
Lin Y, Pan Z, Zeng Y, Yang Y, Dai Z. Detecting novel cell type in single-cell chromatin accessibility data via open-set domain adaptation. Brief Bioinform 2024; 25:bbae370. [PMID: 39073828 DOI: 10.1093/bib/bbae370] [Received: 04/17/2024] [Revised: 06/27/2024] [Accepted: 07/15/2024] [Indexed: 07/30/2024]
Abstract
Recent advances in single-cell technologies enable the rapid growth of multi-omics data. Cell type annotation is one common task in analyzing single-cell data. It is a challenge that some cell types in the testing set are not present in the training set (i.e. unknown cell types). Most scATAC-seq cell type annotation methods generally assign each cell in the testing set to one known type in the training set but neglect unknown cell types. Here, we present OVAAnno, an automatic cell types annotation method which utilizes open-set domain adaptation to detect unknown cell types in scATAC-seq data. Comprehensive experiments show that OVAAnno successfully identifies known and unknown cell types. Further experiments demonstrate that OVAAnno also performs well on scRNA-seq data. Our codes are available online at https://github.com/lisaber/OVAAnno/tree/master.
Affiliation(s)
- Yuefan Lin
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Zixiang Pan
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Yuansong Zeng
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Yuedong Yang
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China
- Zhiming Dai
- School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou, 510006, China