1
Manousidaki A, Little A, Xie Y. Clustering and visualization of single-cell RNA-seq data using path metrics. PLoS Comput Biol 2024; 20:e1012014. [PMID: 38809943 DOI: 10.1371/journal.pcbi.1012014] [Received: 09/22/2023] [Accepted: 03/21/2024] [Indexed: 05/31/2024]
Abstract
Recent advances in single-cell technologies have enabled high-resolution characterization of tissue and cancer compositions. Although numerous tools for dimension reduction and clustering are available for single-cell data analyses, these methods often fail to simultaneously preserve local cluster structure and global data geometry. To address these challenges, we developed a novel analysis framework, Single-Cell Path Metrics Profiling (scPMP), using power-weighted path metrics, which measure distances between cells in a data-driven way. Unlike Euclidean distance and other commonly used distance metrics, path metrics are density sensitive and respect the underlying data geometry. By combining path metrics with multidimensional scaling, a low-dimensional embedding of the data is obtained which preserves both the global data geometry and cluster structure. We evaluate the method for both clustering quality and geometric fidelity, and it outperforms current scRNA-seq clustering algorithms on a wide range of benchmarking data sets.
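The density sensitivity mentioned in this abstract can be illustrated with a small sketch. It assumes the usual definition of a power-weighted path metric (the cheapest path between two points when each hop costs its Euclidean length raised to a power p, with the total then raised to 1/p); whether scPMP uses exactly this form, and how it approximates it at scale, is not stated in the abstract. The function name, the toy points, and the Floyd-Warshall search over the complete graph are illustrative choices, not the authors' implementation.

```python
import math

def power_weighted_path_metric(points, p=2.0):
    """All-pairs power-weighted path distances over the complete graph on `points`.

    Hypothetical helper for illustration; O(n^3), so only suitable for toy inputs.
    """
    n = len(points)
    # Each direct hop costs its Euclidean length raised to the power p.
    cost = [[math.dist(a, b) ** p for b in points] for a in points]
    # Floyd-Warshall: allow every point k in turn as an intermediate hop.
    for k in range(n):
        for i in range(n):
            for j in range(n):
                via_k = cost[i][k] + cost[k][j]
                if via_k < cost[i][j]:
                    cost[i][j] = via_k
    # Undo the power so the result is back on the scale of a distance.
    return [[c ** (1.0 / p) for c in row] for row in cost]

# Toy data: a dense chain from (0, 0) to (3, 0) plus one isolated point at (0, 3).
cells = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (3.0, 0.0), (0.0, 3.0)]
metric = power_weighted_path_metric(cells, p=2.0)

# Both targets are at Euclidean distance 3 from the first point, but the one
# reachable through the dense chain ends up closer under the path metric.
print(metric[0][3], metric[0][4])
```

With p = 1 the metric reduces to the Euclidean distance; larger p rewards paths made of many short hops, which is what makes the distance depend on local density.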
Affiliation(s)
- Andriana Manousidaki
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
- Anna Little
- Department of Mathematics, University of Utah, Salt Lake City, Utah, United States of America
- Yuying Xie
- Department of Statistics and Probability, Michigan State University, East Lansing, Michigan, United States of America
- Department of Computational Mathematics, Science and Engineering, Michigan State University, East Lansing, Michigan, United States of America
2
Li R, Du K, Zhang C, Shen X, Yun L, Wang S, Li Z, Sun Z, Wei J, Li Y, Guo B, Sun C. Single-cell transcriptome profiling reveals the spatiotemporal distribution of triterpenoid saponin biosynthesis and transposable element activity in Gynostemma pentaphyllum shoot apexes and leaves. Front Plant Sci 2024; 15:1394587. [PMID: 38779067 PMCID: PMC11109411 DOI: 10.3389/fpls.2024.1394587] [Received: 03/01/2024] [Accepted: 04/24/2024] [Indexed: 05/25/2024]
Abstract
Gynostemma pentaphyllum (Thunb.) Makino is an important producer of dammarene-type triterpenoid saponins. These saponins (gypenosides) exhibit diverse pharmacological benefits such as anticancer, antidiabetic, and immunomodulatory effects, and have major potential in the pharmaceutical and health care industries. Here, we employed single-cell RNA sequencing (scRNA-seq) to profile the transcriptomes of more than 50,000 cells derived from G. pentaphyllum shoot apexes and leaves. Following cell clustering and annotation, we identified five major cell types in shoot apexes and four in leaves. Each cell type displayed substantial transcriptomic heterogeneity both within and between tissues. Examining gene expression patterns across various cell types revealed that gypenoside biosynthesis predominantly occurred in mesophyll cells, with heightened activity observed in shoot apexes compared to leaves. Furthermore, we explored the impact of transposable elements (TEs) on G. pentaphyllum transcriptomic landscapes. Our findings highlighted the unbalanced expression of certain TE families across different cell types in shoot apexes and leaves, marking the first investigation of TE expression at the single-cell level in plants. Additionally, we observed dynamic expression of genes involved in gypenoside biosynthesis and specific TE families during epidermal and vascular cell development. The involvement of TE expression in regulating cell differentiation and gypenoside biosynthesis warrants further exploration. Overall, this study not only provides new insights into the spatiotemporal organization of gypenoside biosynthesis and TE activity in G. pentaphyllum shoot apexes and leaves but also offers valuable cellular and genetic resources for a deeper understanding of developmental and physiological processes at single-cell resolution in this species.
Affiliation(s)
- Rucan Li
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Ke Du
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Chuyi Zhang
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Xiaofeng Shen
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Lingling Yun
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Shu Wang
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Ziqin Li
- College of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, Shandong, China
- Zhiying Sun
- College of Pharmacy, Shandong University of Traditional Chinese Medicine, Jinan, Shandong, China
- Jianhe Wei
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Ying Li
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Baolin Guo
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
- Chao Sun
- Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing, China
3
Park Y, Hauschild AC. The effect of data transformation on low-dimensional integration of single-cell RNA-seq. BMC Bioinformatics 2024; 25:171. [PMID: 38689234 PMCID: PMC11059821 DOI: 10.1186/s12859-024-05788-5] [Received: 11/02/2023] [Accepted: 04/16/2024] [Indexed: 05/02/2024]
Abstract
BACKGROUND: Recent developments in single-cell RNA sequencing have opened up a multitude of possibilities to study tissues at the level of cellular populations. However, the heterogeneity in single-cell sequencing data necessitates appropriate procedures to adjust for technological limitations and various sources of noise when integrating datasets from different studies. While many analysis procedures employ various preprocessing steps, they often overlook the importance of selecting and optimizing the employed data transformation methods. RESULTS: This work investigates data transformation approaches used in single-cell clustering analysis tools and their effects on batch integration analysis. In particular, we compare 16 transformations and their impact on the low-dimensional representations, aiming to reduce the batch effect and integrate multiple single-cell sequencing datasets. Our results show that data transformations strongly influence the results of single-cell clustering in low-dimensional data spaces, such as those generated by UMAP or PCA. Moreover, these changes in low-dimensional space also significantly affect trajectory analysis using multiple datasets. However, the performance of the data transformations varies greatly across datasets, and the optimal method was different for each dataset. Additionally, we explored how data transformation impacts the analysis of deep feature encodings using deep neural network-based models, including autoencoder-based models and prototypical networks. Data transformation also strongly affects the outcome of deep neural network models. CONCLUSIONS: Our findings suggest that the batch effect and noise in integrative analysis are highly influenced by data transformation. Low-dimensional features can integrate different batches well when proper data transformation is applied. Furthermore, we found that the batch mixing score in low-dimensional space can guide the selection of the optimal data transformation. In conclusion, data preprocessing is one of the most crucial analysis steps and needs to be cautiously considered in the integrative analysis of multiple scRNA-seq datasets.
Affiliation(s)
- Youngjun Park
- Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
- International Max Planck Research Schools for Genome Science, Georg-August-Universität Göttingen, Göttingen, Germany
- Anne-Christin Hauschild
- Department of Medical Informatics, University Medical Center Göttingen, Göttingen, Germany
- Campus-Institute Data Science (CIDAS), Georg-August-Universität Göttingen, Göttingen, Germany
4
Kim H, Chang W, Chae SJ, Park JE, Seo M, Kim JK. scLENS: data-driven signal detection for unbiased scRNA-seq data analysis. Nat Commun 2024; 15:3575. [PMID: 38678050 DOI: 10.1038/s41467-024-47884-3] [Received: 10/18/2023] [Accepted: 04/14/2024] [Indexed: 04/29/2024]
Abstract
High dimensionality and noise have limited the new biological insights that can be discovered in scRNA-seq data. While dimensionality reduction tools have been developed to extract biological signals from the data, they often require manual determination of signal dimension, introducing user bias. Furthermore, a common data preprocessing method, log normalization, can unintentionally distort signals in the data. Here, we develop scLENS, a dimensionality reduction tool that circumvents the long-standing issues of signal distortion and manual input. Specifically, we identify the primary cause of signal distortion during log normalization and effectively address it by uniformizing cell vector lengths with L2 normalization. Furthermore, we utilize random matrix theory-based noise filtering and a signal robustness test to enable data-driven determination of the threshold for signal dimensions. Our method outperforms 11 widely used dimensionality reduction tools and performs particularly well for challenging scRNA-seq datasets with high sparsity and variability. To facilitate the use of scLENS, we provide a user-friendly package that automates accurate signal detection of scRNA-seq data without manual time-consuming tuning.
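As a rough illustration of "uniformizing cell vector lengths with L2 normalization", the sketch below log-normalizes two toy cells and then rescales each to unit Euclidean length. The scale factor of 10,000, the function names, and the toy counts are assumptions made for this example; the actual scLENS preprocessing may include further steps that the abstract does not describe.

```python
import math

def log_normalize(counts, scale=10_000.0):
    """Library-size scaling followed by log(1 + x); the scale factor is an assumed convention."""
    total = sum(counts)
    if total == 0:
        return [0.0 for _ in counts]
    return [math.log1p(scale * c / total) for c in counts]

def l2_normalize(vec):
    """Rescale a cell vector to unit Euclidean length (all-zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm == 0.0:
        return list(vec)
    return [v / norm for v in vec]

def length(vec):
    return math.sqrt(sum(v * v for v in vec))

# Toy counts for four genes: a shallowly sequenced cell and a deeply sequenced one.
cells = {"shallow_cell": [1, 0, 2, 0], "deep_cell": [40, 10, 80, 5]}

logged = {name: log_normalize(c) for name, c in cells.items()}
unit = {name: l2_normalize(v) for name, v in logged.items()}

for name in cells:
    # Lengths differ after log normalization alone and are equal after L2 normalization.
    print(name, round(length(logged[name]), 3), round(length(unit[name]), 3))
```

After the second step every cell vector has length 1, so differences between cells reflect the direction of the expression profile rather than how many genes were detected.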
Affiliation(s)
- Hyun Kim
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
- Won Chang
- Division of Statistics and Data Science, University of Cincinnati, Cincinnati, OH, 45221, USA
- Seok Joo Chae
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
- Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea
- Jong-Eun Park
- Graduate School of Medical Science and Engineering, KAIST, Daejeon, 34141, Republic of Korea
- Minseok Seo
- Department of Computer and Information Science, Korea University, Sejong, 30019, Republic of Korea
- Jae Kyoung Kim
- Biomedical Mathematics Group, Pioneer Research Center for Mathematical and Computational Sciences, Institute for Basic Science, Daejeon, 34126, Republic of Korea
- Department of Mathematical Sciences, KAIST, Daejeon, 34141, Republic of Korea
5
An S, Shi J, Liu R, Chen Y, Wang J, Hu S, Xia X, Dong G, Bo X, He Z, Ying X. scDAC: deep adaptive clustering of single-cell transcriptomic data with coupled autoencoder and Dirichlet process mixture model. Bioinformatics 2024; 40:btae198. [PMID: 38603616 DOI: 10.1093/bioinformatics/btae198] [Received: 05/04/2023] [Revised: 03/20/2024] [Accepted: 04/10/2024] [Indexed: 04/13/2024]
Abstract
MOTIVATION: Clustering analysis for single-cell RNA sequencing (scRNA-seq) data is an important step in revealing cellular heterogeneity. Many clustering methods have been proposed to discover heterogeneous cell types from scRNA-seq data. However, adaptive clustering with an accurate cluster number that reflects the intrinsic biology of large-scale scRNA-seq data remains quite challenging. RESULTS: Here, we propose a single-cell Deep Adaptive Clustering (scDAC) model by coupling the Autoencoder (AE) and the Dirichlet Process Mixture Model (DPMM). By jointly optimizing the model parameters of AE and DPMM, scDAC achieves adaptive clustering with accurate cluster numbers on scRNA-seq data. We verify the performance of scDAC on five subsampled datasets with different numbers of cell types and compare it with 15 widely used clustering methods across nine scRNA-seq datasets. Our results demonstrate that scDAC can adaptively find accurate numbers of cell types or subtypes and outperforms other methods. Moreover, the performance of scDAC is robust to hyperparameter changes. AVAILABILITY AND IMPLEMENTATION: scDAC is implemented in Python. The source code is available at https://github.com/labomics/scDAC.
Affiliation(s)
- Sijing An
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Jinhui Shi
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Runyan Liu
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Yaowen Chen
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Jing Wang
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Shuofeng Hu
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Xinyu Xia
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Guohua Dong
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Xiaochen Bo
- Department of Bioinformatics, Institute of Health Service and Transfusion Medicine, Beijing 100850, China
- Zhen He
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
- Xiaomin Ying
- Center for Computational Biology, Beijing Institute of Basic Medical Sciences, Beijing 100850, China
6
Shi Y, Wan J, Zhang X, Liang T, Yin Y. scCRT: a contrastive-based dimensionality reduction model for scRNA-seq trajectory inference. Brief Bioinform 2024; 25:bbae204. [PMID: 38701412 PMCID: PMC11066919 DOI: 10.1093/bib/bbae204] [Received: 11/29/2023] [Revised: 03/28/2024] [Accepted: 04/15/2024] [Indexed: 05/05/2024]
Abstract
Trajectory inference is a crucial task in single-cell RNA-sequencing downstream analysis, which can reveal the dynamic processes of biological development, including cell differentiation. Dimensionality reduction is an important step in the trajectory inference process. However, most existing trajectory methods rely on cell features derived from traditional dimensionality reduction methods, such as principal component analysis and uniform manifold approximation and projection. These methods are not specifically designed for trajectory inference and fail to fully leverage prior information from upstream analysis, limiting their performance. Here, we introduce scCRT, a novel dimensionality reduction model for trajectory inference. In order to utilize prior information to learn accurate cell representations, scCRT integrates two feature learning components: a cell-level pairwise module and a cluster-level contrastive module. The cell-level module focuses on learning accurate cell representations in a reduced-dimensionality space while maintaining the cell-cell positional relationships in the original space. The cluster-level contrastive module uses prior cell state information to aggregate similar cells, preventing excessive dispersion in the low-dimensional space. Experimental findings from 54 real and 81 synthetic datasets, totaling 135 datasets, highlighted the superior performance of scCRT compared with commonly used trajectory inference methods. Additionally, an ablation study revealed that both cell-level and cluster-level modules enhance the model's ability to learn accurate cell features, facilitating cell lineage inference. The source code of scCRT is available at https://github.com/yuchen21-web/scCRT-for-scRNA-seq.
Affiliation(s)
- Yuchen Shi
- Hangzhou Dianzi University, Hangzhou City, Zhejiang Province, China
- Jian Wan
- Hangzhou Dianzi University, the Key Laboratory of Biomedical Intelligent Computing Technology of Zhejiang Province, and Zhejiang University of Science and Technology, Hangzhou City, Zhejiang Province, China
- Xin Zhang
- Hangzhou Dianzi University, Hangzhou City, Zhejiang Province, China
- Tingting Liang
- Hangzhou Dianzi University, Hangzhou City, Zhejiang Province, China
- Yuyu Yin
- Hangzhou Dianzi University, Hangzhou City, Zhejiang Province, China
7
Ko KD, Sartorelli V. A deep learning adversarial autoencoder with dynamic batching displays high performance in denoising and ordering scRNA-seq data. iScience 2024; 27:109027. [PMID: 38361616 PMCID: PMC10867661 DOI: 10.1016/j.isci.2024.109027] [Received: 09/07/2023] [Revised: 11/20/2023] [Accepted: 01/22/2024] [Indexed: 02/17/2024]
Abstract
By providing a high-resolution view of cell-to-cell variation in gene expression, single-cell RNA sequencing (scRNA-seq) offers insights into cell heterogeneity, differentiation dynamics, and disease mechanisms. However, challenges such as low capture rates and dropout events can introduce noise in data analysis. Here, we propose a deep neural generative framework, the dynamic batching adversarial autoencoder (DB-AAE), which excels at denoising scRNA-seq datasets. DB-AAE directly captures optimal features from input data and enhances feature preservation, including cell type-specific gene expression patterns. Comprehensive evaluation on simulated and real datasets demonstrates that DB-AAE outperforms other methods in denoising accuracy and biological signal preservation. It also improves the accuracy of other algorithms in establishing pseudo-time inference. This study highlights DB-AAE's effectiveness and potential as a valuable tool for enhancing the quality and reliability of downstream analyses in scRNA-seq research.
Affiliation(s)
- Kyung Dae Ko
- Laboratory of Muscle Stem Cells & Gene Regulation, NIAMS, NIH, Bethesda, MD, USA
- Vittorio Sartorelli
- Laboratory of Muscle Stem Cells & Gene Regulation, NIAMS, NIH, Bethesda, MD, USA
8
Marghi Y, Gala R, Baftizadeh F, Sümbül U. Joint inference of discrete cell types and continuous type-specific variability in single-cell datasets with MMIDAS. bioRxiv 2024:2023.10.02.560574. [PMID: 37873271 PMCID: PMC10592946 DOI: 10.1101/2023.10.02.560574] [Indexed: 10/25/2023]
Abstract
Reproducible definition and identification of cell types is essential to enable investigations into their biological function, and understanding their relevance in the context of development, disease and evolution. Current approaches model variability in data as continuous latent factors, followed by clustering as a separate step, or immediately apply clustering on the data. We show that such approaches can suffer from qualitative mistakes in identifying cell types robustly, particularly when the number of such cell types is in the hundreds or even thousands. Here, we propose an unsupervised method, MMIDAS, which combines a generalized mixture model with a multi-armed deep neural network, to jointly infer the discrete type and continuous type-specific variability. Using four recent datasets of brain cells spanning different technologies, species, and conditions, we demonstrate that MMIDAS can identify reproducible cell types and infer cell type-dependent continuous variability in both uni-modal and multi-modal datasets.
Affiliation(s)
- Rohan Gala
- Allen Institute, 615 Westlake Ave N, Seattle, WA, USA
- Uygar Sümbül
- Allen Institute, 615 Westlake Ave N, Seattle, WA, USA
- Paul G. Allen School of Computer Science & Engineering, University of Washington, Seattle, WA, USA
9
Zhou J, Chen S, Wu Y, Li H, Zhang B, Zhou L, Hu Y, Xiang Z, Li Z, Chen N, Han W, Xu C, Wang D, Gao X. PPML-Omics: A privacy-preserving federated machine learning method protects patients' privacy in omic data. Sci Adv 2024; 10:eadh8601. [PMID: 38295178 PMCID: PMC10830108 DOI: 10.1126/sciadv.adh8601] [Received: 03/18/2023] [Accepted: 12/29/2023] [Indexed: 02/02/2024]
Abstract
Modern machine learning models applied to omic data analysis tasks pose a threat of privacy leakage for the patients involved in those datasets. Here, we proposed a secure and privacy-preserving machine learning method (PPML-Omics) by designing a decentralized differentially private federated learning algorithm. We applied PPML-Omics to analyze data from three sequencing technologies and addressed the privacy concern in three major tasks of omic data under three representative deep learning models. We examined privacy breaches in depth through privacy attack experiments and demonstrated that PPML-Omics could protect patients' privacy. In each of these applications, PPML-Omics was able to outperform methods of comparison under the same level of privacy guarantee, demonstrating the versatility of the method in simultaneously balancing the privacy-preserving capability and utility in omic data analysis. Furthermore, we gave the theoretical proof of the privacy-preserving capability of PPML-Omics, suggesting the first mathematically guaranteed method with robust and generalizable empirical performance in protecting patients' privacy in omic data.
Affiliation(s)
- Juexiao Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Siyuan Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Yulian Wu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Haoyang Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Bin Zhang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Longxi Zhou
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Yan Hu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Zihang Xiang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Zhongxiao Li
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Ningning Chen
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Wenkai Han
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Chencheng Xu
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Di Wang
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
- Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal 23955-6900, Kingdom of Saudi Arabia
10
Tyler SR, Lozano-Ojalvo D, Guccione E, Schadt EE. Anti-correlated feature selection prevents false discovery of subpopulations in scRNAseq. Nat Commun 2024; 15:699. [PMID: 38267438 PMCID: PMC10808220 DOI: 10.1038/s41467-023-43406-9] [Received: 12/13/2022] [Accepted: 11/07/2023] [Indexed: 01/26/2024]
Abstract
While sub-clustering cell populations has become popular in single-cell omics, negative controls for this process are lacking. Popular feature-selection/clustering algorithms fail the null-dataset problem, allowing erroneous subdivisions of homogeneous clusters until nearly each cell is called its own cluster. Using real and synthetic datasets, we find that anti-correlated gene selection reduces or eliminates erroneous subdivisions, increases marker-gene selection efficacy, and efficiently scales to millions of cells.
Affiliation(s)
- Scott R Tyler
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Daniel Lozano-Ojalvo
- Department of Dermatology, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Ernesto Guccione
- Department of Oncological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Center for Therapeutics Discovery, Department of Oncological Sciences and Pharmacological Sciences, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Bioinformatics for Next Generation Sequencing (BiNGS) Shared Resource Facility, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Eric E Schadt
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
11
Hu T, Allam M, Cai S, Henderson W, Yueh B, Garipcan A, Ievlev AV, Afkarian M, Beyaz S, Coskun AF. Single-cell spatial metabolomics with cell-type specific protein profiling for tissue systems biology. Nat Commun 2023; 14:8260. [PMID: 38086839 PMCID: PMC10716522 DOI: 10.1038/s41467-023-43917-5] [Received: 01/06/2023] [Accepted: 11/23/2023] [Indexed: 12/18/2023]
Abstract
Metabolic reprogramming in cancer and immune cells occurs to support their increasing energy needs in biological tissues. Here we propose Single Cell Spatially resolved Metabolic (scSpaMet) framework for joint protein-metabolite profiling of single immune and cancer cells in male human tissues by incorporating untargeted spatial metabolomics and targeted multiplexed protein imaging in a single pipeline. We utilized the scSpaMet to profile cell types and spatial metabolomic maps of 19507, 31156, and 8215 single cells in human lung cancer, tonsil, and endometrium tissues, respectively. The scSpaMet analysis revealed cell type-dependent metabolite profiles and local metabolite competition of neighboring single cells in human tissues. Deep learning-based joint embedding revealed unique metabolite states within cell types. Trajectory inference showed metabolic patterns along cell differentiation paths. Here we show scSpaMet's ability to quantify and visualize the cell-type specific and spatially resolved metabolic-protein mapping as an emerging tool for systems-level understanding of tissue biology.
Affiliation(s)
- Thomas Hu
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| | - Mayar Allam
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
| | - Shuangyi Cai
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA
| | - Walter Henderson
- Institute for Electronics and Nanotechnology, Georgia Institute of Technology, Atlanta, GA, USA
| | - Brian Yueh
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | | | - Anton V Ievlev
- Oak Ridge National Laboratory, Center for Nanophase Materials Sciences, Oak Ridge, TN, USA
| | - Maryam Afkarian
- Division of Nephrology, Department of Internal Medicine, University of California, Davis, CA, USA
| | - Semir Beyaz
- Cold Spring Harbor Laboratory, Cold Spring Harbor, NY, USA
| | - Ahmet F Coskun
- Wallace H. Coulter Department of Biomedical Engineering, Georgia Institute of Technology and Emory University, Atlanta, GA, USA.
- Interdisciplinary Bioengineering Graduate Program, Georgia Institute of Technology, Atlanta, GA, USA.
- Winship Cancer Institute, Emory University, Atlanta, GA, USA.
- Parker H. Petit Institute for Bioengineering and Bioscience, Georgia Institute of Technology, Atlanta, GA, USA.
| |
|
12
|
Wang Z, Xie X, Liu S, Ji Z. scFseCluster: a feature selection-enhanced clustering for single-cell RNA-seq data. Life Sci Alliance 2023; 6:e202302103. [PMID: 37788907 PMCID: PMC10547911 DOI: 10.26508/lsa.202302103] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2023] [Revised: 09/21/2023] [Accepted: 09/22/2023] [Indexed: 10/05/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) enables researchers to reveal previously unknown cell heterogeneity and functional diversity, which is impossible with bulk RNA sequencing. Clustering approaches are widely used for analyzing scRNA-seq data and identifying cell types and states. In the past few years, various advanced computational strategies have emerged. However, low generalization and high computational cost are the main bottlenecks of existing methods. In this study, we established a novel computational framework, scFseCluster, for scRNA-seq clustering analysis. scFseCluster incorporates a metaheuristic algorithm (Feature Selection based on Quantum Squirrel Search Algorithm) to extract the optimal gene set, which largely guarantees the performance of cell clustering. We conducted simulation experiments in several aspects to verify the performance of the proposed approach. scFseCluster performed very well on eight benchmark scRNA-seq datasets because of the optimal gene sets obtained using the Feature Selection based on Quantum Squirrel Search Algorithm. The comparative study demonstrated the significant advantages of scFseCluster over seven state-of-the-art algorithms. In addition, our analysis shows that feature selection on highly variable genes can significantly improve clustering performance. In conclusion, our study demonstrates that scFseCluster is a highly versatile tool for enhancing scRNA-seq data clustering analysis.
Affiliation(s)
- Zongqin Wang
- https://ror.org/05td3s095 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
| | - Xiaojun Xie
- https://ror.org/05td3s095 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- https://ror.org/05td3s095 Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
| | - Shouyang Liu
- https://ror.org/05td3s095 Academy for Advanced Interdisciplinary Studies, Nanjing Agricultural University, Nanjing, China
| | - Zhiwei Ji
- https://ror.org/05td3s095 College of Artificial Intelligence, Nanjing Agricultural University, Nanjing, China
- https://ror.org/05td3s095 Center for Data Science and Intelligent Computing, Nanjing Agricultural University, Nanjing, China
| |
|
13
|
Baig Y, Ma HR, Xu H, You L. Autoencoder neural networks enable low dimensional structure analyses of microbial growth dynamics. Nat Commun 2023; 14:7937. [PMID: 38049401 PMCID: PMC10696002 DOI: 10.1038/s41467-023-43455-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2023] [Accepted: 11/09/2023] [Indexed: 12/06/2023] Open
Abstract
Representing microbiome dynamics effectively is a crucial challenge in the quantitative analysis and engineering of microbiomes. By using autoencoder neural networks, we show that microbial growth dynamics can be compressed into low-dimensional representations and reconstructed with high fidelity. These low-dimensional embeddings are as effective as, if not better than, raw data for tasks such as identifying bacterial strains, predicting traits like antibiotic resistance, and predicting community dynamics. Additionally, we demonstrate that essential dynamical information of these systems can be captured using far fewer variables than traditional mechanistic models. Our work suggests that machine learning can enable the creation of concise representations of high-dimensional microbiome dynamics to facilitate data analysis and gain new biological insights.
Affiliation(s)
- Yasa Baig
- Department of Physics, Duke University, Durham, NC, USA
- Department of Computer Science, Duke University, Durham, NC, USA
| | - Helena R Ma
- Department of Biomedical Engineering, Duke University, Durham, NC, USA
- Center for Quantitative Biodesign, Duke University, Durham, NC, USA
| | - Helen Xu
- Department of Computer Science, Duke University, Durham, NC, USA
| | - Lingchong You
- Department of Biomedical Engineering, Duke University, Durham, NC, USA.
- Center for Quantitative Biodesign, Duke University, Durham, NC, USA.
- Department of Molecular Genetics and Microbiology, Duke University School of Medicine, Durham, NC, USA.
| |
|
14
|
Yin Q, Chen L. CellTICS: an explainable neural network for cell-type identification and interpretation based on single-cell RNA-seq data. Brief Bioinform 2023; 25:bbad449. [PMID: 38061196 PMCID: PMC10703497 DOI: 10.1093/bib/bbad449] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/27/2023] [Revised: 10/30/2023] [Accepted: 11/14/2023] [Indexed: 12/18/2023] Open
Abstract
Identifying cell types is crucial for understanding the functional units of an organism. Machine learning has shown promising performance in identifying cell types, but many existing methods lack biological significance due to poor interpretability. However, it is of the utmost importance to understand what makes cells share the same function and form a specific cell type, motivating us to propose a biologically interpretable method. CellTICS prioritizes marker genes with cell-type-specific expression, using a hierarchy of biological pathways for neural network construction, and applying a multi-predictive-layer strategy to predict cell and sub-cell types. CellTICS usually outperforms existing methods in prediction accuracy. Moreover, CellTICS can reveal pathways that define a cell type or a cell type under specific physiological conditions, such as disease or aging. The nonlinear nature of neural networks enables us to identify many novel pathways. Interestingly, some of the pathways identified by CellTICS exhibit differential expression "variability" rather than differential expression across cell types, indicating that expression stochasticity within a pathway could be an important feature characteristic of a cell type. Overall, CellTICS provides a biologically interpretable method for identifying and characterizing cell types, shedding light on the underlying pathways that define cellular heterogeneity and its role in organismal function. CellTICS is available at https://github.com/qyyin0516/CellTICS.
Affiliation(s)
- Qingyang Yin
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
| | - Liang Chen
- Department of Quantitative and Computational Biology, University of Southern California, 1050 Childs Way, Los Angeles, CA 90089, United States
| |
|
15
|
Sun P, Fan S, Li S, Zhao Y, Lu C, Wong KC, Li X. Automated exploitation of deep learning for cancer patient stratification across multiple types. Bioinformatics 2023; 39:btad654. [PMID: 37934154 PMCID: PMC10636288 DOI: 10.1093/bioinformatics/btad654] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2022] [Revised: 10/17/2023] [Indexed: 11/08/2023] Open
Abstract
MOTIVATION: Recent deep learning frameworks have been developed to identify cancer subtypes from high-throughput gene expression profiles. However, the performance of deep learning depends heavily on the neural network architecture, which is often hand-crafted by experts in deep neural networks, and optimizing and adjusting the network is usually costly and time-consuming. RESULTS: To address these limitations, we proposed a fully automated deep neural architecture search model for diagnosing consensus molecular subtypes from gene expression data (DNAS). The proposed model uses the ant colony algorithm, a heuristic swarm intelligence algorithm, to search and optimize the neural network architecture, and it can automatically find the optimal deep learning model architecture for cancer diagnosis in its search space. We validated DNAS on eight colorectal cancer datasets, achieving an average accuracy of 95.48%, an average specificity of 98.07%, and an average sensitivity of 96.24%. Without loss of generality, we further investigated the applicability of DNAS to other cancer types from different platforms, including lung cancer and breast cancer, where DNAS achieved an area under the curve of 95% and 96%, respectively. In addition, we conducted gene ontology enrichment and pathological analysis to reveal insights into cancer subtype identification and characterization across multiple cancer types. AVAILABILITY AND IMPLEMENTATION: The source code and data can be downloaded from https://github.com/userd113/DNAS-main. The DNAS web server is publicly accessible at 119.45.145.120:5001.
Affiliation(s)
- Pingping Sun
- School of Information Science and Technology, Northeast Normal University, Jilin, China
| | - Shijie Fan
- School of Information Science and Technology, Northeast Normal University, Jilin, China
| | - Shaochuan Li
- School of Information Science and Technology, Northeast Normal University, Jilin, China
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yingwei Zhao
- School of Information Science and Technology, Northeast Normal University, Jilin, China
| | - Chang Lu
- School of Information Science and Technology, Northeast Normal University, Jilin, China
- School of Psychology, Northeast Normal University, Jilin, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong, China
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| |
|
16
|
Yang L, Ng YE, Sun H, Li Y, Chini LCS, LeBrasseur NK, Chen J, Zhang X. Single-cell Mayo Map (scMayoMap): an easy-to-use tool for cell type annotation in single-cell RNA-sequencing data analysis. BMC Biol 2023; 21:223. [PMID: 37858214 PMCID: PMC10588107 DOI: 10.1186/s12915-023-01728-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/13/2023] [Accepted: 10/06/2023] [Indexed: 10/21/2023] Open
Abstract
BACKGROUND: Single-cell RNA-sequencing (scRNA-seq) has become a widely used tool for both basic and translational biomedical research. In scRNA-seq data analysis, cell type annotation is an essential but challenging step. In the past few years, several annotation tools have been developed. These methods require either labeled training/reference datasets, which are not always available, or a list of predefined cell subset markers, which are subject to biases. Thus, a user-friendly and precise annotation tool is still critically needed. RESULTS: We curated a comprehensive cell marker database named scMayoMapDatabase and developed a companion R package scMayoMap, an easy-to-use single-cell annotation tool, to provide fast and accurate cell type annotation. The effectiveness of scMayoMap was demonstrated in 48 independent scRNA-seq datasets across different platforms and tissues. Additionally, the scMayoMapDatabase can be integrated with other tools and further improve their performance. CONCLUSIONS: scMayoMap and scMayoMapDatabase will help investigators to define the cell types in their scRNA-seq data in a streamlined and user-friendly way.
Affiliation(s)
- Lu Yang
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA
| | - Yan Er Ng
- Robert and Arlene Kogod Center On Aging, Mayo Clinic, Rochester, MN, 55905, USA
| | - Haipeng Sun
- Department of Biochemistry and Microbiology, Rutgers University, New Brunswick, NJ, 08901, USA
| | - Ying Li
- Department of Quantitative Health Sciences, Mayo Clinic, Jacksonville, FL, 32224, USA
| | - Lucas C S Chini
- Robert and Arlene Kogod Center On Aging, Mayo Clinic, Rochester, MN, 55905, USA
| | - Nathan K LeBrasseur
- Robert and Arlene Kogod Center On Aging, Mayo Clinic, Rochester, MN, 55905, USA.
- Department of Physical Medicine and Rehabilitation, Mayo Clinic, Rochester, MN, 55905, USA.
| | - Jun Chen
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, 55905, USA.
- Center for Individualized Medicine, Mayo Clinic, Rochester, MN, 55905, USA.
| | - Xu Zhang
- Robert and Arlene Kogod Center On Aging, Mayo Clinic, Rochester, MN, 55905, USA.
- Department of Biochemistry and Molecular Biology, Mayo Clinic, Rochester, MN, 55905, USA.
| |
|
17
|
Ma Y, Deng C, Zhou Y, Zhang Y, Qiu F, Jiang D, Zheng G, Li J, Shuai J, Zhang Y, Yang J, Su J. Polygenic regression uncovers trait-relevant cellular contexts through pathway activation transformation of single-cell RNA sequencing data. CELL GENOMICS 2023; 3:100383. [PMID: 37719150 PMCID: PMC10504677 DOI: 10.1016/j.xgen.2023.100383] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/23/2023] [Revised: 05/26/2023] [Accepted: 07/25/2023] [Indexed: 09/19/2023]
Abstract
Advances in single-cell RNA sequencing (scRNA-seq) techniques have accelerated functional interpretation of disease-associated variants discovered from genome-wide association studies (GWASs). However, identification of trait-relevant cell populations is often impeded by inherent technical noise and high sparsity in scRNA-seq data. Here, we developed scPagwas, a computational approach that uncovers trait-relevant cellular context by integrating pathway activation transformation of scRNA-seq data and GWAS summary statistics. scPagwas effectively prioritizes trait-relevant genes, which facilitates identification of trait-relevant cell types/populations with high accuracy in extensive simulated and real datasets. Cellular-level association results identified a novel subpopulation of naive CD8+ T cells related to COVID-19 severity and oligodendrocyte progenitor cell and microglia subsets with critical pathways by which genetic variants influence Alzheimer's disease. Overall, our approach provides new insights for the discovery of trait-relevant cell types and improves the mechanistic understanding of disease variants from a pathway perspective.
Affiliation(s)
- Yunlong Ma
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
| | - Chunyu Deng
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150080, China
| | - Yijun Zhou
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
| | - Yaru Zhang
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
| | - Fei Qiu
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
| | - Dingping Jiang
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
| | - Gongwei Zheng
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
| | - Jingjing Li
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
| | - Jianwei Shuai
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
| | - Yan Zhang
- School of Life Science and Technology, Harbin Institute of Technology, Harbin, Heilongjiang 150080, China
| | - Jian Yang
- School of Life Sciences, Westlake University, Hangzhou, Zhejiang 310012, China
- Westlake Laboratory of Life Sciences and Biomedicine, Hangzhou, Zhejiang 310024, China
| | - Jianzhong Su
- School of Biomedical Engineering, School of Ophthalmology & Optometry and Eye Hospital, Wenzhou Medical University, Wenzhou, Zhejiang 325027, China
- Oujiang Laboratory, Zhejiang Lab for Regenerative Medicine, Vision and Brain Health, Wenzhou, Zhejiang 325101, China
| |
|
18
|
Zhang J, Li J, Lin L. Statistical and machine learning methods for immunoprofiling based on single-cell data. Hum Vaccin Immunother 2023:2234792. [PMID: 37485833 PMCID: PMC10373621 DOI: 10.1080/21645515.2023.2234792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2023] [Revised: 06/30/2023] [Accepted: 07/04/2023] [Indexed: 07/25/2023] Open
Abstract
Immunoprofiling has become a crucial tool for understanding the complex interactions between the immune system and diseases or interventions, such as therapies and vaccinations. Immune response biomarkers are critical for understanding those relationships and potentially developing personalized intervention strategies. Single-cell data have emerged as a promising source for identifying immune response biomarkers. In this review, we discuss the current state-of-the-art methods for immunoprofiling, including those for reducing the dimensionality of high-dimensional single-cell data and methods for clustering, classification, and prediction. We also draw attention to recent developments in data integration.
Affiliation(s)
- Jingxuan Zhang
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| | - Jia Li
- Department of Statistics, Pennsylvania State University, University Park, PA, USA
| | - Lin Lin
- Department of Biostatistics and Bioinformatics, Duke University, Durham, NC, USA
| |
|
19
|
Sheng Y, Barak B, Nitzan M. Robust reconstruction of single-cell RNA-seq data with iterative gene weight updates. Bioinformatics 2023; 39:i423-i430. [PMID: 37387155 DOI: 10.1093/bioinformatics/btad253] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION: Single-cell RNA-sequencing technologies have greatly enhanced our understanding of heterogeneous cell populations and underlying regulatory processes. However, structural (spatial or temporal) relations between cells are lost during cell dissociation. These relations are crucial for identifying associated biological processes. Many existing tissue-reconstruction algorithms use prior information about subsets of genes that are informative with respect to the structure or process to be reconstructed. When such information is not available, and in the general case when the input genes code for multiple processes and are susceptible to noise, biological reconstruction is often computationally challenging. RESULTS: We propose an algorithm that iteratively identifies manifold-informative genes using existing reconstruction algorithms for single-cell RNA-seq data as a subroutine. We show that our algorithm improves the quality of tissue reconstruction for diverse synthetic and real scRNA-seq data, including data from the mammalian intestinal epithelium and liver lobules. AVAILABILITY AND IMPLEMENTATION: The code and data for benchmarking are available at github.com/syq2012/iterative_weight_update_for_reconstruction.
Affiliation(s)
- Yueqi Sheng
- School of Engineering and Applied Sciences, Harvard University, Boston, MA 02134, United States
| | - Boaz Barak
- School of Engineering and Applied Sciences, Harvard University, Boston, MA 02134, United States
| | - Mor Nitzan
- School of Computer Science and Engineering, Racah Institute of Physics, Faculty of Medicine, The Hebrew University of Jerusalem, Jerusalem 9190401, Israel
| |
|
20
|
Fan Y, Wang Y, Wang F, Huang L, Yang Y, Wong KC, Li X. Reliable Identification and Interpretation of Single-Cell Molecular Heterogeneity and Transcriptional Regulation using Dynamic Ensemble Pruning. ADVANCED SCIENCE (WEINHEIM, BADEN-WURTTEMBERG, GERMANY) 2023:e2205442. [PMID: 37290050 PMCID: PMC10401140 DOI: 10.1002/advs.202205442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/19/2022] [Revised: 05/11/2023] [Indexed: 06/10/2023]
Abstract
Unsupervised clustering is an essential step in identifying cell types from single-cell RNA sequencing (scRNA-seq) data. However, a common issue with unsupervised clustering models is that, in the absence of supervised information, the optimization direction of the objective function and the final clustering labels may be inconsistent or even arbitrary. To address this challenge, a dynamic ensemble pruning framework (DEPF) is proposed to identify and interpret single-cell molecular heterogeneity. In particular, a silhouette coefficient-based indicator is developed to determine the optimization direction of the bi-objective function. In addition, a hierarchical autoencoder is employed to project the high-dimensional data onto multiple low-dimensional latent space sets, and a clustering ensemble is then produced in the latent space by the basic clustering algorithm. Following that, a bi-objective fruit fly optimization algorithm is designed to dynamically prune the low-quality basic clusterings in the ensemble. Multiple experiments are conducted on 28 real scRNA-seq datasets and one large real scRNA-seq dataset from diverse platforms and species to validate the effectiveness of the DEPF. In addition, biological interpretability and transcriptional and post-transcriptional regulatory analyses are conducted to explore biological patterns from the cell types identified, which could provide novel insights into characterizing the mechanisms.
Affiliation(s)
- Yi Fan
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yunhe Wang
- School of Artificial Intelligence, Hebei University of Technology, Tianjin, China
| | - Fuzhou Wang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
| | - Lei Huang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
| | - Yuning Yang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
| | - Ka-C Wong
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China
| |
|
21
|
Cheng Y, Fan X, Zhang J, Li Y. A scalable sparse neural network framework for rare cell type annotation of single-cell transcriptome data. Commun Biol 2023; 6:545. [PMID: 37210444 DOI: 10.1038/s42003-023-04928-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Accepted: 05/11/2023] [Indexed: 05/22/2023] Open
Abstract
Automatic cell type annotation methods are increasingly used in single-cell RNA sequencing (scRNA-seq) analysis because they are fast and precise. However, current methods often fail to account for the imbalance of scRNA-seq datasets and ignore information from smaller populations, leading to significant biological analysis errors. Here, we introduce scBalance, an integrated sparse neural network framework that incorporates adaptive weight sampling and dropout techniques for auto-annotation tasks. Using 20 scRNA-seq datasets with varying scales and degrees of imbalance, we demonstrate that scBalance outperforms current methods in both intra- and inter-dataset annotation tasks. Additionally, scBalance displays impressive scalability in identifying rare cell types in million-level datasets, as shown in the bronchoalveolar cell landscape. scBalance is also significantly faster than commonly used tools and comes in a user-friendly format, making it a superior tool for scRNA-seq analysis on the Python-based platform.
Affiliation(s)
- Yuqi Cheng
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| | - Xingyu Fan
- School of Information and Software Engineering, University of Electronic Science and Technology of China, 610054, Chengdu, China
| | - Jianing Zhang
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China.
- The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, 518057, Shenzhen, China.
| |
|
22
|
Xu J, Zhang A, Liu F, Chen L, Zhang X. CIForm as a Transformer-based model for cell-type annotation of large-scale single-cell RNA-seq data. Brief Bioinform 2023:7169137. [PMID: 37200157 DOI: 10.1093/bib/bbad195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2023] [Revised: 04/03/2023] [Accepted: 04/30/2023] [Indexed: 05/20/2023] Open
Abstract
Single-cell omics technologies have made it possible to analyze the individual cells within a biological sample, providing a more detailed understanding of biological systems. Accurately determining the cell type of each cell is a crucial goal in single-cell RNA-seq (scRNA-seq) analysis. Apart from overcoming the batch effects arising from various factors, single-cell annotation methods also face the challenge of effectively processing large-scale datasets. With the increasing availability of scRNA-seq datasets, integrating multiple datasets and addressing batch effects originating from diverse sources are also challenges in cell-type annotation. In this work, to overcome these challenges, we developed a supervised method called CIForm based on the Transformer for cell-type annotation of large-scale scRNA-seq data. To assess the effectiveness and robustness of CIForm, we compared it with some leading tools on benchmark datasets. Through systematic comparisons under various cell-type annotation scenarios, we show that the effectiveness of CIForm is particularly pronounced in cell-type annotation. The source code and data are available at https://github.com/zhanglab-wbgcas/CIForm.
Affiliation(s)
- Jing Xu
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
- University of Chinese Academy of Sciences, Beijing 100049, China
| | - Aidi Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| | - Fang Liu
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| | - Liang Chen
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| | - Xiujun Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan 430074, China
| |
|
23
|
Zhang S, Li X, Lin J, Lin Q, Wong KC. Review of single-cell RNA-seq data clustering for cell-type identification and characterization. RNA (NEW YORK, N.Y.) 2023; 29:517-530. [PMID: 36737104 PMCID: PMC10158997 DOI: 10.1261/rna.078965.121] [Citation(s) in RCA: 6] [Impact Index Per Article: 6.0] [Received: 03/27/2022] [Accepted: 01/03/2023] [Indexed: 05/06/2023]
Abstract
In recent years, the advances in single-cell RNA-seq techniques have enabled us to perform large-scale transcriptomic profiling at single-cell resolution in a high-throughput manner. Unsupervised learning such as data clustering has become the central component to identify and characterize novel cell types and gene expression patterns. In this study, we review the existing single-cell RNA-seq data clustering methods with critical insights into the related advantages and limitations. In addition, we also review the upstream single-cell RNA-seq data processing techniques such as quality control, normalization, and dimension reduction. We conduct performance comparison experiments to evaluate several popular single-cell RNA-seq clustering approaches on simulated and multiple single-cell transcriptomic data sets.
Affiliation(s)
- Shixiong Zhang
- School of Computer Science and Technology, Xidian University, Xi'an 710071, China
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin 130012, China
| | - Jiecong Lin
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Qiuzhen Lin
- College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518060, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| |
|
24
|
Nguyen T, Wei Y, Nakada Y, Chen JY, Zhou Y, Walcott G, Zhang J. Analysis of cardiac single-cell RNA-sequencing data can be improved by the use of artificial-intelligence-based tools. Sci Rep 2023; 13:6821. [PMID: 37100826 PMCID: PMC10133286 DOI: 10.1038/s41598-023-32293-1] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/04/2022] [Accepted: 03/25/2023] [Indexed: 04/28/2023] Open
Abstract
Single-cell RNA sequencing (scRNAseq) enables researchers to identify and characterize populations and subpopulations of different cell types in hearts recovering from myocardial infarction (MI) by characterizing the transcriptomes in thousands of individual cells. However, the effectiveness of the currently available tools for processing and interpreting these immense datasets is limited. We incorporated three Artificial Intelligence (AI) techniques into a toolkit for evaluating scRNAseq data: AI Autoencoding separates data from different cell types and subpopulations of cell types (cluster analysis); AI Sparse Modeling identifies genes and signaling mechanisms that are differentially activated between subpopulations (pathway/gene set enrichment analysis), and AI Semisupervised Learning tracks the transformation of cells from one subpopulation into another (trajectory analysis). Autoencoding was often used in data denoising; yet, in our pipeline, Autoencoding was exclusively used for cell embedding and clustering. The performance of our AI scRNAseq toolkit and other highly cited non-AI tools was evaluated with three scRNAseq datasets obtained from the Gene Expression Omnibus database. Autoencoder was the only tool to identify differences between the cardiomyocyte subpopulations found in mice that underwent MI or sham-MI surgery on postnatal day (P) 1. Statistically significant differences between cardiomyocytes from P1-MI mice and mice that underwent MI on P8 were identified for six cell-cycle phases and five signaling pathways when the data were analyzed via Sparse Modeling, compared to just one cell-cycle phase and one pathway when the data were analyzed with non-AI techniques. Only Semisupervised Learning detected trajectories between the predominant cardiomyocyte clusters in hearts collected on P28 from pigs that underwent apical resection (AR) on P1, and on P30 from pigs that underwent AR on P1 and MI on P28. 
In another dataset, the pig scRNAseq data were collected after the injection of CCND2-overexpression Human-induced Pluripotent Stem Cell-derived cardiomyocytes (CCND2hiPSC) into injured P28 pig heart; only the AI-based technique could demonstrate that the host cardiomyocytes increase proliferation through the HIPPO/YAP and MAPK signaling pathways. For the cluster, pathway/gene set enrichment, and trajectory analysis of scRNAseq datasets generated from studies of myocardial regeneration in mice and pigs, our AI-based toolkit identified results that non-AI techniques did not discover. These different results were validated and were important in explaining myocardial regeneration.
Affiliation(s)
- Thanh Nguyen
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
| | - Yuhua Wei
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
| | - Yuji Nakada
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
| | - Jake Y Chen
- Informatics Institute, School of Medicine, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
| | - Yang Zhou
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
| | - Gregory Walcott
- Department of Medicine, Cardiovascular Diseases, University of Alabama at Birmingham, Birmingham, AL, 35233, USA
| | - Jianyi Zhang
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, 35233, USA.
- Department of Medicine, Cardiovascular Diseases, University of Alabama at Birmingham, Birmingham, AL, 35233, USA.
- Department of Biomedical Engineering, School of Medicine and School of Engineering, University of Alabama at Birmingham, 1670 University Blvd, Volker Hall G094J, Birmingham, AL, 35233, USA.
| |
|
25
|
Durmaz A, Gurnari C, Hershberger CE, Pagliuca S, Daniels N, Awada H, Awada H, Adema V, Mori M, Ponvilawan B, Kubota Y, Kewan T, Bahaj WS, Barnard J, Scott J, Padgett RA, Haferlach T, Maciejewski JP, Visconte V. A multimodal analysis of genomic and RNA splicing features in myeloid malignancies. iScience 2023; 26:106238. [PMID: 36926651 PMCID: PMC10011742 DOI: 10.1016/j.isci.2023.106238] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/02/2022] [Revised: 01/12/2023] [Accepted: 02/15/2023] [Indexed: 02/22/2023] Open
Abstract
RNA splicing dysfunctions are more widespread than would be inferred by estimating only the effects of splicing factor mutations (SFMT) in myeloid neoplasia (MN). The genetic complexity of MN is amenable to machine learning (ML) strategies. We applied an integrative ML approach to identify co-varying features by combining genomic lesions (mutations, deletions, and copy number), exon-inclusion ratio as a measure of RNA splicing (percent spliced in, PSI), and gene expression (GE) of 1,258 MN and 63 normal controls. We identified 15 clusters based on mutations, GE, and PSI. Different PSI levels were present to various extents regardless of SFMT, suggesting that changes in RNA splicing were not strictly related to SFMT. Combination of PSI and GE further distinguished the features and identified PSI similarities and differences, common pathways, and expression signatures across clusters. Thus, multimodal features can resolve the complex architecture of MN and help identify convergent molecular and transcriptomic pathways amenable to therapies.
Affiliation(s)
- Arda Durmaz
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
- Systems Biology and Bioinformatics Department, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
| | - Carmelo Gurnari
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
- Department of Biomedicine and Prevention, PhD in Immunology, Molecular Medicine and Applied Biotechnology, University of Rome Tor Vergata, Rome, Italy
| | | | - Simona Pagliuca
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
- Department of Clinical Hematology, CHRU de Nancy, Nancy, France
| | - Noah Daniels
- Department of Cardiovascular & Metabolic Sciences, Cleveland Clinic, Cleveland, OH, USA
| | - Hassan Awada
- Roswell Park Comprehensive Cancer Center, Buffalo, NY, USA
| | - Hussein Awada
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Vera Adema
- MD Anderson Cancer Center, Houston, TX, USA
| | - Minako Mori
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Ben Ponvilawan
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Yasuo Kubota
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Tariq Kewan
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Waled S. Bahaj
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - John Barnard
- Department of Quantitative Health Sciences, Cleveland Clinic, Cleveland, OH, USA
| | - Jacob Scott
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
- Systems Biology and Bioinformatics Department, School of Medicine, Case Western Reserve University, Cleveland, OH, USA
| | - Richard A. Padgett
- Department of Cardiovascular & Metabolic Sciences, Cleveland Clinic, Cleveland, OH, USA
| | | | - Jaroslaw P. Maciejewski
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
| | - Valeria Visconte
- Department of Translational Hematology and Oncology Research, Taussig Cancer Institute, Cleveland Clinic, Cleveland, OH, USA
- Corresponding author
| |
|
26
|
Choi Y, Li R, Quon G. siVAE: interpretable deep generative models for single-cell transcriptomes. Genome Biol 2023; 24:29. [PMID: 36803416 PMCID: PMC9940350 DOI: 10.1186/s13059-023-02850-y] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 04/08/2022] [Accepted: 01/06/2023] [Indexed: 02/22/2023] Open
Abstract
Neural networks such as variational autoencoders (VAE) perform dimensionality reduction for the visualization and analysis of genomic data, but are limited in their interpretability: it is unknown which data features are represented by each embedding dimension. We present siVAE, a VAE that is interpretable by design, thereby enhancing downstream analysis tasks. Through interpretation, siVAE also identifies gene modules and hubs without explicit gene network inference. We use siVAE to identify gene modules whose connectivity is associated with diverse phenotypes such as iPSC neuronal differentiation efficiency and dementia, showcasing the wide applicability of interpretable generative models for genomic data analysis.
Affiliation(s)
- Yongin Choi
- Graduate Group in Biomedical Engineering, University of California, Davis, Davis, CA, USA
- Genome Center, University of California, Davis, Davis, CA, USA
| | - Ruoxin Li
- Genome Center, University of California, Davis, Davis, CA, USA
- Graduate Group in Biostatistics, University of California, Davis, Davis, CA, USA
| | - Gerald Quon
- Graduate Group in Biomedical Engineering, University of California, Davis, Davis, CA, USA.
- Genome Center, University of California, Davis, Davis, CA, USA.
- Department of Molecular and Cellular Biology, University of California, Davis, Davis, CA, USA.
| |
|
27
|
Zhang Y, Tran D, Nguyen T, Dascalu SM, Harris FC. A robust and accurate single-cell data trajectory inference method using ensemble pseudotime. BMC Bioinformatics 2023; 24:55. [PMID: 36803767 PMCID: PMC9942315 DOI: 10.1186/s12859-023-05179-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 11/11/2022] [Accepted: 02/09/2023] [Indexed: 02/22/2023] Open
Abstract
BACKGROUND Advances in single-cell RNA sequencing technology have enhanced the analysis of cell development by profiling heterogeneous cells at single-cell resolution. In recent years, many trajectory inference methods have been developed. They have focused on using graph methods to infer the trajectory from single-cell data and then calculating the geodesic distance as the pseudotime. However, these methods are vulnerable to errors in the inferred trajectory, and the calculated pseudotime therefore suffers from such errors. RESULTS We proposed a novel framework for trajectory inference called the single-cell data Trajectory inference method using Ensemble Pseudotime inference (scTEP). scTEP utilizes multiple clustering results to infer robust pseudotime and then uses the pseudotime to fine-tune the learned trajectory. We evaluated scTEP using 41 real scRNA-seq data sets, all of which had a ground-truth developmental trajectory. We compared scTEP with state-of-the-art methods using the aforementioned data sets. Experiments on real linear and non-linear data sets demonstrate that scTEP performed best on more data sets than any other method. scTEP also achieved a higher average and lower variance on most metrics than other state-of-the-art methods. In terms of trajectory inference capacity, scTEP outperforms those methods. In addition, scTEP is more robust to the unavoidable errors resulting from clustering and dimension reduction. CONCLUSION scTEP demonstrates that utilizing multiple clustering results for the pseudotime inference procedure enhances its robustness. Furthermore, robust pseudotime strengthens the accuracy of trajectory inference, which is the most crucial component in the pipeline. scTEP is available at https://cran.r-project.org/package=scTEP.
Affiliation(s)
- Yifan Zhang
- Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA.
| | - Duc Tran
- Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
| | - Tin Nguyen
- Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
| | - Sergiu M. Dascalu
- Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
| | - Frederick C. Harris
- Department of Computer Science and Engineering, University of Nevada, Reno, Reno, NV, USA
| |
|
28
|
Yu Z, Su Y, Lu Y, Yang Y, Wang F, Zhang S, Chang Y, Wong KC, Li X. Topological identification and interpretation for single-cell gene regulation elucidation across multiple platforms using scMGCA. Nat Commun 2023; 14:400. [PMID: 36697410 PMCID: PMC9877026 DOI: 10.1038/s41467-023-36134-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2022] [Accepted: 01/16/2023] [Indexed: 01/26/2023] Open
Abstract
Single-cell RNA sequencing provides high-throughput gene expression information to explore cellular heterogeneity at the individual cell level. A major challenge in characterizing high-throughput gene expression data arises from dimensionality and the prevalence of dropout events. To address these concerns, we develop a deep graph learning method, scMGCA, for single-cell data analysis. scMGCA is based on a graph-embedding autoencoder that simultaneously learns cell-cell topology representation and cluster assignments. We show that scMGCA is accurate and effective for cell segregation and batch effect correction, outperforming other state-of-the-art models across multiple platforms. In addition, we perform genomic interpretation on the key compressed transcriptomic space of the graph-embedding autoencoder to demonstrate the underlying gene regulation mechanism. We demonstrate that in a pancreatic ductal adenocarcinoma dataset, scMGCA successfully provides annotations on the specific cell types and reveals differential gene expression levels across multiple tumor-associated and cell signalling pathways.
Affiliation(s)
- Zhuohan Yu
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yanchi Su
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yifu Lu
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Yuning Yang
- Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Toronto, ON, Canada
| | - Fuzhou Wang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Shixiong Zhang
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China
| | - Yi Chang
- School of Artificial Intelligence, Jilin University, Jilin, China
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Hong Kong SAR, China.
| | - Xiangtao Li
- School of Artificial Intelligence, Jilin University, Jilin, China.
| |
|
29
|
Wang J, Xia J, Wang H, Su Y, Zheng CH. scDCCA: deep contrastive clustering for single-cell RNA-seq data based on auto-encoder network. Brief Bioinform 2023; 24:6984787. [PMID: 36631401 DOI: 10.1093/bib/bbac625] [Citation(s) in RCA: 7] [Impact Index Per Article: 7.0] [Received: 10/29/2022] [Revised: 12/12/2022] [Accepted: 12/19/2022] [Indexed: 01/13/2023] Open
Abstract
Advances in single-cell ribonucleic acid sequencing (scRNA-seq) allow researchers to explore cellular heterogeneity and human diseases at cell resolution. Cell clustering is a prerequisite in scRNA-seq analysis since it can recognize cell identities. However, the high dimensionality, noise and significant sparsity of scRNA-seq data make this a major challenge. Although many methods have emerged, they still fail to fully explore the intrinsic properties of cells and the relationship among cells, which seriously affects the downstream clustering performance. Here, we propose a new deep contrastive clustering algorithm called scDCCA. It integrates a denoising auto-encoder and a dual contrastive learning module into a deep clustering framework to extract valuable features and realize cell clustering. Specifically, to better characterize and learn data representations robustly, scDCCA utilizes a denoising Zero-Inflated Negative Binomial model-based auto-encoder to extract low-dimensional features. Meanwhile, scDCCA incorporates a dual contrastive learning module to capture the pairwise proximity of cells. By increasing the similarities between positive pairs and the differences between negative ones, the contrasts at both the instance and the cluster level help the model learn more discriminative features and achieve better cell segregation. Furthermore, scDCCA joins feature learning with clustering, which realizes representation learning and cell clustering in an end-to-end manner. Experimental results on 14 real datasets validate that scDCCA outperforms eight state-of-the-art methods in terms of accuracy, generalizability, scalability and efficiency. Cell visualization and biological analysis demonstrate that scDCCA significantly improves clustering and facilitates downstream analysis for scRNA-seq data. The code is available at https://github.com/WJ319/scDCCA.
Affiliation(s)
- Jing Wang
- Anhui Provincial Key Laboratory of Multimodal Cognitive Computation, School of Computer Science and Technology, Anhui University, Hefei, China
| | - Junfeng Xia
- Institutes of Physical Science and Information Technology, Anhui University, Hefei, China
| | - Haiyun Wang
- School of Mathematics and Systems Science, Xinjiang University, Urumqi, China
| | - Yansen Su
- School of Artificial Intelligence, Anhui University, Hefei, China
| | - Chun-Hou Zheng
- School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
30
|
Wang HY, Zhao JP, Zheng CH, Su YS. scGMAAE: Gaussian mixture adversarial autoencoders for diversification analysis of scRNA-seq data. Brief Bioinform 2023; 24:6966535. [PMID: 36592058 DOI: 10.1093/bib/bbac585] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Received: 10/05/2022] [Revised: 11/14/2022] [Accepted: 11/29/2022] [Indexed: 01/03/2023] Open
Abstract
The progress of single-cell RNA sequencing (scRNA-seq) has led to a large amount of scRNA-seq data, which are widely used in biomedical research. The noise in the raw data and the tens of thousands of genes make it challenging to capture the real structure and effective information of scRNA-seq data. Most existing single-cell analysis methods assume that the low-dimensional embedding of the raw data belongs to a Gaussian distribution or a low-dimensional nonlinear space without any prior information, which greatly limits the flexibility and controllability of the model. In addition, many existing methods incur a high computational cost, which makes them difficult to apply to large-scale datasets. Here, we design and develop a deep generative model named Gaussian mixture adversarial autoencoders (scGMAAE), which assumes that the low-dimensional embeddings of different types of cells follow different Gaussian distributions and integrates Bayesian variational inference and adversarial training, so as to give an interpretable latent representation of complex data and discover the statistical distribution of different types of cells. scGMAAE offers good controllability, interpretability and scalability. Therefore, it can process large-scale datasets in a short time and give competitive results. scGMAAE outperforms existing methods in several ways, including dimensionality reduction visualization, cell clustering, differential expression analysis and batch effect removal. Importantly, compared with most deep learning methods, scGMAAE requires fewer iterations to generate the best results.
Affiliation(s)
- Hai-Yun Wang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Jian-Ping Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- Institute of Mathematics and Physics, Xinjiang University, Urumqi, China
| | - Chun-Hou Zheng
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
- School of Artificial Intelligence, Anhui University, Hefei, China
| | - Yan-Sen Su
- School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
31
|
Liu Y, Li HD, Xu Y, Liu YW, Peng X, Wang J. IsoCell: An Approach to Enhance Single Cell Clustering by Integrating Isoform-Level Expression Through Orthogonal Projection. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:465-475. [PMID: 35100120 DOI: 10.1109/tcbb.2022.3147193] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
Single cell RNA sequencing (scRNA-seq) provides a powerful approach for profiling transcriptomes at single cell resolution. An essential application of scRNA-seq is the discovery of cell types with the aid of clustering analysis. Currently, existing single cell clustering methods are exclusively based on gene-level expression data, without considering alternative splicing information. It has been shown that alternative splicing has an important influence on biological processes such as cell differentiation and cell cycle. We therefore hypothesize that adding information about alternative splicing may help enhance single cell clustering. This motivates us to develop a way to integrate isoform-level expression and gene-level expression. We report an approach to enhance single cell clustering by integrating isoform-level expression through orthogonal projection. First, we construct an orthogonal projection matrix based on gene expression data. Second, isoforms are projected to the gene space to remove the redundant information between them. Third, isoform selection is performed based on the residual of the projected expression and the selected isoforms are combined with gene expression data for subsequent clustering. We applied our method to sixteen scRNA-seq datasets. We find that alternative splicing contains differential information among cell types and can be integrated to enhance single cell clustering. Compared with using only gene-level expression data, the integration of isoform-level expression leads to better clustering performances for most of the datasets. The integration of isoform-level expression also has potential in the detection of novel cell subgroups. Our study shows that integrating isoform and gene-level expression is a promising way to improve single cell clustering. The IsoCell R package is freely available at both Github (https://github.com/genemine/IsoCell) and Zenodo (https://zenodo.org/record/4395707).
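The three steps named in this abstract (build a projection from gene expression, project isoforms onto the gene space, select isoforms by their residual) can be illustrated with a minimal sketch. This is not the IsoCell implementation: the function names, the use of Gram-Schmidt orthogonalization, the residual-norm score, and the `top_k` cutoff are all assumptions made for illustration, and plain Python is used only to keep the example self-contained.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def orthonormal_basis(columns, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the gene-expression columns."""
    basis = []
    for col in columns:
        r = list(col)
        for q in basis:
            c = dot(r, q)
            r = [ri - c * qi for ri, qi in zip(r, q)]
        norm = math.sqrt(dot(r, r))
        if norm > tol:  # skip columns already explained by earlier ones
            basis.append([ri / norm for ri in r])
    return basis

def residual(vec, basis):
    """Part of vec that is orthogonal to the gene space, i.e. vec minus its projection."""
    r = list(vec)
    for q in basis:
        c = dot(r, q)
        r = [ri - c * qi for ri, qi in zip(r, q)]
    return r

def select_isoforms(gene_cols, isoform_cols, top_k):
    """Return indices of the top_k isoforms with the largest residual norm (assumed score)."""
    basis = orthonormal_basis(gene_cols)
    scored = []
    for idx, iso in enumerate(isoform_cols):
        res = residual(iso, basis)
        scored.append((math.sqrt(dot(res, res)), idx))
    scored.sort(reverse=True)
    return sorted(idx for _, idx in scored[:top_k])

# Hypothetical toy data: each inner list is one feature measured across 4 cells.
gene_cols = [[1, 0, 1, 0], [0, 1, 0, 1]]
isoform_cols = [
    [2, 0, 2, 0],   # a multiple of gene 0: fully redundant
    [1, 0, -1, 0],  # orthogonal to both genes: carries new information
    [1, 1, 1, 1],   # sum of both genes: fully redundant
]
print(select_isoforms(gene_cols, isoform_cols, top_k=1))  # [1]
```

In this toy case only the second isoform has a non-zero residual, so it is the one that would be appended to the gene-level features before clustering.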
|
32
|
Li MM, Huang K, Zitnik M. Graph representation learning in biomedicine and healthcare. Nat Biomed Eng 2022; 6:1353-1369. [PMID: 36316368 PMCID: PMC10699434 DOI: 10.1038/s41551-022-00942-x] [Citation(s) in RCA: 30] [Impact Index Per Article: 15.0] [Received: 10/30/2021] [Accepted: 08/09/2022] [Indexed: 11/11/2022]
Abstract
Networks, or graphs, are universal descriptors of systems of interacting elements. In biomedicine and healthcare, they can represent, for example, molecular interactions, signalling pathways, disease co-morbidities or healthcare systems. In this Perspective, we posit that representation learning can realize principles of network medicine, discuss successes and current limitations of the use of representation learning on graphs in biomedicine and healthcare, and outline algorithmic strategies that leverage the topology of graphs to embed them into compact vectorial spaces. We argue that graph representation learning will keep pushing forward machine learning for biomedicine and healthcare applications, including the identification of genetic variants underlying complex traits, the disentanglement of single-cell behaviours and their effects on health, the assistance of patients in diagnosis and treatment, and the development of safe and effective medicines.
Affiliation(s)
- Michelle M Li
- Bioinformatics and Integrative Genomics Program, Harvard Medical School, Boston, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
| | - Kexin Huang
- Health Data Science Program, Harvard T.H. Chan School of Public Health, Boston, MA, USA
| | - Marinka Zitnik
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
- Harvard Data Science Initiative, Cambridge, MA, USA.
| |
|
33
|
Wang HY, Zhao JP, Su YS, Zheng CH. scCDG: A Method Based on DAE and GCN for scRNA-Seq Data Analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2022; 19:3685-3694. [PMID: 34752401 DOI: 10.1109/tcbb.2021.3126641] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 06/13/2023]
Abstract
Identifying cell types is one of the main goals of single-cell RNA sequencing (scRNA-seq) analysis, and clustering is a common method for this task. However, the massive amount of data and the excess noise level pose a challenge for single-cell clustering. To address this challenge, in this paper, we introduce a novel method named single-cell clustering based on denoising autoencoder and graph convolution network (scCDG), which consists of two core models. The first model is a denoising autoencoder (DAE) used to fit the data distribution for data denoising. The second model is a graph autoencoder using a graph convolution network (GCN), which projects the data into a low-dimensional (compressed) space while simultaneously preserving topological structure information and feature information in scRNA-seq data. Extensive analysis on seven real scRNA-seq datasets demonstrates that scCDG outperforms state-of-the-art methods in some research sub-fields, including single-cell clustering, visualization of the transcriptome landscape, and trajectory inference.
|
34
|
Brendel M, Su C, Bai Z, Zhang H, Elemento O, Wang F. Application of Deep Learning on Single-cell RNA Sequencing Data Analysis: A Review. Genomics, Proteomics & Bioinformatics 2022; 20:814-835. [PMID: 36528240 PMCID: PMC10025684 DOI: 10.1016/j.gpb.2022.11.011] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 01/23/2022] [Revised: 08/17/2022] [Accepted: 11/24/2022] [Indexed: 12/23/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a routinely used technique to quantify the gene expression profile of thousands of single cells simultaneously. Analysis of scRNA-seq data plays an important role in the study of cell states and phenotypes, and has helped elucidate biological processes, such as those occurring during the development of complex organisms, and improved our understanding of disease states, such as cancer, diabetes, and coronavirus disease 2019 (COVID-19). Deep learning, a recent advance of artificial intelligence that has been used to address many problems involving large datasets, has also emerged as a promising tool for scRNA-seq data analysis, as it has a capacity to extract informative and compact features from noisy, heterogeneous, and high-dimensional scRNA-seq data to improve downstream analysis. The present review aims at surveying recently developed deep learning techniques in scRNA-seq data analysis, identifying key steps within the scRNA-seq data analysis pipeline that have been advanced by deep learning, and explaining the benefits of deep learning over more conventional analytic tools. Finally, we summarize the challenges in current deep learning approaches faced within scRNA-seq data and discuss potential directions for improvements in deep learning algorithms for scRNA-seq data analysis.
Affiliation(s)
- Matthew Brendel
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Chang Su
- Department of Health Service Administration and Policy, Temple University, Philadelphia, PA 19122, USA.
| | - Zilong Bai
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Hao Zhang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Olivier Elemento
- Institute for Computational Biomedicine, Caryl and Israel Englander Institute for Precision Medicine, Department of Physiology and Biophysics, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA
| | - Fei Wang
- Department of Population Health Sciences, Weill Cornell Medicine, Cornell University, New York, NY 10065, USA.
| |
|
35
|
Han W, Cheng Y, Chen J, Zhong H, Hu Z, Chen S, Zong L, Hong L, Chan TF, King I, Gao X, Li Y. Self-supervised contrastive learning for integrative single cell RNA-seq data analysis. Brief Bioinform 2022; 23:6695268. [PMID: 36089561 PMCID: PMC9487595 DOI: 10.1093/bib/bbac377] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 05/10/2022] [Revised: 06/20/2022] [Indexed: 12/14/2022] Open
Abstract
We present a novel self-supervised Contrastive LEArning framework for single-cell ribonucleic acid (RNA)-sequencing (CLEAR) data representation and the downstream analysis. Compared with current methods, CLEAR overcomes the heterogeneity of the experimental data with a specifically designed representation learning task and thus can handle batch effects and dropout events simultaneously. It achieves superior performance on a broad range of fundamental tasks, including clustering, visualization, dropout correction, batch effect removal, and pseudo-time inference. The proposed method successfully identifies and illustrates inflammatory-related mechanisms in a COVID-19 disease study with 43 695 single cells from peripheral blood mononuclear cells.
Affiliation(s)
- Wenkai Han
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Yuqi Cheng
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
- Weill Cornell Graduate School of Medical Sciences, Weill Cornell Medicine, New York, NY, 10065, USA
| | - Jiayang Chen
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
| | - Huawen Zhong
- Biological and Environmental Sciences & Engineering Division (BESE), Red Sea Research Center (RSRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Zhihang Hu
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
| | - Siyuan Chen
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
| | - Licheng Zong
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
| | - Liang Hong
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
| | - Ting-Fung Chan
- School of Life Sciences and State Key Laboratory of Agrobiotechnology, The Chinese University of Hong Kong, Hong Kong SAR, China
| | - Irwin King
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
| | - Xin Gao
- Computer, Electrical and Mathematical Sciences and Engineering Division (CEMSE), Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, 23955, Saudi Arabia
- BioMap, Beijing, China
| | - Yu Li
- Department of Computer Science and Engineering (CSE), The Chinese University of Hong Kong (CUHK), Hong Kong SAR, China
- The CUHK Shenzhen Research Institute, Hi-Tech Park, Nanshan, Shenzhen, 518057, China
| |
|
36
|
Li Z, Wang Y, Ganan-Gomez I, Colla S, Do KA. A machine learning-based method for automatically identifying novel cells in annotating single-cell RNA-seq data. Bioinformatics 2022; 38:4885-4892. [PMID: 36083008 PMCID: PMC9801963 DOI: 10.1093/bioinformatics/btac617] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2021] [Revised: 09/06/2022] [Accepted: 09/08/2022] [Indexed: 01/07/2023] Open
Abstract
MOTIVATION: Single-cell RNA sequencing (scRNA-seq) has been widely used to decompose complex tissues into functionally distinct cell types. The first and usually the most important step of scRNA-seq data analysis is to accurately annotate the cell labels. In recent years, many supervised annotation methods have been developed and shown to be more convenient and accurate than unsupervised cell clustering. One challenge faced by all the supervised annotation methods is the identification of novel cell types, defined as cell types that are absent from the training data and exist only in the testing data. Existing methods usually label the cells simply based on the correlation coefficients or confidence scores, which sometimes results in an excessive number of unlabeled cells. RESULTS: We developed a straightforward yet effective method combining an autoencoder with iterative feature selection to automatically identify novel cells from scRNA-seq data. Our method trains an autoencoder with the labeled training data and applies the autoencoder to the testing data to obtain reconstruction errors. By iteratively selecting features that demonstrate a bi-modal pattern and reclustering the cells using the selected features, our method can accurately identify novel cells that are not present in the training data. We further combined this approach with a support vector machine to provide a complete solution for annotating the full range of cell types. Extensive numerical experiments using five real scRNA-seq datasets demonstrated favorable performance of the proposed method over existing methods serving similar purposes. AVAILABILITY AND IMPLEMENTATION: Our R software package CAMLU is publicly available through the Zenodo repository (https://doi.org/10.5281/zenodo.7054422) or GitHub repository (https://github.com/ziyili20/CAMLU). SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Ziyi Li
- To whom correspondence should be addressed.
| | - Yizhuo Wang
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
| | - Irene Ganan-Gomez
- Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
| | - Simona Colla
- Department of Leukemia, The University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
| | - Kim-Anh Do
- To whom correspondence should be addressed.
| |
|
37
|
Ke M, Elshenawy B, Sheldon H, Arora A, Buffa FM. Single cell RNA-sequencing: A powerful yet still challenging technology to study cellular heterogeneity. Bioessays 2022; 44:e2200084. [PMID: 36068142 DOI: 10.1002/bies.202200084] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 04/29/2022] [Revised: 08/18/2022] [Accepted: 08/19/2022] [Indexed: 11/11/2022]
Abstract
Almost all biomedical research to date has relied upon mean measurements from cell populations; however, it is well established that what is observed at this macroscopic level can be the result of many interactions of several different single cells. Thus, the observable macroscopic 'average' cannot outright be used as representative of the 'average cell'. Rather, it is the emergent behaviour resulting from the actions and interactions of many different cells. Single-cell RNA sequencing (scRNA-Seq) enables the comparison of the transcriptomes of individual cells. This provides high-resolution maps of the dynamic cellular programmes, allowing us to answer fundamental biological questions on their function and evolution. It also allows us to address medical questions such as the role of rare cell populations contributing to disease progression and therapeutic resistance. Furthermore, it provides an understanding of context-specific dependencies, namely the behaviour and function that a cell has in a specific context, which can be crucial to understanding some complex diseases, such as diabetes, cardiovascular disease and cancer. Here, we provide an overview of scRNA-Seq, including a comparative review of emerging technologies and computational pipelines. We discuss the current and emerging applications and focus on tumour heterogeneity, a clear example of how scRNA-Seq can provide new understanding of a complex disease. Additionally, we review the limitations and highlight the need for powerful computational pipelines and reproducible protocols for the broader acceptance of this technique in basic and clinical research.
Affiliation(s)
- May Ke
- Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
| | - Badran Elshenawy
- Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
| | - Helen Sheldon
- Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
| | - Anjali Arora
- Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
| | - Francesca M Buffa
- Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK; Department of Computing Sciences, Bocconi University, Milano, Italy; Institute for Data Science and Analytics, Bocconi University, Milano, Italy
| |
|
38
|
Nguyen T, Wei Y, Nakada Y, Zhou Y, Zhang J. Cardiomyocyte Cell-Cycle Regulation in Neonatal Large Mammals: Single Nucleus RNA-Sequencing Data Analysis via an Artificial-Intelligence–Based Pipeline. Front Bioeng Biotechnol 2022; 10:914450. [PMID: 35860330 PMCID: PMC9289371 DOI: 10.3389/fbioe.2022.914450] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 04/06/2022] [Accepted: 05/18/2022] [Indexed: 11/20/2022] Open
Abstract
Adult mammalian cardiomyocytes have very limited capacity to proliferate and repair the heart after myocardial infarction. However, when apical resection (AR) was performed in pig hearts on postnatal day (P) 1 (ARP1) and acute myocardial infarction (MI) was induced on P28 (MIP28), the animals recovered with no evidence of myocardial scarring or decline in contractile performance. Furthermore, the repair process appeared to be driven by cardiomyocyte proliferation, but the regulatory molecules that govern the ARP1-induced enhancement of myocardial recovery remain unclear. Single-nucleus RNA sequencing (snRNA-seq) data collected from fetal pig hearts and the hearts of pigs that underwent ARP1, MIP28, both ARP1 and MI, or neither myocardial injury were evaluated via autoencoder, cluster analysis, sparse learning, and semisupervised learning. Ten clusters of cardiomyocytes (CM1–CM10) were identified across all experimental groups and time points. CM1 was only observed in ARP1 hearts on P28 and was enriched for the expression of T-box transcription factors 5 and 20 (TBX5 and TBX20, respectively), Erb-B2 receptor tyrosine kinase 4 (ERBB4), and G Protein-Coupled Receptor Kinase 5 (GRK5), as well as genes associated with the proliferation and growth of cardiac muscle. CM1 cardiomyocytes also expressed glycolysis genes at high levels and adrenergic signaling genes at low levels, which suggested that CM1 cells were immature cardiomyocytes. Thus, we have identified a cluster of cardiomyocytes, CM1, in neonatal pig hearts that appeared to be generated in response to AR injury on P1 and may have been primed for CM cell-cycle activation and proliferation by the upregulation of TBX5, TBX20, ERBB4, and GRK5.
Affiliation(s)
- Thanh Nguyen
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Yuhua Wei
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Yuji Nakada
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Yang Zhou
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, United States
| | - Jianyi Zhang
- Department of Biomedical Engineering, University of Alabama at Birmingham, Birmingham, AL, United States
- Cardiovascular Diseases, Department of Medicine, University of Alabama at Birmingham, Birmingham, AL, United States
- *Correspondence: Jianyi Zhang
| |
|
39
|
Wang HY, Zhao JP, Zheng CH, Su YS. scCNC: A method based on Capsule Network for Clustering scRNA-seq Data. Bioinformatics 2022; 38:3703-3709. [PMID: 35699473 DOI: 10.1093/bioinformatics/btac393] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/11/2021] [Revised: 05/28/2022] [Accepted: 06/11/2022] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: A large number of studies have shown that clustering is a crucial step in scRNA-seq analysis. Most existing methods are based on unsupervised learning without the prior exploitation of any domain knowledge, which does not utilize available gold-standard labels. When confronted by the high dimensionality and general dropout events of scRNA-seq data, purely unsupervised clustering methods may not produce biologically interpretable clusters, which complicates cell type assignment. RESULTS: In this paper, we propose a semi-supervised clustering method based on a capsule network, named scCNC, that integrates domain knowledge into the clustering step. Significantly, we also propose a Semi-supervised Greedy Iterative Training (SGIT) method used to train the whole network. Experiments on some real scRNA-seq datasets show that scCNC can significantly improve clustering performance and facilitate downstream analyses. AVAILABILITY: The source code of scCNC is freely available at https://github.com/WHY-17/scCNC. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Hai-Yun Wang
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China
| | - Jian-Ping Zhao
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China; Institute of Mathematics and Physics, Xinjiang University, Urumqi, China
| | - Chun-Hou Zheng
- College of Mathematics and System Sciences, Xinjiang University, Urumqi, China; School of Artificial Intelligence, Anhui University, Hefei, China
| | - Yan-Sen Su
- School of Artificial Intelligence, Anhui University, Hefei, China
| |
|
40
|
scEFSC: Accurate Single-cell RNA-seq Data Analysis via Ensemble Consensus Clustering Based on Multiple Feature Selections. Comput Struct Biotechnol J 2022; 20:2181-2197. [PMID: 35615016 PMCID: PMC9108753 DOI: 10.1016/j.csbj.2022.04.023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2021] [Revised: 04/09/2022] [Accepted: 04/17/2022] [Indexed: 11/21/2022] Open
|
41
|
Wang H, Ma X. Learning deep features and topological structure of cells for clustering of scRNA-sequencing data. Brief Bioinform 2022; 23:6549863. [PMID: 35302164 DOI: 10.1093/bib/bbac068] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/29/2021] [Revised: 01/10/2022] [Accepted: 02/09/2022] [Indexed: 02/01/2023] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) measures the gene transcriptome at the cell level, paving the way for the identification of cell subpopulations. Although deep learning has been successfully applied to scRNA-seq data, these algorithms are criticized for their undesirable performance and the poor interpretability of patterns because of the noise, high dimensionality and extraordinary sparsity of scRNA-seq data. To address these issues, a novel deep learning subspace clustering algorithm (aka scGDC) for cell types in scRNA-seq data is proposed, which simultaneously learns the deep features and topological structure of cells. Specifically, scGDC extends the auto-encoder by introducing a self-representation layer to extract deep features of cells, and learns an affinity graph of cells, which provides a better and more comprehensive strategy to characterize the structure of cell types. To address the heterogeneity of scRNA-seq data, scGDC projects cells of various types onto different subspaces, where types, particularly rare cell types, are well discriminated by utilizing generative adversarial learning. Furthermore, scGDC joins deep feature extraction, structural learning and cell type discovery, where features of cells are extracted under the guidance of cell types, thereby improving the performance of algorithms. A total of 15 scRNA-seq datasets from various tissues and organisms with the number of cells ranging from 56 to 63 103 are adopted to validate the performance of algorithms, and experimental results demonstrate that scGDC significantly outperforms 14 state-of-the-art methods in terms of various measurements (an average improvement of 25.51%), where (rare) cell types are significantly associated with the topology of the affinity graph of cells. The proposed model and algorithm provide an effective strategy for the analysis of scRNA-seq data (the software is written in Python and is freely available for academic use at https://github.com/xkmaxidian/scGDC).
Affiliation(s)
- Haiyue Wang
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
| | - Xiaoke Ma
- School of Computer Science and Technology, Xidian University, Xi'an, 710071, China
| |
|
42
|
Abram KJ, McCloskey D. A Comprehensive Evaluation of Metabolomics Data Preprocessing Methods for Deep Learning. Metabolites 2022; 12:metabo12030202. [PMID: 35323644 PMCID: PMC8948616 DOI: 10.3390/metabo12030202] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/28/2021] [Revised: 02/15/2022] [Accepted: 02/17/2022] [Indexed: 12/04/2022] Open
Abstract
Machine learning has greatly advanced over the past decade, owing to advances in algorithmic innovations, hardware acceleration, and benchmark datasets to train on domains such as computer vision, natural-language processing, and more recently the life sciences. In particular, the subfield of machine learning known as deep learning has found applications in genomics, proteomics, and metabolomics. However, a thorough assessment of how the data preprocessing methods required for the analysis of life science data affect the performance of deep learning is lacking. This work contributes to filling that gap by assessing the impact of commonly used as well as newly developed methods employed in data preprocessing workflows for metabolomics that span from raw data to processed data. The results from these analyses are summarized into a set of best practices that can be used by researchers as a starting point for downstream classification and reconstruction tasks using deep learning.
|
43
|
A novel method for single-cell data imputation using subspace regression. Sci Rep 2022; 12:2697. [PMID: 35177662 PMCID: PMC8854597 DOI: 10.1038/s41598-022-06500-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 07/15/2021] [Accepted: 01/27/2022] [Indexed: 12/13/2022] Open
Abstract
Recent advances in biochemistry and single-cell RNA sequencing (scRNA-seq) have allowed us to monitor the biological systems at the single-cell resolution. However, the low capture of mRNA material within individual cells often leads to inaccurate quantification of genetic material. Consequently, a significant amount of expression values are reported as missing, which are often referred to as dropouts. To overcome this challenge, we develop a novel imputation method, named single-cell Imputation via Subspace Regression (scISR), that can reliably recover the dropout values of scRNA-seq data. The scISR method first uses a hypothesis-testing technique to identify zero-valued entries that are most likely affected by dropout events and then estimates the dropout values using a subspace regression model. Our comprehensive evaluation using 25 publicly available scRNA-seq datasets and various simulation scenarios against five state-of-the-art methods demonstrates that scISR is better than other imputation methods in recovering scRNA-seq expression profiles via imputation. scISR consistently improves the quality of cluster analysis regardless of dropout rates, normalization techniques, and quantification schemes. The source code of scISR can be found on GitHub at https://github.com/duct317/scISR.
|
44
|
Wang Y, Wong KC, Li X. Exploring high-throughput biomolecular data with multiobjective robust continuous clustering. Inf Sci (N Y) 2022. [DOI: 10.1016/j.ins.2021.11.030] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
|
45
|
Yin Q, Wang Y, Guan J, Ji G. scIAE: an integrative autoencoder-based ensemble classification framework for single-cell RNA-seq data. Brief Bioinform 2021; 23:6463428. [PMID: 34913057 DOI: 10.1093/bib/bbab508] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Received: 08/15/2021] [Revised: 10/28/2021] [Accepted: 11/04/2021] [Indexed: 12/12/2022] Open
Abstract
Single-cell RNA sequencing (scRNA-seq) allows quantitative analysis of gene expression at the level of single cells, beneficial to study cell heterogeneity. The recognition of cell types facilitates the construction of cell atlas in complex tissues or organisms, which is the basis of almost all downstream scRNA-seq data analyses. Using disease-related scRNA-seq data to perform the prediction of disease status can facilitate the specific diagnosis and personalized treatment of disease. Since single-cell gene expression data are high-dimensional and sparse with dropouts, we propose scIAE, an integrative autoencoder-based ensemble classification framework, to firstly perform multiple random projections and apply integrative and devisable autoencoders (integrating stacked, denoising and sparse autoencoders) to obtain compressed representations. Then base classifiers are built on the lower-dimensional representations and the predictions from all base models are integrated. The comparison of scIAE and common feature extraction methods shows that scIAE is effective and robust, independent of the choice of dimension, which is beneficial to subsequent cell classification. By testing scIAE on different types of data and comparing it with existing general and single-cell-specific classification methods, it is proven that scIAE has a great classification power in cell type annotation intradataset, across batches, across platforms and across species, and also disease status prediction. The architecture of scIAE is flexible and devisable, and it is available at https://github.com/JGuan-lab/scIAE.
Affiliation(s)
- Qingyang Yin
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China; Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
| | - Yang Wang
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China
| | - Jinting Guan
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China
| | - Guoli Ji
- Department of Automation, Xiamen University, Xiamen, Fujian 361102, China; National Institute for Data Science in Health and Medicine, Xiamen University, Xiamen, Fujian 361102, China
| |
|
46
|
Sparsely Connected Autoencoders: A Multi-Purpose Tool for Single Cell omics Analysis. Int J Mol Sci 2021; 22:ijms222312755. [PMID: 34884559 PMCID: PMC8657975 DOI: 10.3390/ijms222312755] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/27/2021] [Revised: 11/12/2021] [Accepted: 11/23/2021] [Indexed: 02/02/2023] Open
Abstract
Background: Biological processes are based on complex networks of cells and molecules. Single cell multi-omics is a new tool aiming to provide new insights into the complex network of events controlling the functionality of the cell. Methods: Since single cell technologies provide many sample measurements, they are the ideal environment for the application of Deep Learning and Machine Learning approaches. An autoencoder is composed of an encoder and a decoder sub-model. An autoencoder is a very powerful tool in data compression and noise removal. However, the decoder model remains a black box from which it is impossible to depict the contribution of the single input elements. We have recently developed a new class of autoencoders, called Sparsely Connected Autoencoders (SCA), which have the advantage of providing a controlled association among the input layer and the decoder module. This new architecture has the benefit that the decoder model is not a black box anymore and can be used to depict new biologically interesting features from single cell data. Results: Here, we show that the SCA hidden layer can grab new information usually hidden in single cell data, such as clustering on meta-features that are difficult (i.e. transcription factor expression) or technically not possible (i.e. miRNA expression) to depict in single cell RNAseq data. Furthermore, the SCA representation of cell clusters has the advantage of simulating a conventional bulk RNAseq, which is a data transformation allowing the identification of similarity among independent experiments. Conclusions: In our opinion, SCA represents the bioinformatics version of a universal “Swiss-knife” for the extraction of hidden knowledgeable features from single cell omics data.
|
47
|
Park Y, Hauschild AC, Heider D. Transfer learning compensates limited data, batch effects and technological heterogeneity in single-cell sequencing. NAR Genom Bioinform 2021; 3:lqab104. [PMID: 34805988 PMCID: PMC8598306 DOI: 10.1093/nargab/lqab104] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 07/31/2021] [Revised: 10/07/2021] [Accepted: 10/18/2021] [Indexed: 12/18/2022] Open
Abstract
Tremendous advances in next-generation sequencing technology have enabled the accumulation of large amounts of omics data in various research areas over the past decade. However, study limitations due to small sample sizes, especially in rare disease clinical research, technological heterogeneity and batch effects limit the applicability of traditional statistics and machine learning analysis. Here, we present a meta-transfer learning approach to transfer knowledge from big data and reduce the search space in data with small sample sizes. Few-shot learning algorithms integrate meta-learning to overcome data scarcity and data heterogeneity by transferring molecular pattern recognition models from datasets of unrelated domains. We explore few-shot learning models with the large-scale public TCGA (The Cancer Genome Atlas) and GTEx datasets, and demonstrate their potential as pre-training datasets in other molecular pattern recognition tasks. Our results show that meta-transfer learning is very effective for datasets with a limited sample size. Furthermore, we show that our approach can transfer knowledge across technological heterogeneity, for example, from bulk cell to single-cell data. Our approach can overcome study size constraints, batch effects and technical limitations in analyzing single-cell data by leveraging existing bulk-cell sequencing data.
Affiliation(s)
- Youngjun Park
- Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35039, Germany
| | - Anne-Christin Hauschild
- Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35039, Germany
| | - Dominik Heider
- Data Science in Biomedicine, Faculty of Mathematics and Computer Science, Philipps-University of Marburg, Marburg 35039, Germany
| |
|
48
|
Asada K, Takasawa K, Machino H, Takahashi S, Shinkai N, Bolatkan A, Kobayashi K, Komatsu M, Kaneko S, Okamoto K, Hamamoto R. Single-Cell Analysis Using Machine Learning Techniques and Its Application to Medical Research. Biomedicines 2021; 9:biomedicines9111513. [PMID: 34829742 PMCID: PMC8614827 DOI: 10.3390/biomedicines9111513] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/17/2021] [Revised: 10/06/2021] [Accepted: 10/19/2021] [Indexed: 01/14/2023] Open
Abstract
In recent years, the diversity of cancer cells in tumor tissues as a result of intratumor heterogeneity has attracted attention. In particular, the development of single-cell analysis technology has made a significant contribution to the field; technologies that are centered on single-cell RNA sequencing (scRNA-seq) have been reported to analyze cancer constituent cells, identify cell groups responsible for therapeutic resistance, and analyze gene signatures of resistant cell groups. However, although single-cell analysis is a powerful tool, various issues have been reported, including batch effects and transcriptional noise due to gene expression variation and mRNA degradation. To overcome these issues, machine learning techniques are currently being introduced for single-cell analysis, and promising results are being reported. In addition, machine learning has also been used in various ways for single-cell analysis, such as single-cell assay of transposase accessible chromatin sequencing (ATAC-seq), chromatin immunoprecipitation sequencing (ChIP-seq) analysis, and multi-omics analysis; thus, it contributes to a deeper understanding of the characteristics of human diseases, especially cancer, and supports clinical applications. In this review, we present a comprehensive introduction to the implementation of machine learning techniques in medical research for single-cell analysis, and discuss their usefulness and future potential.
Affiliation(s)
- Ken Asada
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
  - Correspondence: K.A. and R.H.; Tel.: +81-3-3547-5271 (R.H.)
- Ken Takasawa
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
- Hidenori Machino
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
- Satoshi Takahashi
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
- Norio Shinkai
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
  - Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8510, Japan
- Amina Bolatkan
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
- Kazuma Kobayashi
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
- Masaaki Komatsu
  - Cancer Translational Research Team, RIKEN Center for Advanced Intelligence Project, 1-4-1 Nihonbashi, Chuo-ku, Tokyo 103-0027, Japan
- Syuzo Kaneko
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
- Koji Okamoto
  - Division of Cancer Differentiation, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
- Ryuji Hamamoto
  - Department of NCC Cancer Science, Graduate School of Medical and Dental Sciences, Tokyo Medical and Dental University, 1-5-45 Yushima, Bunkyo-ku, Tokyo 113-8510, Japan
  - Division of Medical AI Research and Development, National Cancer Center Research Institute, 5-1-1 Tsukiji, Chuo-ku, Tokyo 104-0045, Japan
  - Correspondence: K.A. and R.H.; Tel.: +81-3-3547-5271 (R.H.)
|
49
|
Fujisawa K, Shimo M, Taguchi YH, Ikematsu S, Miyata R. PCA-based unsupervised feature extraction for gene expression analysis of COVID-19 patients. Sci Rep 2021; 11:17351. [PMID: 34456333 PMCID: PMC8403676 DOI: 10.1038/s41598-021-95698-w] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 02/28/2021] [Accepted: 07/23/2021] [Indexed: 01/08/2023] Open
Abstract
Coronavirus disease 2019 (COVID-19) is raging worldwide. This potentially fatal infectious disease is caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2). However, the complete mechanism of COVID-19 is not well understood. Therefore, we analyzed gene expression profiles of COVID-19 patients to identify disease-related genes through an innovative machine learning method that enables a data-driven strategy for gene selection from a data set with a small number of samples and many candidates. Principal-component-analysis-based unsupervised feature extraction (PCAUFE) was applied to the RNA expression profiles of 16 COVID-19 patients and 18 healthy control subjects. The results identified 123 genes as critical for COVID-19 progression from 60,683 candidate probes, including immune-related genes. The 123 genes were enriched in binding sites for transcription factors NFKB1 and RELA, which are involved in various biological phenomena such as immune response and cell survival: the primary mediator of canonical nuclear factor-kappa B (NF-κB) activity is the heterodimer RelA-p50. The genes were also enriched in histone modification H3K36me3, and they largely overlapped the target genes of NFKB1 and RELA. We found that the overlapping genes were downregulated in COVID-19 patients. These results suggest that canonical NF-κB activity was suppressed by H3K36me3 in COVID-19 patient blood.
Affiliation(s)
- Kota Fujisawa
  - School of Life Science and Technology, Tokyo Institute of Technology, Tokyo, 152-8550, Japan
- Mamoru Shimo
  - Graduate School of Engineering and Science, University of the Ryukyus, Okinawa, 903-0213, Japan
- Y-H Taguchi
  - Department of Physics, Chuo University, Tokyo, 112-8551, Japan
- Shinya Ikematsu
  - Department of Bioresources Engineering, National Institute of Technology, Okinawa College, Okinawa, 905-2192, Japan
- Ryota Miyata
  - Faculty of Engineering, University of the Ryukyus, Okinawa, 903-0213, Japan
|
50
|
Park Y, Heider D, Hauschild AC. Integrative Analysis of Next-Generation Sequencing for Next-Generation Cancer Research toward Artificial Intelligence. Cancers (Basel) 2021; 13:3148. [PMID: 34202427 PMCID: PMC8269018 DOI: 10.3390/cancers13133148] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 05/20/2021] [Revised: 06/16/2021] [Accepted: 06/21/2021] [Indexed: 12/18/2022] Open
Abstract
The rapid improvement of next-generation sequencing (NGS) technologies and their application in large-scale cohorts in cancer research led to common challenges of big data. It opened a new research area incorporating systems biology and machine learning. As large-scale NGS data accumulated, sophisticated data analysis methods became indispensable. In addition, NGS data have been integrated with systems biology to build better predictive models to determine the characteristics of tumors and tumor subtypes. Therefore, various machine learning algorithms were introduced to identify underlying biological mechanisms. In this work, we review novel technologies developed for NGS data analysis, and we describe how these computational methodologies integrate systems biology and omics data. Subsequently, we discuss how deep neural networks outperform other approaches, the potential of graph neural networks (GNN) in systems biology, and the limitations in NGS biomedical research. To reflect on the various challenges and corresponding computational solutions, we will discuss the following three topics: (i) molecular characteristics, (ii) tumor heterogeneity, and (iii) drug discovery. We conclude that machine learning and network-based approaches can add valuable insights and build highly accurate models. However, a well-informed choice of learning algorithm and biological network information is crucial for the success of each specific research question.
Affiliation(s)
- Youngjun Park
  - Department of Mathematics and Computer Science, Philipps-University of Marburg, 35032 Marburg, Germany
- Dominik Heider
  - Department of Mathematics and Computer Science, Philipps-University of Marburg, 35032 Marburg, Germany
- Anne-Christin Hauschild
  - Department of Mathematics and Computer Science, Philipps-University of Marburg, 35032 Marburg, Germany
  - Department of Medical Informatics, University Medical Center Göttingen, 37075 Göttingen, Germany
|