1
Theunissen L, Mortier T, Saeys Y, Waegeman W. Evaluation of out-of-distribution detection methods for data shifts in single-cell transcriptomics. Brief Bioinform 2025; 26:bbaf239. [PMID: 40439669 PMCID: PMC12121363 DOI: 10.1093/bib/bbaf239] [Received: 01/28/2025] [Revised: 04/01/2025] [Accepted: 05/05/2025] [Indexed: 06/02/2025]
Abstract
Automatic cell-type annotation methods assign cell-type labels to new, unlabeled datasets by leveraging relationships from a reference RNA-seq atlas. However, new datasets may include labels absent from the reference dataset or exhibit feature distributions that diverge from it. These scenarios can significantly affect the reliability of cell-type predictions, a factor often overlooked in current automatic annotation methods. The field of out-of-distribution (OOD) detection, primarily focused on computer vision, addresses the identification of instances that differ from the training distribution. Therefore, the implementation of OOD methods in the context of novel cell-type annotation and data shift detection for single-cell transcriptomics may enhance annotation accuracy and trustworthiness. We evaluate six OOD detection methods (LogitNorm, MC dropout, Deep Ensembles, Energy-based OOD, Deep NN, and Posterior networks) for their annotation and OOD detection performance in both synthetic and real-life application settings. We show that OOD detection methods can accurately identify novel cell types and demonstrate potential to detect significant data shifts in non-integrated datasets. Moreover, we find that integration of the OOD datasets does not interfere with OOD detection of novel cell types.
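The abstract names energy-based OOD detection among the evaluated methods but does not give a formula. As a rough illustration only, the sketch below assumes the commonly used free-energy score computed from a classifier's logits, E(x) = -T * log(sum_k exp(f_k(x) / T)); the function names, the temperature default, and the threshold are hypothetical and are not taken from the paper.

```python
import math


def energy_score(logits, temperature=1.0):
    """Free-energy score of one logit vector: -T * log(sum_k exp(f_k / T)).

    Assumption: higher energy means the classifier saw nothing it
    recognises strongly, so the cell is more likely out-of-distribution.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max before exponentiating, for stability
    log_sum_exp = m + math.log(sum(math.exp(s - m) for s in scaled))
    return -temperature * log_sum_exp


def flag_ood(logits, threshold, temperature=1.0):
    """Return True when the energy exceeds a user-chosen threshold.

    The threshold is hypothetical; in practice it would be tuned on
    held-out in-distribution cells.
    """
    return energy_score(logits, temperature) > threshold


# One confident prediction (a large logit) and one flat prediction.
confident_cell = [10.0, 0.0, 0.0]
uncertain_cell = [0.1, 0.0, 0.0]
print(flag_ood(confident_cell, threshold=-5.0))
print(flag_ood(uncertain_cell, threshold=-5.0))
```

A cell with one dominant logit gets a strongly negative energy and stays below the threshold, while a cell with flat logits sits closer to zero and is flagged as a candidate novel cell type.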
Affiliation(s)
- Lauren Theunissen
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research and VIB Center for AI and Computational Biology (VIB.AI), 9000 Ghent, Belgium
  - Department of Data-analysis and Mathematical Modeling, Ghent University Faculty of Bioscience Engineering, 9000 Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University Faculty of Sciences, 9000 Ghent, Belgium
- Thomas Mortier
  - Department of Data-analysis and Mathematical Modeling, Ghent University Faculty of Bioscience Engineering, 9000 Ghent, Belgium
  - Department of Environment, Ghent University Faculty of Bioscience Engineering, 9000 Ghent, Belgium
- Yvan Saeys
  - Data Mining and Modeling for Biomedicine, VIB Center for Inflammation Research and VIB Center for AI and Computational Biology (VIB.AI), 9000 Ghent, Belgium
  - Department of Applied Mathematics, Computer Science and Statistics, Ghent University Faculty of Sciences, 9000 Ghent, Belgium
- Willem Waegeman
  - Department of Data-analysis and Mathematical Modeling, Ghent University Faculty of Bioscience Engineering, 9000 Ghent, Belgium
2
Traversa D, Chiara M. Mapping Cell Identity from scRNA-seq: A primer on computational methods. Comput Struct Biotechnol J 2025; 27:1559-1569. [PMID: 40270709 PMCID: PMC12017876 DOI: 10.1016/j.csbj.2025.03.051] [Received: 11/15/2024] [Revised: 03/29/2025] [Accepted: 03/31/2025] [Indexed: 04/25/2025]
Abstract
Single cell (sc) technologies mark a conceptual and methodological breakthrough in the way we study cells, the base units of life. Thanks to these technological developments, large-scale initiatives are currently under way to map all the cell types in the human body, with the ambitious aim of gaining a cell-level resolution of physiological development and disease. Owing to its broad applicability and ease of interpretation, scRNA-seq is probably the most common sc-based application. This assay uses high-throughput RNA sequencing to capture gene expression profiles at the sc level. Subsequently, under the assumption that differences in transcriptional programs correspond to distinct cellular identities, ad hoc computational methods are used to infer cell types from gene expression patterns. A wide array of computational methods has been developed for this task. However, depending on the underlying algorithmic approach and associated computational requirements, each method might have a specific range of application, with implications that are not always clear to the end user. Here we provide a concise overview of state-of-the-art computational methods for cell identity annotation in scRNA-seq, tailored for new users and non-computational scientists. To this end, we classify existing tools into five main categories and discuss their key strengths, limitations and range of application.
Affiliation(s)
- Daniele Traversa
  - Department of Biosciences, Università degli Studi di Milano, via Celoria 26, Milan 20133, Italy
- Matteo Chiara
  - Department of Biosciences, Università degli Studi di Milano, via Celoria 26, Milan 20133, Italy
3
Chi Y, Marini S, Wang GZ. BrainCellR: A precise cell type nomenclature pipeline for comparative analysis across brain single-cell datasets. Comput Struct Biotechnol J 2024; 23:4306-4314. [PMID: 39687760 PMCID: PMC11648093 DOI: 10.1016/j.csbj.2024.11.038] [Received: 07/02/2024] [Revised: 11/24/2024] [Accepted: 11/25/2024] [Indexed: 12/18/2024]
Abstract
Single-cell studies in neuroscience require precise cell type classification and consistent nomenclature that allows for meaningful comparisons across diverse datasets. Current approaches often lack the ability to identify fine-grained cell types and establish standardized annotations at the cluster level, hindering comprehensive understanding of the brain's cellular composition. To facilitate data integration across multiple models and datasets, we designed BrainCellR. This pipeline provides researchers with a powerful and user-friendly tool for efficient cell type classification and nomination from single-cell transcriptomic data. While initially focused on brain studies, BrainCellR is applicable to other tissues with complex cellular compositions. BrainCellR goes beyond conventional classification approaches by incorporating a standardized nomenclature system for cell types at the cluster level. This feature enables consistent and comparable annotations across different studies, promoting data integration and providing deeper insights into the complex cellular landscape of the brain. All documents for BrainCellR, including source code, user manual and tutorials, are freely available at https://github.com/WangLab-SINH/BrainCellR.
Affiliation(s)
- Yuhao Chi
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
- Simone Marini
  - Department of Epidemiology, University of Florida, Gainesville, FL, USA
- Guang-Zhong Wang
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai 200031, China
4
Ramarapu R, Wulcan JM, Chang H, Moore PF, Vernau W, Keller SM. Single cell RNA-sequencing of feline peripheral immune cells with V(D)J repertoire and cross species analysis of T lymphocytes. Front Immunol 2024; 15:1438004. [PMID: 39620216 PMCID: PMC11604454 DOI: 10.3389/fimmu.2024.1438004] [Received: 05/24/2024] [Accepted: 09/23/2024] [Indexed: 12/11/2024]
Abstract
Introduction: The domestic cat (Felis catus) is a valued companion animal and a model for virally induced cancers and immunodeficiencies. However, species-specific limitations such as a scarcity of immune cell markers constrain our ability to resolve immune cell subsets at sufficient detail. The goal of this study was to characterize circulating feline T cells and other leukocytes based on their transcriptomic landscape and T-cell receptor repertoire using single cell RNA-sequencing. Methods: Peripheral blood from 4 healthy cats was enriched for T cells by flow cytometry cell sorting using a mouse anti-feline CD5 monoclonal antibody. Libraries for whole transcriptome, αβ T cell receptor transcripts and γδ T cell receptor transcripts were constructed using the 10x Genomics Chromium Next GEM Single Cell 5' reagent kit and the Chromium Single Cell V(D)J Enrichment Kit with custom reverse primers for the feline orthologs. Results: Unsupervised clustering of whole transcriptome data revealed 7 major cell populations: T cells, neutrophils, monocytic cells, B cells, plasmacytoid dendritic cells, mast cells and platelets. Sub cluster analysis of T cells resolved naive (CD4+ and CD8+), CD4+ effector T cells, CD8+ cytotoxic T cells and γδ T cells. Cross species analysis revealed a high conservation of T cell subsets along an effector gradient with equitable representation of veterinary species (horse, dog, pig) and humans with the cat. Our V(D)J repertoire analysis identified a subset of CD8+ cytotoxic T cells with skewed TRA and TRB gene usage, conserved TRA and TRB junctional motifs, restricted TRA/TRB pairing and reduced diversity in TRG junctional length. We also identified a public γδ T cell subset with invariant TRD and TRG chains and a CD4+ TEM-like phenotype. Among monocytic cells, we resolved three clusters of classical monocytes with polarization into pro- and anti-inflammatory phenotypes in addition to a cluster of conventional dendritic cells. Lastly, our neutrophil sub clustering revealed a larger mature neutrophil cluster and a smaller exhausted/activated cluster. Discussion: Our study is the first to characterize subsets of circulating T cells utilizing an integrative approach of single cell RNA-sequencing, V(D)J repertoire analysis and cross species analysis. In addition, we characterize the transcriptome of several myeloid cell subsets and demonstrate immune cell relatedness across different species.
MESH Headings
- Animals
- Cats
- Single-Cell Analysis
- Transcriptome
- Species Specificity
- T-Lymphocytes/immunology
- T-Lymphocytes/metabolism
- T-Lymphocyte Subsets/immunology
- T-Lymphocyte Subsets/metabolism
- Dogs
- Sequence Analysis, RNA
- Receptors, Antigen, T-Cell, gamma-delta/genetics
- Receptors, Antigen, T-Cell, gamma-delta/immunology
- Receptors, Antigen, T-Cell, gamma-delta/metabolism
- RNA-Seq
- V(D)J Recombination/genetics
Affiliation(s)
- Raneesh Ramarapu
  - Department of Surgical and Radiological Sciences, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
  - Department of Anatomy, Physiology and Cell Biology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
- Judit M. Wulcan
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
- Haiyang Chang
  - Department of Mathematics and Statistics, University of Guelph, Guelph, ON, Canada
- Peter F. Moore
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
- William Vernau
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
- Stefan M. Keller
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, Davis, CA, United States
5
Xia Y, Liu Y, Li T, He S, Chang H, Wang Y, Zhang Y, Ge W. Assessing parameter efficient methods for pre-trained language model in annotating scRNA-seq data. Methods 2024; 228:12-21. [PMID: 38759908 DOI: 10.1016/j.ymeth.2024.05.007] [Received: 04/12/2024] [Revised: 04/28/2024] [Accepted: 05/10/2024] [Indexed: 05/19/2024]
Abstract
Annotating cell types of single-cell RNA sequencing (scRNA-seq) data is crucial for studying cellular heterogeneity in the tumor microenvironment. Recently, large-scale pre-trained language models (PLMs) have achieved significant progress in cell-type annotation of scRNA-seq data. This approach effectively addresses previous methods' shortcomings in performance and generalization. However, fine-tuning PLMs for different downstream tasks demands considerable computational resources, rendering it impractical. Hence, a new research branch introduces parameter-efficient fine-tuning (PEFT). This involves optimizing a few parameters while leaving the majority unchanged, leading to substantial reductions in computational expenses. Here, we utilize scBERT, a large-scale pre-trained model, to explore the capabilities of three PEFT methods in scRNA-seq cell type annotation. Extensive benchmark studies across several datasets demonstrate the superior applicability of PEFT methods. Furthermore, downstream analysis using models obtained through PEFT showcases their utility in novel cell type discovery and model interpretability for potential marker genes. Our findings underscore the considerable potential of PEFT in PLM-based cell type annotation, presenting novel perspectives for the analysis of scRNA-seq data.
Affiliation(s)
- Yucheng Xia
  - Institute of Optics and Electronics, Chinese Academy of Sciences, Chengdu, 610209, China
- Yuhang Liu
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Tianhao Li
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Sihan He
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Hong Chang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yaqing Wang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Yongqing Zhang
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
- Wenyi Ge
  - School of Computer Science, Chengdu University of Information Technology, Chengdu, 610225, China
6
Fu Q, Dong C, Liu Y, Xia X, Liu G, Zhong F, Liu L. A comparison of scRNA-seq annotation methods based on experimentally labeled immune cell subtype dataset. Brief Bioinform 2024; 25:bbae392. [PMID: 39120646 PMCID: PMC11312369 DOI: 10.1093/bib/bbae392] [Received: 04/29/2024] [Revised: 07/04/2024] [Accepted: 07/30/2024] [Indexed: 08/10/2024]
Abstract
Cell-type annotation is a critical step in single-cell data analysis. With the development of numerous cell annotation methods, it is necessary to evaluate these methods to help researchers use them effectively. Reference datasets are essential for evaluation, but currently the cell labels of reference datasets mainly come from computational methods, which may carry computational biases and may not reflect the actual cell types. This study first constructed an experimentally labeled immune cell-subtype single-cell dataset from a single batch and systematically evaluated 18 cell annotation methods. We assessed those methods under five scenarios: intra-dataset validation, immune cell-subtype validation, unsupervised clustering, inter-dataset annotation, and unknown cell-type prediction. Accuracy and the adjusted Rand index (ARI) were the evaluation metrics. The results showed that SVM, scBERT, and scDeepSort were the best-performing supervised methods. Seurat was the best-performing unsupervised clustering method, but it could not fully fit the actual cell-type distribution. Our results indicated that experimentally labeled immune cell-subtype datasets revealed the deficiencies of unsupervised clustering methods and provided new dataset support for supervised methods.
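The abstract reports accuracy and ARI as its two metrics without describing how they were computed. As a hedged illustration, the sketch below implements the standard definitions: accuracy as the fraction of matching labels, and the adjusted Rand index from the pair counts of a contingency table. The function names and the toy labels are illustrative and do not come from the study.

```python
from collections import Counter
from math import comb


def accuracy(true_labels, predicted_labels):
    """Fraction of cells whose predicted label equals the true label."""
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def adjusted_rand_index(labels_a, labels_b):
    """Adjusted Rand index between two partitions of the same cells.

    Uses the pair-counting form: (index - expected) / (max - expected).
    Cluster names do not need to match between the two partitions.
    """
    n = len(labels_a)
    if n < 2:
        return 1.0  # fewer than two cells: no pairs to disagree on
    cell_pairs = sum(comb(c, 2) for c in Counter(zip(labels_a, labels_b)).values())
    pairs_a = sum(comb(c, 2) for c in Counter(labels_a).values())
    pairs_b = sum(comb(c, 2) for c in Counter(labels_b).values())
    expected = pairs_a * pairs_b / comb(n, 2)
    max_index = (pairs_a + pairs_b) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are one cluster
    return (cell_pairs - expected) / (max_index - expected)


# Toy example: sorted (experimental) labels versus cluster assignments.
truth = ["T", "T", "B", "B"]
clusters = [1, 1, 0, 0]
print(adjusted_rand_index(truth, clusters))
print(accuracy(["T", "B", "T"], ["T", "B", "B"]))
```

Accuracy needs predicted labels in the same vocabulary as the reference, so it suits supervised annotators; ARI only compares groupings, which is why it is the usual choice for unsupervised clustering output.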
Affiliation(s)
- Qiqing Fu
  - Institutes of Biomedical Sciences, Fudan University, 200032 Shanghai, P.R. China
- Chenyu Dong
  - Institutes of Biomedical Sciences, Fudan University, 200032 Shanghai, P.R. China
- Yunhe Liu
  - Institutes of Biomedical Sciences, Fudan University, 200032 Shanghai, P.R. China
- Xiaoqiong Xia
  - Institutes of Biomedical Sciences, Fudan University, 200032 Shanghai, P.R. China
- Gang Liu
  - Institutes of Biomedical Sciences, Fudan University, 200032 Shanghai, P.R. China
- Fan Zhong
  - Intelligent Medicine Institute, Fudan University, 200032 Shanghai, P.R. China
- Lei Liu
  - Intelligent Medicine Institute, Fudan University, 200032 Shanghai, P.R. China
7
Ramarapu R, Wulcan JM, Chang H, Moore PF, Vernau W, Keller SM. Single cell RNA-sequencing of feline peripheral immune cells with V(D)J repertoire and cross species analysis of T lymphocytes. bioRxiv 2024:2024.05.21.595010. [PMID: 38826195 PMCID: PMC11142102 DOI: 10.1101/2024.05.21.595010] [Indexed: 06/04/2024]
Abstract
Introduction: The domestic cat (Felis catus) is a valued companion animal and a model for virally induced cancers and immunodeficiencies. However, species-specific limitations such as a scarcity of immune cell markers constrain our ability to resolve immune cell subsets at sufficient detail. The goal of this study was to characterize circulating feline T cells and other leukocytes based on their transcriptomic landscape and T-cell receptor repertoire using single cell RNA-sequencing. Methods: Peripheral blood from 4 healthy cats was enriched for T cells by flow cytometry cell sorting using a mouse anti-feline CD5 monoclonal antibody. Libraries for whole transcriptome, alpha/beta T cell receptor transcripts and gamma/delta T cell receptor transcripts were constructed using the 10x Genomics Chromium Next GEM Single Cell 5' reagent kit and the Chromium Single Cell V(D)J Enrichment Kit with custom reverse primers for the feline orthologs. Results: Unsupervised clustering of whole transcriptome data revealed 7 major cell populations: T cells, neutrophils, monocytic cells, B cells, plasmacytoid dendritic cells, mast cells and platelets. Sub cluster analysis of T cells resolved naive (CD4+ and CD8+), CD4+ effector T cells, CD8+ cytotoxic T cells and gamma/delta T cells. Cross species analysis revealed a high conservation of T cell subsets along an effector gradient with equitable representation of veterinary species (horse, dog, pig) and humans with the cat. Our V(D)J repertoire analysis demonstrated a skewed T-cell receptor alpha gene usage and a restricted T-cell receptor gamma junctional length in CD8+ cytotoxic T cells compared to other alpha/beta T cell subsets. Among myeloid cells, we resolved three clusters of classical monocytes with polarization into pro- and anti-inflammatory phenotypes in addition to a cluster of conventional dendritic cells. Lastly, our neutrophil sub clustering revealed a larger mature neutrophil cluster and a smaller exhausted/activated cluster. Discussion: Our study is the first to characterize subsets of circulating T cells utilizing an integrative approach of single cell RNA-sequencing, V(D)J repertoire analysis and cross species analysis. In addition, we characterize the transcriptome of several myeloid cell subsets and demonstrate immune cell relatedness across different species.
Affiliation(s)
- Raneesh Ramarapu
  - Department of Surgical and Radiological Sciences, School of Veterinary Medicine, University of California Davis, Davis, CA, USA
  - Department of Anatomy, Physiology and Cell Biology, School of Veterinary Medicine, University of California Davis, Davis, CA, USA
- Judit M Wulcan
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, CA, United States
- Haiyang Chang
  - Department of Mathematics and Statistics, University of Guelph, Guelph, ON, Canada
- Peter F Moore
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, CA, United States
- William Vernau
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, CA, United States
- Stefan M Keller
  - Department of Pathology, Microbiology and Immunology, School of Veterinary Medicine, University of California, Davis, CA, United States
8
Jia R, Ren YZ, Li PN, Gao R, Zhang YS. SCSMD: Single Cell Consistent Clustering based on Spectral Matrix Decomposition. Brief Bioinform 2024; 25:bbae273. [PMID: 38855914 PMCID: PMC11163303 DOI: 10.1093/bib/bbae273] [Received: 01/20/2024] [Revised: 04/25/2024] [Accepted: 05/30/2024] [Indexed: 06/11/2024]
Abstract
Cluster analysis, a pivotal step in single-cell sequencing data analysis, presents substantial opportunities to effectively unveil the molecular mechanisms underlying cellular heterogeneity and intercellular phenotypic variations. However, the inherent imperfections arise as different clustering algorithms yield diverse estimates of cluster numbers and cluster assignments. This study introduces Single Cell Consistent Clustering based on Spectral Matrix Decomposition (SCSMD), a comprehensive clustering approach that integrates the strengths of multiple methods to determine the optimal clustering scheme. Testing the performance of SCSMD across different distances and employing the bespoke evaluation metric, the methodological selection undergoes validation to ensure the optimal efficacy of the SCSMD. A consistent clustering test is conducted on 15 authentic scRNA-seq datasets. The application of SCSMD to human embryonic stem cell scRNA-seq data successfully identifies known cell types and delineates their developmental trajectories. Similarly, when applied to glioblastoma cells, SCSMD accurately detects pre-existing cell types and provides finer sub-division within one of the original clusters. The results affirm the robust performance of our SCSMD method in terms of both the number of clusters and cluster assignments. Moreover, we have broadened the application scope of SCSMD to encompass larger datasets, thereby furnishing additional evidence of its superiority. The findings suggest that SCSMD is poised for application to additional scRNA-seq datasets and for further downstream analyses.
Affiliation(s)
- Ran Jia
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, Shandong, China
- Ying-Zan Ren
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, Shandong, China
- Po-Nian Li
  - College of Mathematics and Informatics, South China Agricultural University, Guangzhou, Guangdong, China
- Rui Gao
  - School of Control Science and Engineering, Shandong University, Jinan 250100, Shandong, China
- Yu-Sen Zhang
  - School of Mathematics and Statistics, Shandong University, Weihai 264209, Shandong, China
9
Shu C, Street K, Breton CV, Bastain TM, Wilson ML. A review of single-cell transcriptomics and epigenomics studies in maternal and child health. Epigenomics 2024; 16:775-793. [PMID: 38709139 PMCID: PMC11318716 DOI: 10.1080/17501911.2024.2343276] [Received: 12/18/2023] [Accepted: 04/11/2024] [Indexed: 05/07/2024]
Abstract
Single-cell sequencing technologies enhance our understanding of cellular dynamics throughout pregnancy. We outlined the workflow of single-cell sequencing techniques and reviewed single-cell studies in maternal and child health, based on a literature search in PubMed. We summarized the findings from 16 single-cell atlases of the human and mammalian placenta across gestational stages and 31 single-cell studies on maternal exposures and complications including infection, obesity, diet, gestational diabetes, pre-eclampsia, environmental exposure and preterm birth. Single-cell studies provide insights into novel cell types in the placenta and cell type-specific marks associated with maternal exposures and complications.
Affiliation(s)
- Chang Shu
  - Center for Genetic Epidemiology, Division of Epidemiology & Genetics, Department of Population & Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA USA
- Kelly Street
  - Division of Biostatistics, Department of Population & Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA USA
- Carrie V Breton
  - Division of Environmental Health, Department of Population & Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA USA
- Theresa M Bastain
  - Division of Environmental Health, Department of Population & Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA USA
- Melissa L Wilson
  - Division of Disease Prevention, Policy, & Global Health, Department of Population & Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, CA USA
10
Brlek P, Bulić L, Bračić M, Projić P, Škaro V, Shah N, Shah P, Primorac D. Implementing Whole Genome Sequencing (WGS) in Clinical Practice: Advantages, Challenges, and Future Perspectives. Cells 2024; 13:504. [PMID: 38534348 PMCID: PMC10969765 DOI: 10.3390/cells13060504] [Received: 02/06/2024] [Revised: 03/04/2024] [Accepted: 03/11/2024] [Indexed: 03/28/2024]
Abstract
The integration of whole genome sequencing (WGS) into all aspects of modern medicine represents the next step in the evolution of healthcare. Using this technology, scientists and physicians can observe the entire human genome comprehensively, generating a plethora of new sequencing data. Modern computational analysis entails advanced algorithms for variant detection, as well as complex models for classification. Data science and machine learning play a crucial role in the processing and interpretation of results, using enormous databases and statistics to discover new and support current genotype-phenotype correlations. In clinical practice, this technology has greatly enabled the development of personalized medicine, approaching each patient individually and in accordance with their genetic and biochemical profile. The most propulsive areas include rare disease genomics, oncogenomics, pharmacogenomics, neonatal screening, and infectious disease genomics. Another crucial application of WGS lies in the field of multi-omics, working towards the complete integration of human biomolecular data. Further technological development of sequencing technologies has led to the birth of third and fourth-generation sequencing, which include long-read sequencing, single-cell genomics, and nanopore sequencing. These technologies, alongside their continued implementation into medical research and practice, show great promise for the future of the field of medicine.
Affiliation(s)
- Petar Brlek
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
  - International Center for Applied Biological Research, 10000 Zagreb, Croatia
  - School of Medicine, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
- Luka Bulić
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
- Matea Bračić
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
- Petar Projić
  - International Center for Applied Biological Research, 10000 Zagreb, Croatia
- Nidhi Shah
  - Dartmouth Hitchcock Medical Center, Lebanon, NH 03766, USA
- Parth Shah
  - Dartmouth Hitchcock Medical Center, Lebanon, NH 03766, USA
- Dragan Primorac
  - St. Catherine Specialty Hospital, 10000 Zagreb, Croatia
  - International Center for Applied Biological Research, 10000 Zagreb, Croatia
  - School of Medicine, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
  - Medical School, University of Split, 21000 Split, Croatia
  - Eberly College of Science, The Pennsylvania State University, State College, PA 16802, USA
  - The Henry C. Lee College of Criminal Justice and Forensic Sciences, University of New Haven, West Haven, CT 06516, USA
  - REGIOMED Kliniken, 96450 Coburg, Germany
  - Medical School, University of Rijeka, 51000 Rijeka, Croatia
  - Faculty of Dental Medicine and Health, Josip Juraj Strossmayer University of Osijek, 31000 Osijek, Croatia
  - Medical School, University of Mostar, 88000 Mostar, Bosnia and Herzegovina
  - National Forensic Sciences University, Gujarat 382007, India
11
Wang X, Chai Z, Li S, Liu Y, Li C, Jiang Y, Liu Q. CTISL: a dynamic stacking multi-class classification approach for identifying cell types from single-cell RNA-seq data. Bioinformatics 2024; 40:btae063. [PMID: 38317054 PMCID: PMC10873586 DOI: 10.1093/bioinformatics/btae063] [Received: 02/15/2024] [Revised: 02/15/2024] [Accepted: 02/15/2024] [Indexed: 02/07/2024]
Abstract
Motivation: Effective identification of cell types is of critical importance in single-cell RNA-sequencing (scRNA-seq) data analysis. To date, many supervised machine learning-based predictors have been implemented to identify cell types from scRNA-seq datasets. Despite the technical advances of these state-of-the-art tools, most existing predictors are single classifiers, whose performance can still be significantly improved. It is therefore highly desirable to employ an ensemble learning strategy to develop more accurate computational models for robust and comprehensive identification of cell types in scRNA-seq datasets. Results: We propose a two-layer stacking model, termed CTISL (Cell Type Identification by Stacking ensemble Learning), which integrates multiple classifiers to identify cell types. In the first layer, given a reference scRNA-seq dataset with known cell types, CTISL dynamically combines multiple cell-type-specific classifiers (i.e. support-vector machine and logistic regression) as the base learners, whose outputs serve as the input of a meta-classifier in the second layer. We conducted a total of 24 benchmarking experiments on 17 human and mouse scRNA-seq datasets to evaluate and compare the prediction performance of CTISL and other state-of-the-art predictors. The experimental results demonstrate that CTISL achieves superior or competitive performance compared to these state-of-the-art approaches. We anticipate that CTISL can serve as a useful and reliable tool for cost-effective identification of cell types from scRNA-seq datasets. Availability and implementation: The webserver and source code are freely available at http://bigdata.biocie.cn/CTISLweb/home and https://zenodo.org/records/10568906, respectively.
Affiliation(s)
- Xiao Wang
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Ziyi Chai
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Shaohua Li
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
- Yan Liu
  - School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing 210094, China
- Chen Li
  - Department of Biochemistry and Molecular Biology, Monash University, Melbourne, VIC 3800, Australia
- Yu Jiang
  - Department of Animal Genetics, Breeding and Reproduction, College of Animal Science and Technology, Northwest A&F University, Yangling 712100, China
- Quanzhong Liu
  - Department of Software Engineering, College of Information Engineering, Northwest A&F University, Yangling 712100, China
  - Shaanxi Engineering Research Center of Agricultural Information Intelligent Perception and Analysis, Northwest A&F University, Yangling 712100, China
|
12
|
Li Z, Yang P. Inferring Novel Cells in Single-Cell RNA-Sequencing Data. Methods Mol Biol 2024; 2812:143-154. [PMID: 39068360 DOI: 10.1007/978-1-0716-3886-6_7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/30/2024]
Abstract
Single-cell RNA-sequencing (scRNA-seq) is a powerful technology that allows researchers to study gene expression heterogeneity within a tissue or cell population. One of the major advantages of scRNA-seq is that it allows researchers to identify and characterize novel cell types or subpopulations within a tissue that may be missed by traditional bulk RNA-sequencing methods. Although many existing methods have been developed to recognize known cell types, inferring novel cells may still be challenging in routine scRNA-seq analysis. Here we describe three lines of methods for inferring novel cells: unsupervised and outlier-detection-based methods, supervised and semi-supervised methods, and copy number variation (CNV)-based methods, as well as the situations to which each method applies. We also provide implementation code and example usages to illustrate the available methods.
Affiliation(s)
- Ziyi Li
  - Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA.
- Peng Yang
  - Department of Statistics, Rice University, Houston, TX, USA
|
13
|
Li Y, Wu M, Ma S, Wu M. ZINBMM: a general mixture model for simultaneous clustering and gene selection using single-cell transcriptomic data. Genome Biol 2023; 24:208. [PMID: 37697330 PMCID: PMC10496184 DOI: 10.1186/s13059-023-03046-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2022] [Accepted: 08/22/2023] [Indexed: 09/13/2023]
Abstract
Clustering is a critical component of single-cell RNA sequencing (scRNA-seq) data analysis and can help reveal cell types and infer cell lineages. Despite considerable successes, there are few methods tailored to investigating cluster-specific genes contributing to cell heterogeneity, which can promote biological understanding of cell heterogeneity. In this study, we propose a zero-inflated negative binomial mixture model (ZINBMM) that simultaneously achieves effective scRNA-seq data clustering and gene selection. ZINBMM conducts a systemic analysis on raw counts, accommodating both batch effects and dropout events. Simulations and the analysis of five scRNA-seq datasets demonstrate the practical applicability of ZINBMM.
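Note: the abstract above names a zero-inflated negative binomial (ZINB) distribution but does not give its form. The sketch below shows the standard textbook ZINB probability mass function for a single count, assuming a mean/dispersion parameterization (`mu`, `theta`) and a zero-inflation probability `pi`. These names, the parameter values, and the parameterization are illustrative assumptions and may differ from those used in ZINBMM; the mixture over clusters and the gene-selection step are not shown.

```python
import math


def nb_log_pmf(y, mu, theta):
    """Log-pmf of a negative binomial with mean mu > 0 and dispersion theta > 0."""
    return (
        math.lgamma(y + theta) - math.lgamma(theta) - math.lgamma(y + 1)
        + theta * math.log(theta / (theta + mu))
        + y * math.log(mu / (theta + mu))
    )


def zinb_pmf(y, mu, theta, pi):
    """pmf of a zero-inflated NB: with probability pi the count is a forced zero
    (a dropout-like event); otherwise it is drawn from the negative binomial."""
    nb = math.exp(nb_log_pmf(y, mu, theta))
    if y == 0:
        return pi + (1.0 - pi) * nb
    return (1.0 - pi) * nb


# Hypothetical parameters for one gene in one cluster (not taken from the paper).
mu, theta, pi = 2.0, 1.5, 0.3
probs = [zinb_pmf(y, mu, theta, pi) for y in range(200)]
total = sum(probs)  # should be very close to 1 for a valid pmf
```

The zero-inflation term only changes the probability of observing a zero, which is why such models are a common choice for raw scRNA-seq counts with excess zeros.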
Affiliation(s)
- Yang Li
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - RSS and China-Re Life Joint Lab on Public Health and Risk Management, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Mingcong Wu
  - Center for Applied Statistics and School of Statistics, Renmin University of China, Beijing, China
  - Statistical Consulting Center, Renmin University of China, Beijing, China
- Shuangge Ma
  - Department of Biostatistics, Yale University, New Haven, USA
- Mengyun Wu
  - School of Statistics and Management, Shanghai University of Finance and Economics, Shanghai, China.
|
14
|
Nie X, Qin D, Zhou X, Duo H, Hao Y, Li B, Liang G. Clustering ensemble in scRNA-seq data analysis: Methods, applications and challenges. Comput Biol Med 2023; 159:106939. [PMID: 37075602 DOI: 10.1016/j.compbiomed.2023.106939] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/20/2023] [Revised: 03/31/2023] [Accepted: 04/14/2023] [Indexed: 04/21/2023]
Abstract
With the rapid development of single-cell RNA-sequencing techniques, various computational methods and tools have been proposed to analyze these high-throughput data, which has accelerated the discovery of potential biological information. As one of the core steps of single-cell transcriptome data analysis, clustering plays a crucial role in identifying cell types and interpreting cellular heterogeneity. However, the results generated by different clustering methods differ, and such unstable partitions can affect the accuracy of the analysis to a certain extent. To overcome this challenge and obtain more accurate results, clustering ensembles are now frequently applied to cluster analysis of single-cell transcriptome datasets, and the results generated by nearly all clustering ensembles are more reliable than those from most single clustering partitions. In this review, we summarize applications and challenges of the clustering ensemble method in single-cell transcriptome data analysis, and provide constructive thoughts and references for researchers in this field.
Affiliation(s)
- Xiner Nie
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, China; College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Dan Qin
  - Department of Biology, College of Science, Northeastern University, Boston, MA, 02115, USA
- Xinyi Zhou
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Hongrui Duo
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Youjin Hao
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China
- Bo Li
  - College of Life Sciences, Chongqing Normal University, Chongqing, 400044, PR China.
- Guizhao Liang
  - Key Laboratory of Biorheological Science and Technology, Ministry of Education, Bioengineering College, Chongqing University, Chongqing, 400044, China.
|
15
|
Gundogdu P, Alamo I, Nepomuceno-Chamorro IA, Dopazo J, Loucera C. SigPrimedNet: A Signaling-Informed Neural Network for scRNA-seq Annotation of Known and Unknown Cell Types. Biology 2023; 12:biology12040579. [PMID: 37106779 PMCID: PMC10135788 DOI: 10.3390/biology12040579] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 12/27/2022] [Revised: 03/04/2023] [Accepted: 04/08/2023] [Indexed: 04/29/2023]
Abstract
Single-cell RNA sequencing is increasing our understanding of the behavior of complex tissues or organs by providing unprecedented detail on the complex cell type landscape at the level of individual cells. Cell type definition and functional annotation are key steps to understanding the molecular processes behind the underlying cellular communication machinery. However, the exponential growth of scRNA-seq data has made the task of manually annotating cells unfeasible, due not only to the unparalleled resolution of the technology but also to the ever-increasing heterogeneity of the data. Many supervised and unsupervised methods have been proposed to automatically annotate cells. Supervised approaches for cell-type annotation outperform unsupervised methods except when new (unknown) cell types are present. Here, we introduce SigPrimedNet, an artificial neural network approach that leverages (i) efficient training by means of a sparsity-inducing signaling circuits-informed layer, (ii) feature representation learning through supervised training, and (iii) unknown cell-type identification by fitting an anomaly detection method on the learned representation. We show that SigPrimedNet can efficiently annotate known cell types while keeping a low false-positive rate for unseen cells across a set of publicly available datasets. In addition, the learned representation acts as a proxy for signaling circuit activity measurements, which provide useful estimations of the cell functionalities.
Affiliation(s)
- Pelin Gundogdu
  - Computational Medicine Platform, Andalusian Public Foundation Progress and Health-FPS, 41013 Sevilla, Spain
  - Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Hospital Virgen del Rocio, 41013 Sevilla, Spain
- Inmaculada Alamo
  - Computational Medicine Platform, Andalusian Public Foundation Progress and Health-FPS, 41013 Sevilla, Spain
  - Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Hospital Virgen del Rocio, 41013 Sevilla, Spain
- Joaquin Dopazo
  - Computational Medicine Platform, Andalusian Public Foundation Progress and Health-FPS, 41013 Sevilla, Spain
  - Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Hospital Virgen del Rocio, 41013 Sevilla, Spain
  - Bioinformatics in Rare Diseases (BiER), Centro de Investigación Biomédica en Red de Enfermedades Raras (CIBERER), FPS, Hospital Virgen del Rocío, 41013 Sevilla, Spain
  - FPS/ELIXIR-es, Hospital Virgen del Rocío, 42013 Sevilla, Spain
- Carlos Loucera
  - Computational Medicine Platform, Andalusian Public Foundation Progress and Health-FPS, 41013 Sevilla, Spain
  - Computational Systems Medicine, Institute of Biomedicine of Seville (IBIS), Hospital Virgen del Rocio, 41013 Sevilla, Spain
|
16
|
Ma W, Lu J, Wu H. Cellcano: supervised cell type identification for single cell ATAC-seq data. Nat Commun 2023; 14:1864. [PMID: 37012226 PMCID: PMC10070275 DOI: 10.1038/s41467-023-37439-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 8.0] [Received: 06/20/2022] [Accepted: 03/15/2023] [Indexed: 04/05/2023]
Abstract
Computational cell type identification is a fundamental step in single-cell omics data analysis. Supervised celltyping methods have gained increasing popularity in single-cell RNA-seq data analysis because of their superior performance and the availability of high-quality reference datasets. Recent technological advances in profiling chromatin accessibility at single-cell resolution (scATAC-seq) have brought new insights into the understanding of epigenetic heterogeneity. With the continuous accumulation of scATAC-seq datasets, a supervised celltyping method specifically designed for scATAC-seq is urgently needed. Here we develop Cellcano, a computational method based on a two-round supervised learning algorithm to identify cell types from scATAC-seq data. The method alleviates the distributional shift between reference and target data and improves the prediction performance. After systematically benchmarking Cellcano on 50 well-designed celltyping tasks from various datasets, we show that Cellcano is accurate, robust, and computationally efficient. Cellcano is well-documented and freely available at https://marvinquiet.github.io/Cellcano/.
Affiliation(s)
- Wenjing Ma
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
- Jiaying Lu
  - Department of Computer Science, Emory University, 400 Dowman Drive, Atlanta, GA, 30322, USA
- Hao Wu
  - Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, 1068 Xueyuan Avenue, Shenzhen University Town, Shenzhen, 518055, P. R. China.
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA.
|
17
|
Deng T, Chen S, Zhang Y, Xu Y, Feng D, Wu H, Sun X. A cofunctional grouping-based approach for non-redundant feature gene selection in unannotated single-cell RNA-seq analysis. Brief Bioinform 2023; 24:bbad042. [PMID: 36754847 PMCID: PMC10025445 DOI: 10.1093/bib/bbad042] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/25/2022] [Revised: 12/05/2022] [Accepted: 01/18/2023] [Indexed: 02/10/2023]
Abstract
Feature gene selection has a significant impact on the performance of cell clustering in single-cell RNA sequencing (scRNA-seq) analysis. A well-rounded feature selection (FS) method should consider relevance, redundancy and complementarity of the features. Yet most existing FS methods focus on gene relevance to the cell types but neglect redundancy and complementarity, which undermines the cell clustering performance. We develop a novel computational method, GeneClust, to select feature genes for scRNA-seq cell clustering. GeneClust groups genes based on their expression profiles, then selects genes with the aim of maximizing relevance, minimizing redundancy and preserving complementarity. It can work as a plug-in tool for FS with any existing cell clustering method. Extensive benchmark results demonstrate that GeneClust significantly improves the clustering performance. Moreover, GeneClust can group cofunctional genes in biological processes and pathways into clusters, thus providing a means of investigating gene interactions and identifying potential genes relevant to biological characteristics of the dataset. GeneClust is freely available at https://github.com/ToryDeng/scGeneClust.
Affiliation(s)
- Tao Deng
  - School of Data Science, The Chinese University of Hong Kong—Shenzhen, Guangdong, China
- Siyu Chen
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Ying Zhang
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Yuanbin Xu
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Da Feng
  - School of Pharmacy, Tongji Medical College, Huazhong University of Sciences and Technology, Hubei, China
- Hao Wu
  - Department of Biostatistics and Bioinformatics, Rollins School of Public Health, Emory University, GA, USA
  - Faculty of Computer Science and Control Engineering, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
- Xiaobo Sun
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
|
18
|
Knight CH, Khan F, Patel A, Gill US, Okosun J, Wang J. IBRAP: integrated benchmarking single-cell RNA-sequencing analytical pipeline. Brief Bioinform 2023; 24:bbad061. [PMID: 36847692 PMCID: PMC10025434 DOI: 10.1093/bib/bbad061] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 09/28/2022] [Revised: 12/19/2022] [Accepted: 02/02/2023] [Indexed: 03/01/2023]
Abstract
Single-cell ribonucleic acid (RNA)-sequencing (scRNA-seq) is a powerful tool to study cellular heterogeneity. The high dimensional data generated from this technology are complex and require specialized expertise for analysis and interpretation. The core of scRNA-seq data analysis contains several key analytical steps, which include pre-processing, quality control, normalization, dimensionality reduction, integration and clustering. Each step often has many algorithms developed with varied underlying assumptions and implications. With such a diverse choice of tools available, benchmarking analyses have compared their performances and demonstrated that tools operate differentially according to the data types and complexity. Here, we present Integrated Benchmarking scRNA-seq Analytical Pipeline (IBRAP), which contains a suite of analytical components that can be interchanged throughout the pipeline alongside multiple benchmarking metrics that enable users to compare results and determine the optimal pipeline combinations for their data. We apply IBRAP to single- and multi-sample integration analysis using primary pancreatic tissue, cancer cell line and simulated data accompanied with ground truth cell labels, demonstrating the interchangeable and benchmarking functionality of IBRAP. Our results confirm that the optimal pipelines are dependent on individual samples and studies, further supporting the rationale and necessity of our tool. We then compare reference-based cell annotation with unsupervised analysis, both included in IBRAP, and demonstrate the superiority of the reference-based method in identifying robust major and minor cell types. Thus, IBRAP presents a valuable tool to integrate multiple samples and studies to create reference maps of normal and diseased tissues, facilitating novel biological discovery using the vast volume of scRNA-seq data available.
Affiliation(s)
- Connor H Knight
  - Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ
- Faraz Khan
  - Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ
- Ankit Patel
  - Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ
- Upkar S Gill
  - Centre for Immunobiology, Blizard Institute, Faculty of Medicine and Dentistry, Queen Mary University of London, London E1 2AT, United Kingdom
- Jessica Okosun
  - Centre for Haemato-Oncology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ
- Jun Wang
  - Centre for Cancer Genomics and Computational Biology, Barts Cancer Institute, Queen Mary University of London, London EC1M 6BQ
|
19
|
Dong X, Bacher R. Analysis of Single-Cell RNA-seq Data. Methods Mol Biol 2023; 2629:95-114. [PMID: 36929075 DOI: 10.1007/978-1-0716-2986-4_6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023]
Abstract
As single-cell RNA sequencing experiments continue to advance scientific discoveries across biological disciplines, an increasing number of analysis tools and workflows for analyzing the data have been developed. In this chapter, we describe a standard workflow and elaborate on relevant data analysis tools for analyzing single-cell RNA sequencing data. We provide recommendations for the appropriate use of commonly used methods, with code examples and analysis interpretations.
Affiliation(s)
- Xiaoru Dong
  - Department of Biostatistics, University of Florida, Gainesville, Florida, USA
- Rhonda Bacher
  - Department of Biostatistics, University of Florida, Gainesville, Florida, USA.
|
20
|
Su M, Pan T, Chen QZ, Zhou WW, Gong Y, Xu G, Yan HY, Li S, Shi QZ, Zhang Y, He X, Jiang CJ, Fan SC, Li X, Cairns MJ, Wang X, Li YS. Data analysis guidelines for single-cell RNA-seq in biomedical studies and clinical applications. Mil Med Res 2022; 9:68. [PMID: 36461064 PMCID: PMC9716519 DOI: 10.1186/s40779-022-00434-8] [Citation(s) in RCA: 35] [Impact Index Per Article: 11.7] [Received: 09/27/2022] [Accepted: 11/18/2022] [Indexed: 12/03/2022]
Abstract
The application of single-cell RNA sequencing (scRNA-seq) in biomedical research has advanced our understanding of the pathogenesis of disease and provided valuable insights into new diagnostic and therapeutic strategies. With the expansion of capacity for high-throughput scRNA-seq, including clinical samples, the analysis of these huge volumes of data has become a daunting prospect for researchers entering this field. Here, we review the workflow for typical scRNA-seq data analysis, covering raw data processing and quality control, basic data analysis applicable for almost all scRNA-seq data sets, and advanced data analysis that should be tailored to specific scientific questions. While summarizing the current methods for each analysis step, we also provide an online repository of software and wrapped-up scripts to support the implementation. Recommendations and caveats are pointed out for some specific analysis tasks and approaches. We hope this resource will be helpful to researchers engaging with scRNA-seq, in particular for emerging clinical applications.
Affiliation(s)
- Min Su
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Tao Pan
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Qiu-Zhen Chen
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Wei-Wei Zhou
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
- Yi Gong
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
  - Department of Immunology, Nanjing Medical University, Nanjing, 211166 China
- Gang Xu
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Huan-Yu Yan
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Si Li
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Qiao-Zhen Shi
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Ya Zhang
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
- Xiao He
  - Department of Laboratory Medicine, Women and Children’s Hospital of Chongqing Medical University, Chongqing, 401174 China
- Shi-Cai Fan
  - Shenzhen Institute for Advanced Study, University of Electronic Science and Technology of China, Shenzhen, 518110 Guangdong China
- Xia Li
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081 Heilongjiang China
- Murray J. Cairns
  - School of Biomedical Sciences and Pharmacy, Faculty of Health and Medicine, the University of Newcastle, University Drive, Callaghan, NSW 2308 Australia
  - Precision Medicine Research Program, Hunter Medical Research Institute, New Lambton Heights, NSW 2305 Australia
- Xi Wang
  - State Key Laboratory of Reproductive Medicine, Nanjing Medical University, Nanjing, 211166 China
- Yong-Sheng Li
  - College of Biomedical Information and Engineering, the First Affiliated Hospital of Hainan Medical University, Hainan Medical University, Haikou, 571199 Hainan China
|
21
|
Watson ER, Mora A, Taherian Fard A, Mar JC. How does the structure of data impact cell-cell similarity? Evaluating how structural properties influence the performance of proximity metrics in single cell RNA-seq data. Brief Bioinform 2022; 23:bbac387. [PMID: 36151725 PMCID: PMC9677483 DOI: 10.1093/bib/bbac387] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/09/2022] [Revised: 07/26/2022] [Accepted: 08/11/2022] [Indexed: 12/14/2022]
Abstract
Accurately identifying cell populations is paramount to the quality of downstream analyses and overall interpretations of single-cell RNA-seq (scRNA-seq) datasets but remains a challenge. The quality of single-cell clustering depends on the proximity metric used to generate cell-to-cell distances. Accordingly, proximity metrics have been benchmarked for scRNA-seq clustering, typically with results averaged across datasets to identify a highest performing metric. However, the 'best-performing' metric varies between studies, with the performance differing significantly between datasets. This suggests that the unique structural properties of an scRNA-seq dataset, specific to the biological system under study, have a substantial impact on proximity metric performance. Previous benchmarking studies have not factored the structural properties into their evaluations. To address this gap, we developed a framework for the in-depth evaluation of the performance of 17 proximity metrics with respect to core structural properties of scRNA-seq data, including sparsity, dimensionality, cell-population distribution and rarity. We find that clustering performance can be improved substantially by the selection of an appropriate proximity metric and neighbourhood size for the structural properties of a dataset, in addition to performing suitable pre-processing and dimensionality reduction. Furthermore, popular metrics such as Euclidean and Manhattan distance performed poorly in comparison to several less commonly applied metrics, suggesting that the default metric for many scRNA-seq methods should be re-evaluated. Our findings highlight the critical nature of tailoring scRNA-seq analysis pipelines to the dataset under study and provide practical guidance for researchers looking to optimize cell-similarity search for the structural properties of their own data.
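Note: as a toy illustration of why the choice of proximity metric can matter, the sketch below compares the nearest neighbour of one cell under Euclidean, Manhattan and cosine distance. The three-gene expression vectors and cell names are hypothetical, and cosine distance is used only as a familiar alternative; the abstract does not say which 17 metrics were evaluated, and this example does not reproduce the paper's framework.

```python
import math


def euclidean(u, v):
    """Straight-line distance between two expression vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def manhattan(u, v):
    """Sum of absolute per-gene differences."""
    return sum(abs(a - b) for a, b in zip(u, v))


def cosine_distance(u, v):
    """1 - cosine similarity; insensitive to overall scale (e.g. sequencing depth)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def nearest(query, cells, metric):
    """Name of the cell closest to the query under the given metric."""
    return min(cells, key=lambda name: metric(query, cells[name]))


# Hypothetical counts for three genes. The first candidate expresses the same
# gene as the query but at higher depth; the second has a different profile.
query = [2, 0, 0]
cells = {
    "same_profile_deeper": [10, 0, 0],
    "different_profile": [1, 1, 1],
}
metrics = {
    "euclidean": euclidean,
    "manhattan": manhattan,
    "cosine": cosine_distance,
}
choice = {name: nearest(query, cells, fn) for name, fn in metrics.items()}
```

In this toy case the magnitude-based metrics pick the cell with the different profile, while the scale-insensitive metric picks the cell with the same profile at higher depth, so the neighbourhood graph built for clustering would differ by metric.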
Affiliation(s)
- Ebony Rose Watson
  - Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, Australia
- Ariane Mora
  - School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, Australia
- Atefeh Taherian Fard
  - Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, Australia
- Jessica Cara Mar
  - Australian Institute for Bioengineering and Nanotechnology, The University of Queensland, Brisbane, QLD, Australia
|
22
|
Liu G, Li M, Wang H, Lin S, Xu J, Li R, Tang M, Li C. D3K: The Dissimilarity-Density-Dynamic Radius K-means Clustering Algorithm for scRNA-Seq Data. Front Genet 2022; 13:912711. [PMID: 35846121 PMCID: PMC9284269 DOI: 10.3389/fgene.2022.912711] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/04/2022] [Accepted: 04/25/2022] [Indexed: 12/02/2022]
Abstract
Single-cell sequencing data sets have always been challenging to cluster because of their high dimensionality and many noise points. The traditional K-means algorithm is not suitable for this type of data. Therefore, this study proposes a Dissimilarity-Density-Dynamic Radius K-means clustering algorithm. The algorithm adds a dynamic radius parameter to the calculation. It flexibly adjusts the active radius according to the data characteristics, which can eliminate the influence of noise points and optimize the clustering results. At the same time, the algorithm calculates weights from the dissimilarity density of the data set, the average contrast of candidate clusters, and the dissimilarity of candidate clusters. It obtains a set of high-quality initial center points, which addresses the randomness of the K-means algorithm in selecting the center points. Finally, compared with similar algorithms, this algorithm shows a better clustering effect on single-cell data. Each clustering index is higher than those of other single-cell clustering algorithms, which overcomes the shortcomings of the traditional K-means algorithm.
Affiliation(s)
- Guoyun Liu
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Manzhi Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
  - Key Laboratory of Data Science and Smart Education, Ministry of Education, Hainan Normal University, Haikou, China
  - *Correspondence: Manzhi Li,
- Hongtao Wang
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Shijun Lin
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
- Junlin Xu
  - College of Information Science and Engineering, Hunan University, Changsha, China
- Ruixi Li
  - Geneis Beijing Co., Ltd., Beijing, China
- Min Tang
  - School of Life Sciences, Jiangsu University, Zhenjiang, China
- Chun Li
  - School of Mathematics and Statistics, Hainan Normal University, Haikou, China
|