1
Anuarbekov A, Kléma J. Utilizing RNA-seq data in monotone iterative generalized linear model to elevate prior knowledge quality of the circRNA-miRNA-mRNA regulatory axis. BMC Bioinformatics 2025; 26:139. [PMID: 40426030 PMCID: PMC12117772 DOI: 10.1186/s12859-025-06161-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/25/2025] [Accepted: 05/07/2025] [Indexed: 05/29/2025] Open
Abstract
BACKGROUND Current experimental data on RNA interactions remain limited, particularly for non-coding RNAs, many of which have only recently been discovered and operate within complex regulatory networks. Researchers often rely on in-silico interaction detection algorithms, such as TargetScan, which are based on biochemical sequence alignment. However, these algorithms have limited performance. RNA-seq expression data can provide valuable insights into regulatory networks, especially for understudied interactions such as circRNA-miRNA-mRNA. By integrating RNA-seq data with prior interaction networks obtained experimentally or through in-silico predictions, researchers can discover novel interactions, validate existing ones, and improve interaction prediction accuracy. RESULTS This paper introduces Pi-GMIFS, an extension of the generalized monotone incremental forward stagewise (GMIFS) regression algorithm that incorporates prior knowledge. The algorithm first estimates prior response values through a prior-only regression, interpolates between these prior values and the original data, and then applies the GMIFS method. Our experimental results on circRNA-miRNA-mRNA regulatory interaction networks demonstrate that Pi-GMIFS consistently enhances precision and recall in RNA interaction prediction by leveraging implicit information from bulk RNA-seq expression data, outperforming the initial prior knowledge. CONCLUSION Pi-GMIFS is a robust algorithm for inferring acyclic interaction networks when the variable ordering is known. Its effectiveness was confirmed through extensive experimental validation. We proved that RNA-seq data of a representative size help infer previously unknown interactions available in TarBase v9 and improve the quality of circRNA disease annotation.
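The three steps named in the abstract (a prior-only regression, interpolation between the prior response and the observed data, then a stagewise fit) can be illustrated with a minimal sketch. It is not the authors' implementation: it uses a plain Gaussian linear model rather than a generalized one, and the function names `forward_stagewise` and `prior_interpolated_fit`, the boolean `prior_mask`, and the interpolation weight `alpha` are all hypothetical choices made for illustration.

```python
import numpy as np

def forward_stagewise(X, y, step=0.01, n_steps=2000):
    """Incremental forward stagewise regression for a linear model (assumed simplification)."""
    X = X - X.mean(axis=0)          # center predictors
    resid = y - y.mean()            # center response
    beta = np.zeros(X.shape[1])
    for _ in range(n_steps):
        corr = X.T @ resid          # correlation of each predictor with the residual
        j = int(np.argmax(np.abs(corr)))
        if abs(corr[j]) < 1e-10:
            break
        delta = step * np.sign(corr[j])
        beta[j] += delta            # small move on the most correlated predictor
        resid = resid - delta * X[:, j]
    return beta

def prior_interpolated_fit(X, y, prior_mask, alpha=0.5, step=0.01, n_steps=2000):
    """Hypothetical three-step procedure: prior-only fit, interpolation, stagewise fit."""
    prior_mask = np.asarray(prior_mask, dtype=bool)
    # Step 1: regression restricted to predictors supported by prior knowledge.
    y_prior = np.full(len(y), y.mean())
    if prior_mask.any():
        Xp = X[:, prior_mask] - X[:, prior_mask].mean(axis=0)
        coef, *_ = np.linalg.lstsq(Xp, y - y.mean(), rcond=None)
        y_prior = y.mean() + Xp @ coef
    # Step 2: interpolate between the prior response and the observed response.
    y_mix = alpha * y_prior + (1.0 - alpha) * y
    # Step 3: stagewise fit on the interpolated response using all predictors.
    return forward_stagewise(X, y_mix, step, n_steps)
```

With `alpha=0` the sketch reduces to an ordinary stagewise fit on the data; larger values pull the fitted coefficients toward interactions already present in the prior.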
Affiliation(s)
- Alikhan Anuarbekov
- Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 16627, Prague, Czech Republic
- Jiří Kléma
- Department of Computer Science, Faculty of Electrical Engineering, Czech Technical University in Prague, Technicka 2, 16627, Prague, Czech Republic.

2
Sun F, Li H, Sun D, Fu S, Gu L, Shao X, Wang Q, Dong X, Duan B, Xing F, Wu J, Xiao M, Zhao F, Han JDJ, Liu Q, Fan X, Li C, Wang C, Shi T. Single-cell omics: experimental workflow, data analyses and applications. Science China Life Sciences 2025; 68:5-102. [PMID: 39060615 DOI: 10.1007/s11427-023-2561-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 04/18/2024] [Indexed: 07/28/2024]
Abstract
Cells are the fundamental units of biological systems and exhibit unique development trajectories and molecular features. Our exploration of how the genomes orchestrate the formation and maintenance of each cell, and control the cellular phenotypes of various organisms, is both captivating and intricate. Since the inception of the first single-cell RNA technology, technologies related to single-cell sequencing have advanced rapidly. These technologies have expanded horizontally to include single-cell genome, epigenome, proteome, and metabolome, while vertically, they have progressed to integrate multiple omics data and incorporate additional information such as spatial scRNA-seq and CRISPR screening. Single-cell omics represent a groundbreaking advancement in the biomedical field, offering profound insights into the understanding of complex diseases, including cancers. Here, we comprehensively summarize recent advances in single-cell omics technologies, with a specific focus on the methodology section. This overview aims to guide researchers in selecting appropriate methods for single-cell sequencing and related data analysis.
Affiliation(s)
- Fengying Sun
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China
- Haoyan Li
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- Dongqing Sun
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Shaliu Fu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Lei Gu
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Xin Shao
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China
- Qinqin Wang
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China
- Xin Dong
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Bin Duan
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China
- Feiyang Xing
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China
- Jun Wu
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China
- Minmin Xiao
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
- Fangqing Zhao
- Beijing Institutes of Life Science, Chinese Academy of Sciences, Beijing, 100101, China.
- Jing-Dong J Han
- Peking-Tsinghua Center for Life Sciences, Academy for Advanced Interdisciplinary Studies, Center for Quantitative Biology (CQB), Peking University, Beijing, 100871, China.
- Qi Liu
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Translational Medical Center for Stem Cell Therapy and Institute for Regenerative Medicine, Shanghai East Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Research Institute of Intelligent Computing, Zhejiang Lab, Hangzhou, 311121, China.
- Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai, 201210, China.
- Xiaohui Fan
- Pharmaceutical Informatics Institute, College of Pharmaceutical Sciences, Zhejiang University, Hangzhou, 310058, China.
- National Key Laboratory of Chinese Medicine Modernization, Innovation Center of Yangtze River Delta, Zhejiang University, Jiaxing, 314103, China.
- Zhejiang Key Laboratory of Precision Diagnosis and Therapy for Major Gynecological Diseases, Women's Hospital, Zhejiang University School of Medicine, Hangzhou, 310006, China.
- Chen Li
- Center for Single-cell Omics, School of Public Health, Shanghai Jiao Tong University School of Medicine, Shanghai, 200025, China.
- Chenfei Wang
- Key Laboratory of Spine and Spinal Cord Injury Repair and Regeneration (Tongji University), Ministry of Education, Orthopaedic Department, Tongji Hospital, Bioinformatics Department, School of Life Sciences and Technology, Tongji University, Shanghai, 200082, China.
- Frontier Science Center for Stem Cells, School of Life Sciences and Technology, Tongji University, Shanghai, 200092, China.
- Tieliu Shi
- Department of Clinical Laboratory, the Affiliated Wuhu Hospital of East China Normal University (The Second People's Hospital of Wuhu City), Wuhu, 241000, China.
- Center for Bioinformatics and Computational Biology, Shanghai Key Laboratory of Regulatory Biology, the Institute of Biomedical Sciences and School of Life Sciences, East China Normal University, Shanghai, 200241, China.
- Key Laboratory of Advanced Theory and Application in Statistics and Data Science-MOE, School of Statistics, East China Normal University, Shanghai, 200062, China.

3
Cuevas-Diaz Duran R, Wei H, Wu J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics 2024; 25:444. [PMID: 38711017 PMCID: PMC11073985 DOI: 10.1186/s12864-024-10364-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 04/29/2024] [Indexed: 05/08/2024] Open
Abstract
BACKGROUND Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data. MAIN BODY The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this aim, we first give an overview of the different single-cell sequencing platforms and methods commonly used, including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. CONCLUSIONS According to the correction performed, normalization methods can be broadly classified as within-sample and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has pros and cons and makes different statistical assumptions. However, no single normalization method performs best. Instead, metrics such as silhouette width, K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.
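Of the metrics the abstract recommends, silhouette width is simple enough to show directly. The sketch below assumes Euclidean distances on an already normalized, low-dimensional embedding with at least two clusters; the function name `silhouette_width` and the convention of scoring singleton clusters as 0 are choices made for this illustration, not something taken from the review.

```python
import numpy as np

def silhouette_width(X, labels):
    """Mean silhouette width of points X (n x d) under cluster labels (Euclidean distance)."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    clusters = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: score left at 0 by convention
        # a: mean distance to the other members of the point's own cluster.
        a = dist[i, own].sum() / (n_own - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return scores.mean()
```

Values near 1 indicate compact, well-separated clusters after normalization; values near 0 or below suggest that cluster structure (for example, known cell types) is not well preserved.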
Affiliation(s)
- Raquel Cuevas-Diaz Duran
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico.
- Haichao Wei
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
- Jiaqian Wu
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA.
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, 77030, USA.

4
Islam MT, Liu Y, Hassan MM, Abraham PE, Merlet J, Townsend A, Jacobson D, Buell CR, Tuskan GA, Yang X. Advances in the Application of Single-Cell Transcriptomics in Plant Systems and Synthetic Biology. BioDesign Research 2024; 6:0029. [PMID: 38435807 PMCID: PMC10905259 DOI: 10.34133/bdr.0029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 01/28/2024] [Indexed: 03/05/2024] Open
Abstract
Plants are complex systems hierarchically organized and composed of various cell types. To understand the molecular underpinnings of complex plant systems, single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for revealing high resolution of gene expression patterns at the cellular level and investigating the cell-type heterogeneity. Furthermore, scRNA-seq analysis of plant biosystems has great potential for generating new knowledge to inform plant biosystems design and synthetic biology, which aims to modify plants genetically/epigenetically through genome editing, engineering, or re-writing based on rational design for increasing crop yield and quality, promoting the bioeconomy and enhancing environmental sustainability. In particular, data from scRNA-seq studies can be utilized to facilitate the development of high-precision Build-Design-Test-Learn capabilities for maximizing the targeted performance of engineered plant biosystems while minimizing unintended side effects. To date, scRNA-seq has been demonstrated in a limited number of plant species, including model plants (e.g., Arabidopsis thaliana), agricultural crops (e.g., Oryza sativa), and bioenergy crops (e.g., Populus spp.). It is expected that future technical advancements will reduce the cost of scRNA-seq and consequently accelerate the application of this emerging technology in plants. In this review, we summarize current technical advancements in plant scRNA-seq, including sample preparation, sequencing, and data analysis, to provide guidance on how to choose the appropriate scRNA-seq methods for different types of plant samples. We then highlight various applications of scRNA-seq in both plant systems biology and plant synthetic biology research. Finally, we discuss the challenges and opportunities for the application of scRNA-seq in plants.
Affiliation(s)
- Md Torikul Islam
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Yang Liu
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Md Mahmudul Hassan
- Department of Genetics and Plant Breeding, Patuakhali Science and Technology University, Dumki, Patuakhali 8602, Bangladesh
- Paul E. Abraham
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Jean Merlet
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN 37996, USA
- Alice Townsend
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN 37996, USA
- Daniel Jacobson
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- C. Robin Buell
- Center for Applied Genetic Technologies, University of Georgia, Athens, GA 30602, USA
- Department of Crop and Soil Sciences, University of Georgia, Athens, GA 30602, USA
- Institute of Plant Breeding, Genetics, and Genomics, University of Georgia, Athens, GA 30602, USA
- Gerald A. Tuskan
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Xiaohan Yang
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA

5
Hsu CY, Chang CJ, Liu Q, Shyr Y. scKWARN: Kernel-weighted-average robust normalization for single-cell RNA-seq data. Bioinformatics 2024; 40:btae008. [PMID: 38237908 PMCID: PMC10868328 DOI: 10.1093/bioinformatics/btae008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2023] [Revised: 12/05/2023] [Accepted: 01/04/2024] [Indexed: 02/09/2024] Open
Abstract
MOTIVATION Single-cell RNA-seq normalization is an essential step to correct unwanted biases caused by sequencing depth, capture efficiency, dropout, and other technical factors. Existing normalization methods primarily reduce biases arising from sequencing depth by modeling the count-depth relationship and/or assuming a specific distribution for read counts. However, these methods may lead to over- or under-correction due to the presence of technical biases beyond sequencing depth and the restrictive assumptions on models and distributions. RESULTS We present scKWARN, a Kernel Weighted Average Robust Normalization designed to correct known or hidden technical confounders without assuming specific data distributions or count-depth relationships. scKWARN generates a pseudo expression profile for each cell by borrowing information from its fuzzy technical neighbors through a kernel smoother. It then compares this profile against the reference derived from cells with the same bimodality patterns to determine the normalization factor. As demonstrated in both simulated and real datasets, scKWARN outperforms existing methods in removing a variety of technical biases while preserving true biological heterogeneity. AVAILABILITY AND IMPLEMENTATION scKWARN is freely available at https://github.com/cyhsuTN/scKWARN.
Affiliation(s)
- Chih-Yuan Hsu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Chia-Jung Chang
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Department of Biomedical Engineering, National Cheng Kung University, Tainan 701, Taiwan
- Qi Liu
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Yu Shyr
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, United States
- Center for Quantitative Sciences, Vanderbilt University Medical Center, Nashville, TN 37203, United States

6
Paas-Oliveros E, Hernández-Lemus E, de Anda-Jáuregui G. Computational single cell oncology: state of the art. Front Genet 2023; 14:1256991. [PMID: 38028624 PMCID: PMC10663273 DOI: 10.3389/fgene.2023.1256991] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/11/2023] [Accepted: 10/24/2023] [Indexed: 12/01/2023] Open
Abstract
Single cell computational analysis has emerged as a powerful tool in the field of oncology, enabling researchers to decipher the complex cellular heterogeneity that characterizes cancer. By leveraging computational algorithms and bioinformatics approaches, this methodology provides insights into the underlying genetic, epigenetic and transcriptomic variations among individual cancer cells. In this paper, we present a comprehensive overview of single cell computational analysis in oncology, discussing the key computational techniques employed for data processing, analysis, and interpretation. We explore the challenges associated with single cell data, including data quality control, normalization, dimensionality reduction, clustering, and trajectory inference. Furthermore, we highlight the applications of single cell computational analysis, including the identification of novel cell states, the characterization of tumor subtypes, the discovery of biomarkers, and the prediction of therapy response. Finally, we address the future directions and potential advancements in the field, including the development of machine learning and deep learning approaches for single cell analysis. Overall, this paper aims to provide a roadmap for researchers interested in leveraging computational methods to unlock the full potential of single cell analysis in understanding cancer biology with the goal of advancing precision oncology. For this purpose, we also include a notebook that instructs on how to apply the recommended tools in the Preprocessing and Quality Control section.
Affiliation(s)
- Ernesto Paas-Oliveros
- Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Enrique Hernández-Lemus
- Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Guillermo de Anda-Jáuregui
- Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Investigadores por Mexico, Conahcyt, Mexico City, Mexico

7
Chicco D, Ferraro Petrillo U, Cattaneo G. Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment. PLoS Comput Biol 2023; 19:e1011272. [PMID: 37471333 PMCID: PMC10358940 DOI: 10.1371/journal.pcbi.1011272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/22/2023] Open
Abstract
Some scientific studies involve huge amounts of bioinformatics data that cannot be analyzed on personal computers usually employed by researchers for day-to-day activities but rather necessitate effective computational infrastructures that can work in a distributed way. For this purpose, distributed computing systems have become useful tools to analyze large amounts of bioinformatics data and to generate relevant results on virtual environments, where software can be executed for hours or even days without affecting the personal computer or laptop of a researcher. Even if distributed computing resources have become pivotal in multiple bioinformatics laboratories, often researchers and students use them in the wrong ways, making mistakes that can cause the distributed computers to underperform or that can even generate wrong outcomes. In this context, we present here ten quick tips for the usage of Apache Spark distributed computing systems for bioinformatics analyses: ten simple guidelines that, if taken into account, can help users avoid common mistakes and can help them run their bioinformatics analyses smoothly. Even if we designed our recommendations for beginners and students, they should be followed by experts too. We think our quick tips can help anyone make use of Apache Spark distributed computing systems more efficiently and ultimately help generate better, more reliable scientific results.
Affiliation(s)
- Davide Chicco
- Institute of Health Policy Management and Evaluation, University of Toronto, Toronto, Ontario, Canada
- Giuseppe Cattaneo
- Dipartimento di Informatica, Università di Salerno, Fisciano (Salerno), Italy

8
Ahlmann-Eltze C, Huber W. Comparison of transformations for single-cell RNA-seq data. Nat Methods 2023; 20:665-672. [PMID: 37037999 PMCID: PMC10172138 DOI: 10.1038/s41592-023-01814-1] [Citation(s) in RCA: 42] [Impact Index Per Article: 21.0] [Received: 08/25/2021] [Accepted: 02/11/2023] [Indexed: 04/12/2023]
Abstract
The count table, a numeric matrix of genes × cells, is the basic input data structure in the analysis of single-cell RNA-sequencing data. A common preprocessing step is to adjust the counts for variable sampling efficiency and to transform them so that the variance is similar across the dynamic range. These steps are intended to make subsequent application of generic statistical methods more palatable. Here, we describe four transformation approaches based on the delta method, model residuals, inferred latent expression state and factor analysis. We compare their strengths and weaknesses and find that the latter three have appealing theoretical properties; however, in benchmarks using simulated and real-world data, it turns out that a rather simple approach, namely, the logarithm with a pseudo-count followed by principal-component analysis, performs as well as or better than the more sophisticated alternatives. This result highlights limitations of current theoretical analysis as assessed by bottom-line performance benchmarks.
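The simple approach the abstract singles out, a logarithm with a pseudo-count followed by principal-component analysis, can be sketched in a few lines. The version below is only an illustration under stated assumptions: size factors are taken as each cell's total count divided by the mean total, every cell is assumed to have a nonzero total, and the function name `log_pca` and its defaults are hypothetical rather than taken from the paper.

```python
import numpy as np

def log_pca(counts, n_components=2, pseudo_count=1.0):
    """Shifted-log transform followed by PCA for a genes x cells count matrix."""
    counts = np.asarray(counts, dtype=float)
    # Size factor per cell: total count relative to the average total (assumed nonzero).
    totals = counts.sum(axis=0)
    size_factors = totals / totals.mean()
    # Adjust for sampling efficiency, then apply log with a pseudo-count.
    logged = np.log(counts / size_factors + pseudo_count)
    # Center each gene across cells so PCA captures variation, not the mean.
    centered = logged - logged.mean(axis=1, keepdims=True)
    # PCA via SVD with cells as observations (rows) and genes as features.
    U, S, _ = np.linalg.svd(centered.T, full_matrices=False)
    return U[:, :n_components] * S[:n_components]   # cells x n_components scores
```

The returned matrix holds one row of principal-component scores per cell and can be passed to generic clustering or nearest-neighbor methods.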
Affiliation(s)
- Constantin Ahlmann-Eltze
- Genome Biology Unit, EMBL, Heidelberg, Germany.
- Faculty of Biosciences, Heidelberg University, Heidelberg, Germany.

9
Lazzardi S, Valle F, Mazzolini A, Scialdone A, Caselle M, Osella M. Emergent statistical laws in single-cell transcriptomic data. Phys Rev E 2023; 107:044403. [PMID: 37198814 DOI: 10.1103/physreve.107.044403] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2022] [Accepted: 03/24/2023] [Indexed: 05/19/2023]
Abstract
Large-scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology, or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, treatable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
Affiliation(s)
- Silvia Lazzardi
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Filippo Valle
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Andrea Mazzolini
- Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université and Université de Paris, 75005 Paris, France
- Antonio Scialdone
- Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Feodor-Lynen-Straße 21, 81377 München, Germany and Institute of Functional Epigenetics and Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
- Michele Caselle
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Matteo Osella
- Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy

10
Ke M, Elshenawy B, Sheldon H, Arora A, Buffa FM. Single cell RNA-sequencing: A powerful yet still challenging technology to study cellular heterogeneity. Bioessays 2022; 44:e2200084. [PMID: 36068142 DOI: 10.1002/bies.202200084] [Citation(s) in RCA: 32] [Impact Index Per Article: 10.7] [Received: 04/29/2022] [Revised: 08/18/2022] [Accepted: 08/19/2022] [Indexed: 11/11/2022]
Abstract
Almost all biomedical research to date has relied upon mean measurements from cell populations; however, it is well established that what is observed at this macroscopic level can be the result of many interactions of several different single cells. Thus, the observable macroscopic 'average' cannot outright be used as representative of the 'average cell'. Rather, it is the resulting emerging behaviour of the actions and interactions of many different cells. Single-cell RNA sequencing (scRNA-Seq) enables the comparison of the transcriptomes of individual cells. This provides high-resolution maps of the dynamic cellular programmes, allowing us to answer fundamental biological questions on their function and evolution. It also allows us to address medical questions such as the role of rare cell populations contributing to disease progression and therapeutic resistance. Furthermore, it provides an understanding of context-specific dependencies, namely the behaviour and function that a cell has in a specific context, which can be crucial to understand some complex diseases, such as diabetes, cardiovascular disease and cancer. Here, we provide an overview of scRNA-Seq, including a comparative review of emerging technologies and computational pipelines. We discuss the current and emerging applications and focus on tumour heterogeneity, a clear example of how scRNA-Seq can provide new understanding of a complex disease. Additionally, we review the limitations and highlight the need for powerful computational pipelines and reproducible protocols for the broader acceptance of this technique in basic and clinical research.
Affiliation(s)
- May Ke
  - Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
- Badran Elshenawy
  - Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
- Helen Sheldon
  - Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
- Anjali Arora
  - Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
- Francesca M Buffa
  - Department of Oncology, Medical Sciences Division, University of Oxford, Oxford, UK
  - Department of Computing Sciences, Bocconi University, Milano, Italy
  - Institute for Data Science and Analytics, Bocconi University, Milano, Italy
|
11
|
Affiliation(s)
- Greg Gibson
  - School of Biological Sciences and Center for Integrative Genomics, Georgia Institute of Technology, Atlanta, Georgia, United States of America
|
12
|
Choudhary S, Satija R. Comparison and evaluation of statistical error models for scRNA-seq. Genome Biol 2022; 23:27. [PMID: 35042561 PMCID: PMC8764781 DOI: 10.1186/s13059-021-02584-9] [Citation(s) in RCA: 268] [Impact Index Per Article: 89.3] [Received: 07/19/2021] [Accepted: 12/20/2021] [Indexed: 01/31/2023] Open
Abstract
BACKGROUND Heterogeneity in single-cell RNA-seq (scRNA-seq) data is driven by multiple sources, including biological variation in cellular state as well as technical variation introduced during experimental processing. Deconvolving these effects is a key challenge for preprocessing workflows. Recent work has demonstrated the importance and utility of count models for scRNA-seq analysis, but there is a lack of consensus on which statistical distributions and parameter settings are appropriate. RESULTS Here, we analyze 59 scRNA-seq datasets that span a wide range of technologies, systems, and sequencing depths in order to evaluate the performance of different error models. We find that while a Poisson error model appears appropriate for sparse datasets, we observe clear evidence of overdispersion for genes with sufficient sequencing depth in all biological systems, necessitating the use of a negative binomial model. Moreover, we find that the degree of overdispersion varies widely across datasets, systems, and gene abundances, which argues for a data-driven approach to parameter estimation. CONCLUSIONS Based on these analyses, we provide a set of recommendations for modeling variation in scRNA-seq data, particularly when using generalized linear models or likelihood-based approaches for preprocessing and downstream analysis.
Affiliation(s)
- Saket Choudhary
  - New York Genome Center, 101 Avenue of the Americas, New York, 10013 USA
- Rahul Satija
  - New York Genome Center, 101 Avenue of the Americas, New York, 10013 USA
  - Center for Genomics and Systems Biology, New York University, 12 Waverly Pl, New York, 10003 USA
|
13
|
Kiaee K, Jodat YA, Bassous NJ, Matharu N, Shin SR. Transcriptomic Mapping of Neural Diversity, Differentiation and Functional Trajectory in iPSC-Derived 3D Brain Organoid Models. Cells 2021; 10:3422. [PMID: 34943930 PMCID: PMC8700452 DOI: 10.3390/cells10123422] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/27/2021] [Revised: 11/26/2021] [Accepted: 11/27/2021] [Indexed: 11/17/2022] Open
Abstract
Experimental models of the central nervous system (CNS) are imperative for developmental and pathophysiological studies of neurological diseases. Among these models, three-dimensional (3D) induced pluripotent stem cell (iPSC)-derived brain organoid models have been successful in mitigating some of the drawbacks of 2D models; however, they are plagued by high organoid-to-organoid variability, making it difficult to compare specific gene regulatory pathways across 3D organoids with those of the native brain. Single-cell RNA sequencing (scRNA-seq) transcriptome datasets have recently emerged as powerful tools to perform integrative analyses and compare variability across organoids. However, transcriptome studies focusing on late-stage neural functionality development have been underexplored. Here, we combine and analyze 8 brain organoid transcriptome databases to study the correlation between differentiation protocols and their resulting cellular functionality across various 3D organoid and exogenous brain models. We utilize dimensionality reduction methods including principal component analysis (PCA) and uniform manifold approximation and projection (UMAP) to identify and visualize cellular diversity among 3D models and subsequently use gene set enrichment analysis (GSEA) and developmental trajectory inference to quantify neuronal behaviors such as axon guidance, synapse transmission and action potential. We showed high similarity in cellular composition, cellular differentiation pathways and expression of functional genes in human brain organoids during induction and differentiation phases, i.e., up to 3 months in culture. However, during the maturation phase, i.e., the 6-month timepoint, we observed significant developmental deficits and depletion of neuronal and astrocyte functional genes, as indicated by our GSEA results. Our results caution against use of organoids to model pathophysiology and drug response at this advanced time point and provide insights to tune in vitro iPSC differentiation protocols to achieve desired neuronal functionality and improve current protocols.
Affiliation(s)
- Kiavash Kiaee
  - Division of Engineering in Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Cambridge, MA 02139, USA
  - Department of Mechanical Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
- Yasamin A. Jodat
  - Division of Engineering in Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Cambridge, MA 02139, USA
  - Department of Mechanical Engineering, Stevens Institute of Technology, Hoboken, NJ 07030, USA
- Nicole J. Bassous
  - Division of Engineering in Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Cambridge, MA 02139, USA
- Navneet Matharu
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA 94143, USA
  - Institute for Human Genetics, University of California San Francisco, San Francisco, CA 94143, USA
  - Innovative Genomics Institute, University of California San Francisco, San Francisco, CA 94720, USA
- Su Ryon Shin
  - Division of Engineering in Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Cambridge, MA 02139, USA
|