1
van der Ploeg GR, Westerhuis JA, Heintz-Buschart A, Smilde AK. parafac4microbiome: exploratory analysis of longitudinal microbiome data using parallel factor analysis. mSystems 2025:e0047225. [PMID: 40396737 DOI: 10.1128/msystems.00472-25] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2025] [Accepted: 04/16/2025] [Indexed: 05/22/2025] Open
Abstract
Studies investigating microbial temporal dynamics are increasingly common, leveraging longitudinal designs that collect microbial abundance data across multiple time points from the same subjects. Traditional exploratory approaches like principal component analysis fail to fully utilize this structure. By organizing data as a three-way array (subjects as rows, microbial abundances as columns, and time points as the third dimension), multi-way methods such as parallel factor analysis (PARAFAC) can better capture temporal and structural patterns. This study demonstrates PARAFAC as a method to explore longitudinal microbiome data using three exemplary studies. In the first example, a long time series of in vitro microbiomes, PARAFAC identifies primary time-resolved variations. The second example, a longitudinal infant gut microbiome study, shows that PARAFAC can distinguish subject groups and enhance comparative analysis, even with moderate missing data. In the third example, a gingivitis intervention study of the oral microbiome, PARAFAC enables the identification of microbial subcommunities of interest through post-hoc clustering. These examples highlight PARAFAC's broad applicability for analyzing longitudinal microbiome data across diverse environments. The approach is implemented in the R package parafac4microbiome, available on the Comprehensive R Archive Network (CRAN), providing researchers with accessible tools for similar analyses.
IMPORTANCE Understanding how microbiomes change over time can give us valuable insights into their role in health and disease. Many traditional methods like principal component analysis miss important patterns in data collected over time, but parallel factor analysis (PARAFAC) helps uncover these trends in a much clearer way. Using this approach, we were able to identify key changes in microbiomes across different settings, like lab experiments, the infant gut, and the mouth. PARAFAC also works well even when some data are missing, which is a common issue. To make this tool accessible, we have included it in a user-friendly R package, enabling other researchers to analyze microbiome dynamics in their own studies and explore how these changes might influence health and treatments.
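The three-way layout described in this abstract can be illustrated with a short sketch. It shows only the standard PARAFAC model, in which each entry of the subjects x taxa x time array is a sum over components of the product of a subject loading, a taxon loading, and a time loading. This is not the parafac4microbiome API (that package is written in R); the function name `parafac_reconstruct` and the toy loadings are hypothetical and chosen purely for illustration.

```python
from typing import List

Matrix = List[List[float]]


def parafac_reconstruct(A: Matrix, B: Matrix, C: Matrix) -> List[Matrix]:
    """Rebuild a subjects x taxa x time array from PARAFAC loadings.

    A[i][f], B[j][f], and C[k][f] are the subject, taxon, and time loadings
    of component f. The model is x[i][j][k] = sum_f A[i][f] * B[j][f] * C[k][f].
    """
    n_comp = len(A[0])
    return [
        [
            [sum(a[f] * b[f] * c[f] for f in range(n_comp)) for c in C]
            for b in B
        ]
        for a in A
    ]


# Toy loadings (hypothetical): 2 subjects, 3 taxa, 4 time points, 1 component.
A = [[1.0], [2.0]]                 # subject 2 expresses the pattern twice as strongly
B = [[0.5], [1.0], [0.0]]          # taxon 3 does not take part in the pattern
C = [[1.0], [2.0], [3.0], [4.0]]   # the pattern grows over time
X = parafac_reconstruct(A, B, C)
```

In a real analysis the loadings are unknown and are estimated from the data, typically by alternating least squares; the sketch only shows how one component ties together a subject effect, a set of taxa, and a time profile.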
Affiliation(s)
- G R van der Ploeg
  - Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, the Netherlands
- J A Westerhuis
  - Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, the Netherlands
- A Heintz-Buschart
  - Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, the Netherlands
- A K Smilde
  - Biosystems Data Analysis, Swammerdam Institute for Life Sciences, University of Amsterdam, Amsterdam, the Netherlands
2
Sankowski R, Prinz M. A dynamic and multimodal framework to define microglial states. Nat Neurosci 2025:10.1038/s41593-025-01978-3. [PMID: 40394327 DOI: 10.1038/s41593-025-01978-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2024] [Accepted: 04/22/2025] [Indexed: 05/22/2025]
Abstract
The widespread use of single-cell RNA sequencing has generated numerous purportedly distinct and novel subsets of microglia. Here, we challenge this fragmented paradigm by proposing that microglia exist along a continuum rather than as discrete entities. We identify a methodological over-reliance on computational clustering algorithms as the fundamental issue, with arbitrary cluster numbers being interpreted as biological reality. Evidence suggests that the observed transcriptional diversity stems from a combination of microglial plasticity and technical noise, resulting in terminology describing largely overlapping cellular states. We introduce a continuous model of microglial states, where cell positioning along the continuum is determined by biological aging and cell-specific molecular contexts. The model accommodates the dynamic nature of microglia. We advocate for a parsimonious approach toward classification and terminology that acknowledges the continuous spectrum of microglial states, toward a robust framework for understanding these essential immune cells of the CNS.
Affiliation(s)
- Roman Sankowski
  - Institute of Neuropathology, Faculty of Medicine, University of Freiburg, Freiburg, Germany
- Marco Prinz
  - Institute of Neuropathology, Faculty of Medicine, University of Freiburg, Freiburg, Germany
  - Signalling Research Centres BIOSS and CIBSS, University of Freiburg, Freiburg, Germany
3
Pavel A, Grønberg MG, Clemmensen LH. The impact of dropouts in scRNAseq dense neighborhood analysis. Comput Struct Biotechnol J 2025; 27:1278-1285. [PMID: 40225837 PMCID: PMC11992407 DOI: 10.1016/j.csbj.2025.03.033] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/01/2024] [Revised: 03/19/2025] [Accepted: 03/20/2025] [Indexed: 04/15/2025] Open
Abstract
Single cell RNA sequencing (scRNAseq) provides the possibility to investigate transcriptomic profiles on a single cell level. However, the data pose unique challenges in comparison to bulk transcriptomic data, one being high dropout rates, which yield highly sparse data. Many classical analysis and preprocessing pipelines are based on the assumption that poor data can be counteracted by quantity and that similar cells (samples) are close to each other in space. Clustering is commonly used to detect clusters (dense local cell neighborhoods) under the assumption that similar cells are close to each other in space (where "close" depends on the distance metric used). The most commonly used clustering methodologies to detect dense local neighborhoods are based on graph clustering on a nearest neighbor graph. However, high dropout rates may break this assumption and make it difficult to reliably detect such dense local neighborhoods. We assess the cluster homogeneity and stability under increasing degrees of dropouts in one of the most popular clustering pipelines (dimensionality reduction + graph based clustering), as provided by the scRNAseq analysis packages Seurat and Scanpy. Our study showcases that while the default pipeline performs well in terms of cluster homogeneity (i.e., cells in a cluster are of the same type), even with increasing dropout rates, the stability of clusters (i.e., cell pairs consistently being in the same cluster) decreases. This implies that sub-populations within cell types are increasingly difficult to identify under increasing dropout rates because observations are not consistently close. Our results challenge the current practice of using default clustering pipelines and the general assumption of identifiable local neighborhoods on high dropout data. Hence, these results suggest that careful consideration in interpretation and downstream analysis is needed when relying on local neighborhoods and clusters on scRNAseq data. In addition, these results call for extensive benchmarking, to identify and provide methods robust in their local neighborhood relationships on data containing low to high dropout rates.
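The two quantities this abstract contrasts, homogeneity (cells in a cluster share a type) and stability (cell pairs stay in the same cluster), can be made concrete with a small sketch. The definitions below are simple stand-ins, assumed for illustration: homogeneity is computed as cluster purity and stability as the fraction of co-clustered pairs that remain together in a second run. The paper's exact metrics may differ, and the function names and toy labels are hypothetical.

```python
from collections import Counter
from itertools import combinations
from typing import Sequence


def homogeneity(labels: Sequence[int], cell_types: Sequence[str]) -> float:
    """Share of cells whose type matches the majority type of their cluster."""
    by_cluster = {}
    for label, cell_type in zip(labels, cell_types):
        by_cluster.setdefault(label, Counter())[cell_type] += 1
    majority_total = sum(max(counts.values()) for counts in by_cluster.values())
    return majority_total / len(labels)


def pair_stability(labels_a: Sequence[int], labels_b: Sequence[int]) -> float:
    """Fraction of cell pairs co-clustered in run A that stay together in run B."""
    if len(labels_a) != len(labels_b):
        raise ValueError("both runs must label the same cells")
    together_a = 0
    together_both = 0
    for i, j in combinations(range(len(labels_a)), 2):
        if labels_a[i] == labels_a[j]:
            together_a += 1
            if labels_b[i] == labels_b[j]:
                together_both += 1
    return together_both / together_a if together_a else 1.0


# Toy example (hypothetical): six cells of two types, clustered twice.
cell_types = ["T", "T", "T", "B", "B", "B"]
run_low_dropout = [0, 0, 0, 1, 1, 1]
run_high_dropout = [0, 0, 2, 1, 1, 3]   # each type fragments into two clusters

purity = homogeneity(run_high_dropout, cell_types)
stability = pair_stability(run_low_dropout, run_high_dropout)
```

In this toy case every cluster is still pure, so homogeneity stays at its maximum, while only a third of the originally co-clustered pairs remain together. That is the pattern the abstract describes: homogeneous clusters can coexist with unstable neighborhoods.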
Affiliation(s)
- Alisa Pavel
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
- Manja Gersholm Grønberg
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
- Line H. Clemmensen
  - Department of Applied Mathematics and Computer Science, Technical University of Denmark, 2800, Kongens Lyngby, Denmark
  - Department of Mathematical Sciences, University of Copenhagen, 2100, Copenhagen, Denmark
4
Han X, Song K. TphPMF: A microbiome data imputation method using hierarchical Bayesian Probabilistic Matrix Factorization. PLoS Comput Biol 2025; 21:e1012858. [PMID: 40067818 PMCID: PMC11957397 DOI: 10.1371/journal.pcbi.1012858] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2024] [Revised: 03/31/2025] [Accepted: 02/07/2025] [Indexed: 04/02/2025] Open
Abstract
In microbiome research, data sparsity represents a prevalent and formidable challenge. Sparse data not only compromises the accuracy of statistical analyses but also conceals critical biological relationships, thereby undermining the reliability of the conclusions. To tackle this issue, we introduce a machine learning approach for microbiome data imputation, termed TphPMF. This technique leverages Probabilistic Matrix Factorization, incorporating phylogenetic relationships among microorganisms to establish Bayesian prior distributions. These priors facilitate posterior predictions of potential non-biological zeros. We demonstrate that TphPMF outperforms existing microbiome data imputation methods in accurately recovering missing taxon abundances. Furthermore, TphPMF enhances the efficacy of certain differential abundance analysis methods in detecting differentially abundant (DA) taxa, particularly showing advantages when used in conjunction with DESeq2-phyloseq. Additionally, TphPMF significantly improves the precision of cross-predicting disease conditions in microbiome datasets pertaining to type 2 diabetes and colorectal cancer.
Affiliation(s)
- Xinyu Han
  - School of Mathematics and Statistics, Qingdao University, Qingdao, China
- Kai Song
  - School of Mathematics and Statistics, Qingdao University, Qingdao, China
5
Fan Z, Lv J, Zhang S, Gu B, Wang C, Zhang T. ISCAZIM: Integrated statistical correlation analysis for zero-inflated microbiome data. Heliyon 2025; 11:e41184. [PMID: 39811376 PMCID: PMC11730854 DOI: 10.1016/j.heliyon.2024.e41184] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2024] [Revised: 12/05/2024] [Accepted: 12/11/2024] [Indexed: 01/16/2025] Open
Abstract
Microbiome-metabolome association analysis is critical to reveal the key pairs of gut microbiota and metabolites for discovery of microbial biomarkers in chronic diseases. However, characteristics of microbiome data such as zero inflation and overdispersion may impair the confidence of association analysis between microbiome and metabolome data. The objectives of this study are to evaluate the strengths and weaknesses of existing statistical methods and to develop a computational framework tailored to the unique characteristics of microbiome data. We designed a computational framework called Integrated Statistical Correlation Analysis for Zero-Inflated Microbiome data (ISCAZIM) that takes account of complicated microbiome data characteristics, including zero inflation rates (ZIRs), dispersion, and correlation patterns. ISCAZIM first benchmarks prevalent statistical correlation methods: Pearson, Spearman, the zero-inflated negative binomial (ZINB) model, mutual information, and the Maximal Information Coefficient. ISCAZIM then classifies the correlation pattern as linear or non-linear and applies the correlation method according to the ZIR status. Applied to multiple real-world microbiome-metabolomics datasets, ISCAZIM is overall more accurate than any single method, recovering more truly significant association pairs. Therefore, ISCAZIM will significantly facilitate association analysis using zero-inflated microbiome data for multi-omics integration.
Affiliation(s)
- Zhe Fan
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250012, China
- Jiali Lv
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250012, China
- Shuai Zhang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250012, China
- Bingbing Gu
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250012, China
- Cheng Wang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250012, China
- Tao Zhang
  - Department of Biostatistics, School of Public Health, Cheeloo College of Medicine, Shandong University, Jinan, 250012, China
  - National Institute of Health Data Science of China, Shandong University, Jinan, 250012, China
6
Karwowska Z, Aasmets O, Kosciolek T, Org E. Effects of data transformation and model selection on feature importance in microbiome classification data. Microbiome 2025; 13:2. [PMID: 39754220 PMCID: PMC11699698 DOI: 10.1186/s40168-024-01996-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/14/2024] [Accepted: 12/04/2024] [Indexed: 01/06/2025]
Abstract
BACKGROUND Accurate classification of host phenotypes from microbiome data is crucial for advancing microbiome-based therapies, with machine learning offering effective solutions. However, the complexity of the gut microbiome, data sparsity, compositionality, and population-specificity present significant challenges. Microbiome data transformations can alleviate some of the aforementioned challenges, but their usage in machine learning tasks has largely been unexplored.
RESULTS Our analysis of over 8500 samples from 24 shotgun metagenomic datasets showed that it is possible to classify healthy and diseased individuals using microbiome data with minimal dependence on the choice of algorithm or transformation. Presence-absence transformations performed comparably to abundance-based transformations, and only a small subset of predictors is necessary for accurate classification. However, while different transformations resulted in comparable classification performance, the most important features varied significantly, which highlights the need to reevaluate machine learning-based biomarker detection.
CONCLUSIONS Microbiome data transformations can significantly influence feature selection but have a limited effect on classification accuracy. Our findings suggest that while classification is robust across different transformations, the variation in feature selection necessitates caution when using machine learning for biomarker identification. This research provides valuable insights for applying machine learning to microbiome data and identifies important directions for future work.
Affiliation(s)
- Zuzanna Karwowska
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Doctoral School of Exact and Natural Sciences, Jagiellonian University, Krakow, Poland
  - Sano Centre for Computational Medicine, Krakow, Poland
- Oliver Aasmets
  - Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
- Tomasz Kosciolek
  - Małopolska Centre of Biotechnology, Jagiellonian University, Krakow, Poland
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice, Poland
  - Sano Centre for Computational Medicine, Krakow, Poland
- Elin Org
  - Estonian Genome Centre, Institute of Genomics, University of Tartu, Tartu, Estonia
7
Kohnert E, Kreutz C. Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data. F1000Res 2025; 13:1180. [PMID: 39866725 PMCID: PMC11757917 DOI: 10.12688/f1000research.155230.2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 12/19/2024] [Indexed: 01/28/2025] Open
Abstract
Background Synthetic data's utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), which assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning.
Methods We replicate the methodology of Nearing et al. (1), incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results.
Conclusions Synthetic data enable the validation of findings through controlled experiments. We assess how well synthetic data replicate experimental data, attempt to validate previous findings with the most recent versions of the DA methods, and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
Affiliation(s)
- Eva Kohnert
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany
- Clemens Kreutz
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany
8
Huang J, Lu Y, Tian F, Ni Y. Association of body index with fecal microbiome in children cohorts with ethnic-geographic factor interaction: accurately using a Bayesian zero-inflated negative binomial regression model. mSystems 2024; 9:e0134524. [PMID: 39570024 DOI: 10.1128/msystems.01345-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2024] [Accepted: 10/24/2024] [Indexed: 11/22/2024] Open
Abstract
The exponential growth of high-throughput sequencing (HTS) data on microbial communities presents researchers with an unparalleled opportunity to delve deeper into the association of microorganisms with host phenotype. However, this growth also poses a challenge, as microbial data are complex, sparse, discrete, and prone to zero inflation. Herein, using 10 distinct count models to analyze simulated data, we propose an innovative Bayesian zero-inflated negative binomial (ZINB) regression model that is capable of identifying differentially abundant taxa associated with distinctive host phenotypes and quantifying the effects of covariates on these taxa. Our proposed model exhibits excellent accuracy compared with conventional Hurdle and INLA models, especially in scenarios characterized by inflation and overdispersion. Moreover, we confirm that dispersion parameters significantly affect the accuracy of model results, with defects gradually alleviating as the number of analyzed samples increases. Subsequently applying our model to amplicon data from a real multi-ethnic children's cohort, we found that only a subset of taxa were identified as having zero inflation in real data, suggesting that the prevailing understanding and processing of microbial count data in most previous microbiome studies were overly dogmatic. In practice, our pipeline of integrating bacterial differential abundance in microbiome data and relevant covariates is effective and feasible. Taken together, our method is expected to be extended to microbiota studies of various multi-cohort populations.
IMPORTANCE The microbiome is closely associated with physical indicators of the body, such as height, weight, age, and BMI, which can be used as measures of human health. Accurately identifying which taxa in the microbiome are closely related to indicators of physical development is valuable, as these taxa can serve as microbial markers of regional child growth trajectories. The zero-inflated negative binomial (ZINB) model, a type of Bayesian generalized linear model, can effectively model complex biological systems. We present an innovative ZINB regression model that is capable of identifying differentially abundant taxa associated with distinctive host phenotypes and quantifying the effects of covariates on these taxa, and demonstrate that its accuracy is superior to traditional Hurdle and INLA models. Our pipeline of integrating bacterial differential abundance in microbiome data and relevant covariates is effective and feasible.
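For readers unfamiliar with the zero-inflated negative binomial distribution named in this abstract, the sketch below writes out its probability mass function. It covers only the standard ZINB likelihood under a mean/dispersion parameterization, which is an assumption made for illustration; it does not reproduce the authors' Bayesian regression model, priors, or covariate structure. The parameter names `mu`, `size`, and `pi` are illustrative choices.

```python
import math


def nb_pmf(y: int, mu: float, size: float) -> float:
    """Negative binomial pmf with mean mu > 0 and dispersion parameter size > 0.

    Variance is mu + mu**2 / size, so a smaller size means more overdispersion.
    """
    log_p = (
        math.lgamma(y + size) - math.lgamma(size) - math.lgamma(y + 1)
        + size * math.log(size / (size + mu))
        + y * math.log(mu / (size + mu))
    )
    return math.exp(log_p)


def zinb_pmf(y: int, mu: float, size: float, pi: float) -> float:
    """ZINB pmf: a structural zero with probability pi, otherwise an NB count."""
    nb_part = (1.0 - pi) * nb_pmf(y, mu, size)
    return pi + nb_part if y == 0 else nb_part


# Toy parameters (hypothetical): mean count 5, strong overdispersion,
# and 30% of observations being structural (excess) zeros.
p_zero_nb = nb_pmf(0, mu=5.0, size=2.0)
p_zero_zinb = zinb_pmf(0, mu=5.0, size=2.0, pi=0.3)
total_mass = sum(zinb_pmf(y, mu=5.0, size=2.0, pi=0.3) for y in range(500))
```

The zero probability under ZINB exceeds that of the plain negative binomial with the same mean and dispersion, which is what "zero inflation" refers to. Setting `pi` to 0 recovers the ordinary negative binomial, consistent with the abstract's finding that only a subset of taxa needs the inflation component.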
Affiliation(s)
- Jian Huang
  - School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
  - Key Laboratory of Xinjiang Special Probiotics and Dairy Technology, Shihezi University, Shihezi, Xinjiang, China
- Yanzhuan Lu
  - School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
  - Key Laboratory of Xinjiang Special Probiotics and Dairy Technology, Shihezi University, Shihezi, Xinjiang, China
- Fengwei Tian
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, Jiangsu, China
  - School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu, China
- Yongqing Ni
  - School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
  - Key Laboratory of Xinjiang Special Probiotics and Dairy Technology, Shihezi University, Shihezi, Xinjiang, China
9
Yen PL, Lin TA, Chang CH, Yu CW, Kuo YH, Chang TT, Liao VHC. Di(2-ethylhexyl) phthalate disrupts circadian rhythm associated with changes in metabolites and cytochrome P450 gene expression in Caenorhabditis elegans. Environmental Pollution (Barking, Essex: 1987) 2024; 363:125062. [PMID: 39366446 DOI: 10.1016/j.envpol.2024.125062] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/16/2024] [Revised: 09/17/2024] [Accepted: 10/01/2024] [Indexed: 10/06/2024]
Abstract
The plasticizer di(2-ethylhexyl) phthalate (DEHP) is a widespread environmental pollutant due to its extensive use. While circadian rhythms are inherent in most living organisms, the detrimental effects of DEHP on circadian rhythm and the underlying mechanisms remain largely unknown. This study investigated the influence of early developmental exposure to DEHP on circadian rhythm and explored the possible relationship between circadian disruption and DEHP metabolism in the model organism Caenorhabditis elegans. We observed that DEHP disrupted circadian rhythm in a dose-dependent fashion. Liquid chromatography-tandem mass spectrometry (LC-MS/MS) analysis revealed that DEHP-induced circadian disruption is accompanied by altered proportions of DEHP metabolites in C. elegans. RNA sequencing data demonstrated that DEHP-induced circadian rhythm disruption caused differential gene expression. Moreover, DEHP-induced circadian disruption coincided with attenuated induction of the DEHP-induced cytochrome P450 genes cyp-35A2, cyp-35A3, and cyp-35A4. Notably, cyp-35A2 mRNA exhibited circadian rhythm with entrainment, but DEHP exposure disrupted this rhythm. Our findings suggest that DEHP exposure disrupts circadian rhythm, which is associated with changes in DEHP metabolites and cytochrome P450 gene expression in C. elegans. Given the ubiquitous nature of DEHP pollution and the prevalence of circadian rhythms in living organisms, this study implies a potential negative impact of DEHP on circadian rhythm and DEHP metabolism in organisms.
Affiliation(s)
- Pei-Ling Yen
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
- Ting-An Lin
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
- Chun-Han Chang
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
- Chan-Wei Yu
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
- Yu-Hsuan Kuo
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
- Tzu-Ting Chang
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
- Vivian Hsiu-Chuan Liao
  - Department of Bioenvironmental Systems Engineering, National Taiwan University, Taipei 106, Taiwan
10
Kim H, Siddiqui N, Karstens L, Ma L. A negative binomial latent factor model for paired microbiome sequencing data. bioRxiv: the preprint server for biology 2024:2024.12.01.626246. [PMID: 39677741 PMCID: PMC11642826 DOI: 10.1101/2024.12.01.626246] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Motivation Microbiome compositional data are often collected from several body sites and exhibit dependency among them. Analyzing microbial compositions from different sites jointly allows for effective borrowing of information by exploiting the underlying cross-site correlation, which can lead to more effective statistical analysis, especially when the sample size at one or both sites is limited. To this end, we introduce a joint model for microbiome compositions at two (or more) sites within the same subjects. Our model incorporates (i) latent factors shared across two body sites to explain the common subject effects and to serve as the source of correlation between the two sites; and (ii) mixtures of latent factors to allow heterogeneity among the samples in their level of cross-site association. The model is illustrated with synthetic data and we apply it in a case study involving samples of the urinary and vaginal microbiome collected from women.
Results Simulation studies show how common subject effects influence regression analysis results; a stronger association between two sites in the data causes a greater degree of bias in the analysis. The model with latent factors mitigates the bias present in the model without latent factors, whereas the two models perform comparably for the data set without paired associations. In a case study involving samples collected from a study on the female urogenital microbiome with aging (e.g., the UMICRO study), our model leads to the detection of covariate associations of the vaginal and urinary microbiome composition that are otherwise not statistically significant under a similar regression model applied to the two sites separately. Our model also enables prediction of the microbial abundance at one site based on observations from another site. We also consider a model extension that allows the clustering of subjects (samples) and cluster-specific levels of paired association. Under the extended modeling framework, the clusters can be classified according to their association strengths.
11
Zhang Y, Schluter J, Zhang L, Cao X, Jenq RR, Feng H, Haines J, Zhang L. Review and revamp of compositional data transformation: A new framework combining proportion conversion and contrast transformation. Comput Struct Biotechnol J 2024; 23:4088-4107. [PMID: 39624165 PMCID: PMC11609487 DOI: 10.1016/j.csbj.2024.11.003] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/01/2024] [Revised: 11/01/2024] [Accepted: 11/02/2024] [Indexed: 01/03/2025] Open
Abstract
Due to the development of next-generation sequencing technology and an increased appreciation of its role in modulating host immunity and its potential as a therapeutic agent, the human microbiome has emerged as a key area of interest in various biological investigations of human health and disease. However, microbiome data present a number of statistical challenges not addressed by existing methods, such as varying sequencing depth, compositionality, and zero inflation. Solutions like scaling and transformation methods help to mitigate heterogeneity and release constraints, but often introduce biases and yield inconsistent results on the same data. To address these issues, we conduct a systematic review of compositional data transformation, with a particular focus on the connections and distinctions among existing techniques. Additionally, we create a new framework that enables the development of new transformations by combining proportion conversion with contrast transformations. This framework includes well-known methods such as Additive Log Ratio (ALR) and Centered Log Ratio (CLR) as special cases. Using this framework, we develop two novel transformations, Centered Arcsine Contrast (CAC) and Additive Arcsine Contrast (AAC), which show enhanced performance in scenarios with high zero-inflation. Moreover, our findings suggest that ALR and CLR transformations are more effective when zero values are less prevalent. This comprehensive review and the innovative framework provide microbiome researchers with a significant direction to enhance data transformation procedures and improve analytical outcomes.
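The two established transformations this abstract names as special cases, ALR and CLR, follow a two-step pattern (convert counts to proportions, then take a contrast) that can be sketched briefly. The definitions of ALR and CLR below are the standard ones; the pseudocount used to avoid taking the logarithm of zero is an assumption of this sketch, not something the abstract specifies. The arcsine-based CAC and AAC transformations are not reproduced because the abstract does not give their formulas.

```python
import math
from typing import List, Sequence


def to_proportions(counts: Sequence[float], pseudocount: float = 0.5) -> List[float]:
    """Step 1: add a pseudocount (assumed, to avoid log(0)) and normalize to sum 1."""
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    return [s / total for s in shifted]


def clr(props: Sequence[float]) -> List[float]:
    """Centered log ratio: log of each part minus the mean of the logs."""
    logs = [math.log(p) for p in props]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]


def alr(props: Sequence[float], ref: int = -1) -> List[float]:
    """Additive log ratio: log of each non-reference part over the reference part."""
    ref = ref % len(props)
    return [math.log(p / props[ref]) for i, p in enumerate(props) if i != ref]


counts = [10, 0, 30, 60]  # toy taxon counts for one sample (hypothetical)
props = to_proportions(counts)
clr_values = clr(props)   # same length as the input, values sum to zero
alr_values = alr(props)   # one value fewer: the reference taxon is dropped
```

The zero count in the toy sample shows why zero handling matters: without the pseudocount both log-ratio transformations would be undefined for that taxon, which is the situation that motivates alternatives for highly zero-inflated data.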
Affiliation(s)
- Yiqian Zhang
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, 2109 Adelbert Rd, Cleveland, 44106, OH, USA
  - Department of Statistics, University of Illinois Urbana-Champaign, 605 E. Springfield Ave., Champaign, 61820, IL, USA
- Jonas Schluter
  - Institute for Systems Genetics, Department of Microbiology, New York University Grossman School of Medicine, 435 East 30th Street, New York, 10016, NY, USA
- Lijun Zhang
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, 2109 Adelbert Rd, Cleveland, 44106, OH, USA
- Xuan Cao
  - Division of Statistics and Data Science, Department of Mathematical Sciences, University of Cincinnati, 2815 Commons Way, Cincinnati, 45219, OH, USA
- Robert R. Jenq
  - Department of Hematology & Hematopoietic Cell Transplantation, City of Hope, 1500 East Duarte Road, Duarte, 91010, CA, USA
- Hao Feng
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, 2109 Adelbert Rd, Cleveland, 44106, OH, USA
- Jonathan Haines
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, 2109 Adelbert Rd, Cleveland, 44106, OH, USA
- Liangliang Zhang
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, 2109 Adelbert Rd, Cleveland, 44106, OH, USA
  - Case Comprehensive Cancer Center, 2103 Cornell Road, Cleveland, 44106, OH, USA
12
|
Li D, Mei Q, Li G. scQA: A dual-perspective cell type identification model for single cell transcriptome data. Comput Struct Biotechnol J 2024; 23:520-536. [PMID: 38235363 PMCID: PMC10791572 DOI: 10.1016/j.csbj.2023.12.021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/13/2023] [Revised: 12/16/2023] [Accepted: 12/18/2023] [Indexed: 01/19/2024] Open
Abstract
Single-cell RNA sequencing technologies have been pivotal in advancing the development of algorithms for clustering heterogeneous cell populations. Existing methods that use scRNA-seq data to identify cell types tend to neglect the beneficial impact of dropout events and perform clustering from a solely quantitative perspective. Here, we introduce a novel method named scQA, notable for its ability to concurrently identify cell types and cell type-specific key genes from both qualitative and quantitative perspectives. In contrast to other methods, scQA not only identifies cell types but also extracts key genes associated with these cell types, enabling bidirectional clustering for scRNA-seq data. Through an iterative process, our approach aims to minimize the number of landmarks to approximately a dozen while maximizing the inclusion of quasi-trend-preserved genes with dropouts both qualitatively and quantitatively. It then clusters cells by employing an ingenious label propagation strategy, obviating the requirement for a predetermined number of cell types. Validated on 20 publicly available scRNA-seq datasets, scQA consistently outperforms other salient tools. Furthermore, we confirm the effectiveness and potential biological significance of the identified key genes through both external and internal validation. In conclusion, scQA emerges as a valuable tool for investigating cell heterogeneity due to its distinctive fusion of qualitative and quantitative facets, along with bidirectional clustering capabilities. Furthermore, it can be seamlessly integrated into broader scRNA-seq analyses. The source code is publicly available at https://github.com/LD-Lyndee/scQA.
Affiliation(s)
- Di Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
- Qinglin Mei
  - MOE Key Laboratory of Bioinformatics, BNRIST Bioinformatics Division, Department of Automation, Tsinghua University, Beijing 100084, China
- Guojun Li
  - Research Center for Mathematics and Interdisciplinary Sciences, Shandong University, Qingdao 266237, China
|
13
|
Luo Q, Zhang S, Butt H, Chen Y, Jiang H, An L. PhyImpute and UniFracImpute: two imputation approaches incorporating phylogeny information for microbial count data. Brief Bioinform 2024; 26:bbae653. [PMID: 39708838 DOI: 10.1093/bib/bbae653] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2024] [Revised: 11/16/2024] [Accepted: 12/05/2024] [Indexed: 12/23/2024] Open
Abstract
Sequencing-based microbial count data analysis is a challenging task due to the presence of numerous non-biological zeros, which can impede downstream analysis. To tackle this issue, we introduce two novel approaches, PhyImpute and UniFracImpute, which leverage similar microbial samples to identify and impute non-biological zeros in microbial count data. Our proposed methods use the probability of non-biological zeros and phylogenetic trees to estimate sample-to-sample similarity. To evaluate the performance of our proposed methods, we conduct experiments using both simulated and real microbial data. The results demonstrate that PhyImpute and UniFracImpute outperform existing methods in recovering the zeros and empowering downstream analyses such as differential abundance analysis and disease status classification.
Affiliation(s)
- Qianwen Luo
  - Department of Biosystems Engineering, University of Arizona, Tucson, AZ 85721, United States
- Shanshan Zhang
  - Interdisciplinary Program in Statistics and Data Science, University of Arizona, Tucson, AZ 85721, United States
- Hamza Butt
  - Department of Epidemiology and Biostatistics, University of Arizona, Tucson, AZ 85721, United States
- Yin Chen
  - Department of Pharmacology and Toxicology, School of Pharmacy, University of Arizona, Tucson, AZ 85721, United States
- Hongmei Jiang
  - Department of Statistics and Data Science, Northwestern University, Evanston, IL 60208, United States
- Lingling An
  - Department of Biosystems Engineering, University of Arizona, Tucson, AZ 85721, United States
  - Interdisciplinary Program in Statistics and Data Science, University of Arizona, Tucson, AZ 85721, United States
  - Department of Epidemiology and Biostatistics, University of Arizona, Tucson, AZ 85721, United States
|
14
|
Sharifitabar M, Kazempour S, Razavian J, Sajedi S, Solhjoo S, Zare H. A deep neural network to de-noise single-cell RNA sequencing data. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.11.20.624552. [PMID: 39605470 PMCID: PMC11601639 DOI: 10.1101/2024.11.20.624552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]
Abstract
Single-cell RNA sequencing (scRNA-seq), a powerful technique for investigating the transcriptome of individual cells, enables the discovery of heterogeneous cell populations, rare cell types, and transcriptional dynamics in separate cells. Yet, scRNA-seq data analysis is limited by the problem of measurement dropouts, i.e., genes displaying zero expression levels. We introduce ZiPo, a deep artificial neural network for rate estimation and library size prediction in scRNA-seq data which incorporates adjustable zero inflation in the distribution to capture the dropouts. ZiPo builds upon established concepts, including using deep autoencoders and adopting the Poisson and negative binomial distributions, by taking advantage of novel strategies, including library size prediction and residual connections, to improve the overall performance. A significant innovation of ZiPo is the introduction of a scale-invariant loss term, making the weights sparse and, hence, the model biologically more interpretable. ZiPo quickly handles vast singular and mixed datasets, with the processing time directly proportional to the number of cells. In this paper, we demonstrate the power of ZiPo on three datasets and show its advantages over other current techniques. The code used to produce the results in this manuscript is available at https://bitbucket.org/habilzare/alzheimer/src/master/code/deep/ZiPo/.
|
15
|
Brochu HN, Smith E, Jeong S, Carlson M, Hansen SG, Tisoncik-Go J, Law L, Picker LJ, Gale M, Peng X. Pre-challenge gut microbial signature predicts RhCMV/SIV vaccine efficacy in rhesus macaques. Microbiol Spectr 2024; 12:e0128524. [PMID: 39345211 PMCID: PMC11537114 DOI: 10.1128/spectrum.01285-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2024] [Accepted: 08/21/2024] [Indexed: 10/01/2024] Open
Abstract
Rhesus cytomegalovirus expressing simian immunodeficiency virus (RhCMV/SIV) vaccines protect ~59% of vaccinated rhesus macaques against repeated limiting-dose intra-rectal exposure with highly pathogenic SIVmac239M, but the exact mechanism responsible for the vaccine efficacy is unknown. It is becoming evident that complex interactions exist between gut microbiota and the host immune system. Here, we aimed to investigate whether the rhesus gut microbiome impacts RhCMV/SIV vaccine-induced protection. Three groups of 15 rhesus macaques naturally pre-exposed to RhCMV were vaccinated with RhCMV/SIV vaccines. Rectal swabs were collected longitudinally both before SIV challenge (after vaccination) and post-challenge and were profiled using 16S rRNA-based microbiome analysis. We identified ~2,400 16S rRNA amplicon sequence variants (ASVs), representing potential bacterial species/strains. Global gut microbial profiles were strongly associated with each of the three vaccination groups, and all animals tended to maintain consistent profiles throughout the pre-challenge phase. Despite vaccination group differences, by using newly developed compositional data analysis techniques, we identified a common gut microbial signature predictive of vaccine protection outcome across the three vaccination groups. Part of this microbial signature persisted even after SIV challenge. We also observed a strong correlation between this microbial signature and an early signature derived from whole blood transcriptomes in the same animals. Our findings indicate that changes in gut microbiomes are associated with RhCMV/SIV vaccine-induced protection and early host response to vaccination in rhesus macaques.
IMPORTANCE The human immunodeficiency virus (HIV) has infected millions of people worldwide. Unfortunately, there is still no vaccine that can prevent or treat HIV infection. A promising pre-clinical HIV vaccine based on rhesus cytomegalovirus (RhCMV) expressing simian immunodeficiency virus (SIV) antigens (RhCMV/SIV) provides sustained, durable protection against SIV challenge in ~59% of vaccinated rhesus macaques. There is an urgent need to understand the cause of this protection vs non-protection outcome. In this study, we profiled the gut microbiomes of 45 RhCMV/SIV vaccinated rhesus macaques and identified gut microbial signatures that were predictive of RhCMV/SIV vaccination groups and vaccine protection outcomes. These vaccine protection-associated microbial features were significantly correlated with early vaccine-induced host immune signatures in whole blood from the same animals. These findings show that the gut microbiome may be involved in RhCMV/SIV vaccine-induced protection, warranting further research into the impact of the gut microbiome in human vaccine trials.
Affiliation(s)
- Hayden N. Brochu
  - Department of Molecular Biomedical Sciences, North Carolina State University College of Veterinary Medicine, Raleigh, North Carolina, USA
  - Bioinformatics Graduate Program, North Carolina State University, Raleigh, North Carolina, USA
- Elise Smith
  - Department of Immunology, University of Washington, Seattle, Washington, USA
- Sangmi Jeong
  - Department of Molecular Biomedical Sciences, North Carolina State University College of Veterinary Medicine, Raleigh, North Carolina, USA
  - Bioinformatics Graduate Program, North Carolina State University, Raleigh, North Carolina, USA
- Michelle Carlson
  - Department of Immunology, University of Washington, Seattle, Washington, USA
- Scott G. Hansen
  - Vaccine and Gene Therapy Institute, Oregon Health & Science University, Beaverton, Oregon, USA
- Jennifer Tisoncik-Go
  - Department of Immunology, University of Washington, Seattle, Washington, USA
  - Center for Innate Immunity and Immune Disease, University of Washington, Seattle, Washington, USA
- Lynn Law
  - Department of Immunology, University of Washington, Seattle, Washington, USA
  - Center for Innate Immunity and Immune Disease, University of Washington, Seattle, Washington, USA
- Louis J. Picker
  - Vaccine and Gene Therapy Institute, Oregon Health & Science University, Beaverton, Oregon, USA
- Michael Gale
  - Department of Immunology, University of Washington, Seattle, Washington, USA
  - Center for Innate Immunity and Immune Disease, University of Washington, Seattle, Washington, USA
  - Washington National Primate Research Center, University of Washington, Seattle, Washington, USA
- Xinxia Peng
  - Department of Molecular Biomedical Sciences, North Carolina State University College of Veterinary Medicine, Raleigh, North Carolina, USA
  - Bioinformatics Graduate Program, North Carolina State University, Raleigh, North Carolina, USA
  - Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA
|
16
|
Wang M, Fontaine S, Jiang H, Li G. ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models. Bioinformatics 2024; 40:btae661. [PMID: 39509330 PMCID: PMC11959182 DOI: 10.1093/bioinformatics/btae661] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2024] [Revised: 10/01/2024] [Accepted: 11/05/2024] [Indexed: 11/15/2024] Open
Abstract
MOTIVATION Microbiome differential abundance analysis (DAA) remains a challenging problem despite the many methods proposed in the literature. The excessive zeros and compositionality of metagenomics data are the two main challenges for DAA. RESULTS We propose a novel method called "Analysis of Microbiome Differential Abundance by Pooling Tobit Models" (ADAPT) to overcome these two challenges. ADAPT interprets zero counts as left-censored observations to avoid unfounded assumptions and complex models. ADAPT also encompasses a theoretically justified way of selecting non-differentially abundant microbiome taxa as a reference to reveal differentially abundant taxa while avoiding false discoveries. We generate synthetic data using independent simulation frameworks to show that ADAPT has more consistent false discovery rate control and higher statistical power than competitors. We use ADAPT to analyze 16S rRNA sequencing of saliva samples and shotgun metagenomics sequencing of plaque samples collected from infants in the COHRA2 study. The results provide novel insights into the association between the oral microbiome and early childhood dental caries. AVAILABILITY AND IMPLEMENTATION The R package ADAPT can be installed from Bioconductor at https://bioconductor.org/packages/release/bioc/html/ADAPT.html or from GitHub at https://github.com/mkbwang/ADAPT. The source code for the simulation studies and real data analysis is available at https://github.com/mkbwang/ADAPT_example.
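To make the idea of treating zeros as left-censored observations concrete, the sketch below evaluates the log-likelihood of a generic Tobit (left-censored normal) model. It is a textbook illustration under assumptions made here, not the ADAPT package: the function names, the example values, and the censoring limit are hypothetical, and ADAPT's reference-taxa selection and pooling steps are not reproduced.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution with mean mu and standard deviation sigma."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_cdf(x, mu, sigma):
    """Cumulative distribution function of the same normal distribution."""
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def tobit_loglik(values, censored, mu, sigma, limit):
    """Log-likelihood of a left-censored normal model.

    For an uncensored observation the normal density at the observed value is used.
    For a censored observation only "value <= limit" is known, so the contribution
    is the probability mass below the limit.
    """
    total = 0.0
    for y, is_censored in zip(values, censored):
        if is_censored:
            total += math.log(normal_cdf(limit, mu, sigma))
        else:
            total += normal_logpdf(y, mu, sigma)
    return total


# Hypothetical log-scale abundances; the third sample had a zero count and is
# treated as censored at an assumed detection limit of 0.0 on the log scale.
log_abundance = [2.1, 1.4, 0.0, 3.0]
is_censored = [False, False, True, False]
loglik = tobit_loglik(log_abundance, is_censored, mu=1.5, sigma=1.0, limit=0.0)
```

The point of the construction is that a zero contributes information ("below the limit") without being assigned an arbitrary replacement value.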
Affiliation(s)
- Mukai Wang
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan, 48109, United States
- Simon Fontaine
  - Department of Statistics, University of Michigan, 1085 South University, Ann Arbor, Michigan, 48109, United States
- Hui Jiang
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan, 48109, United States
- Gen Li
  - Department of Biostatistics, University of Michigan, 1415 Washington Heights, Ann Arbor, Michigan, 48109, United States
|
17
|
Özden F, Minary P. Learning to quantify uncertainty in off-target activity for CRISPR guide RNAs. Nucleic Acids Res 2024; 52:e87. [PMID: 39275984 PMCID: PMC11472043 DOI: 10.1093/nar/gkae759] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2024] [Revised: 08/07/2024] [Accepted: 08/23/2024] [Indexed: 09/16/2024] Open
Abstract
CRISPR-based genome editing technologies have revolutionised the field of molecular biology, offering unprecedented opportunities for precise genetic manipulation. However, off-target effects remain a significant challenge, potentially leading to unintended consequences and limiting the applicability of CRISPR-based genome editing technologies in clinical settings. Current literature predominantly focuses on point predictions for off-target activity, which may not fully capture the range of possible outcomes and associated risks. Here, we present crispAI, a neural network architecture-based approach for predicting uncertainty estimates for off-target cleavage activity, providing a more comprehensive risk assessment and facilitating improved decision-making in single guide RNA (sgRNA) design. Our approach makes use of the count noise model Zero Inflated Negative Binomial (ZINB) to model the uncertainty in the off-target cleavage activity data. In addition, we present the first-of-its-kind genome-wide sgRNA efficiency score, crispAI-aggregate, enabling prioritization among sgRNAs with similar point aggregate predictions by providing richer information compared to existing aggregate scores. We show that uncertainty estimates of our approach are calibrated and its predictive performance is superior to the state-of-the-art in silico off-target cleavage activity prediction methods. The tool and the trained models are available at https://github.com/furkanozdenn/crispr-offtarget-uncertainty.
Affiliation(s)
- Furkan Özden
  - Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK
- Peter Minary
  - Department of Computer Science, University of Oxford, Oxford OX1 3QD, UK
|
18
|
González A, Fullaondo A, Odriozola A. Host genetics and microbiota data analysis in colorectal cancer research. ADVANCES IN GENETICS 2024; 112:31-81. [PMID: 39396840 DOI: 10.1016/bs.adgen.2024.08.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/15/2024]
Abstract
Colorectal cancer (CRC) is a heterogeneous disease with a complex aetiology influenced by a myriad of genetic and environmental factors. Despite advances in CRC research, it is a major burden of disease, with the second highest incidence and third leading cause of cancer deaths worldwide. To individualise diagnosis, prognosis, and treatment of CRC, developing new strategies combining precision medicine and bioinformatic procedures is promising. Precision medicine is based on omics technologies and aims to individualise the management of CRC based on patient host genetic characteristics and microbiota. Bioinformatics is central to the application of personalised medicine because it enables the analysis of large datasets generated by these technologies. At the level of host genetics, bioinformatics allows the identification of mutations, genes, molecular pathways, biomarkers and drugs relevant to colorectal carcinogenesis. At the microbiota level, bioinformatics is fundamental to analysing microbial communities' composition and functionality and developing biomarkers and personalised microbiota-based therapies. This paper explores the host and microbiota genetic data analysis in CRC research.
Affiliation(s)
- Adriana González
  - Hologenomics Research Group, Department of Genetics, Physical Anthropology, and Animal Physiology, University of the Basque Country, Spain
- Asier Fullaondo
  - Hologenomics Research Group, Department of Genetics, Physical Anthropology, and Animal Physiology, University of the Basque Country, Spain
- Adrian Odriozola
  - Hologenomics Research Group, Department of Genetics, Physical Anthropology, and Animal Physiology, University of the Basque Country, Spain
|
19
|
Wang Z, Lloyd D, Zhao S, Motsinger-Reif A. Taxanorm: a novel taxa-specific normalization approach for microbiome data. BMC Bioinformatics 2024; 25:304. [PMID: 39285319 PMCID: PMC11406911 DOI: 10.1186/s12859-024-05918-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/25/2024] [Accepted: 08/28/2024] [Indexed: 09/19/2024] Open
Abstract
BACKGROUND In high-throughput sequencing studies, sequencing depth, which quantifies the total number of reads, varies across samples. Unequal sequencing depth can obscure true biological signals of interest and prevent direct comparisons between samples. To remove variability due to differential sequencing depth, taxa counts are usually normalized before downstream analysis. However, most existing normalization methods scale counts using size factors that are sample specific but not taxa specific, which can result in over- or under-correction for some taxa. RESULTS We developed TaxaNorm, a novel normalization method based on a zero-inflated negative binomial model. This method assumes that the effects of sequencing depth on mean and dispersion vary across taxa. Incorporating the zero-inflation component can better capture the nature of microbiome data. We also propose two corresponding diagnostic tests on the varying sequencing depth effect for validation. We find that TaxaNorm achieves performance comparable to existing methods in most simulation scenarios in downstream analysis and reaches higher power in some cases. Specifically, it balances power and false discovery control well. When applied to a real dataset, TaxaNorm shows improved performance in correcting technical bias. CONCLUSION TaxaNorm addresses both sample- and taxon-specific bias by introducing an appropriate regression framework for microbiome data, which aids in data interpretation and visualization. The 'TaxaNorm' R package is freely available through the CRAN repository https://CRAN.R-project.org/package=TaxaNorm and the source code can be downloaded at https://github.com/wangziyue57/TaxaNorm.
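For readers unfamiliar with the zero-inflated negative binomial (ZINB) distribution that underlies this method, the following sketch evaluates its probability mass function. It is a generic illustration, not the TaxaNorm regression model: the parameterization by `mean` and `size` (variance = mean + mean²/size), the mixing weight `pi_zero`, and the example values are assumptions chosen here, and the depth-dependent regression described in the abstract is not reproduced.

```python
import math


def nb_logpmf(k, mean, size):
    """Log PMF of a negative binomial with the given mean and size (mean must be > 0).

    Under this parameterization the variance is mean + mean**2 / size.
    """
    p = size / (size + mean)
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(p) + k * math.log(1.0 - p))


def zinb_pmf(k, mean, size, pi_zero):
    """PMF of a zero-inflated negative binomial.

    With probability pi_zero the count is a structural zero; otherwise it is
    drawn from the negative binomial, which can also produce zeros.
    """
    nb = math.exp(nb_logpmf(k, mean, size))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb


# Hypothetical parameters for one taxon: mean 5 reads, size 2, 30% structural zeros.
prob_zero = zinb_pmf(0, mean=5.0, size=2.0, pi_zero=0.3)
prob_three = zinb_pmf(3, mean=5.0, size=2.0, pi_zero=0.3)
```

The zero-inflation term raises the probability of observing a zero above what the negative binomial alone would predict, which is why such models are used for sparse count tables.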
Affiliation(s)
- Ziyue Wang
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, 27709, USA
  - Department of Population Health, NYU Grossman School of Medicine, New York, NY, 10016, USA
- Dillon Lloyd
  - Department of Biological Sciences and Statistics, North Carolina State University, Raleigh, NC, 27695, USA
  - Bioinformatics Research Center, North Carolina State University, Raleigh, NC, 27695, USA
- Shanshan Zhao
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, 27709, USA
- Alison Motsinger-Reif
  - Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, Durham, NC, 27709, USA
|
20
|
Grun CN, Jain R, Schniederberend M, Shoemaker CB, Nelson B, Kazmierczak BI. Bacterial cell surface characterization by phage display coupled to high-throughput sequencing. Nat Commun 2024; 15:7502. [PMID: 39209859 PMCID: PMC11362561 DOI: 10.1038/s41467-024-51912-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/02/2023] [Accepted: 08/19/2024] [Indexed: 09/04/2024] Open
Abstract
The remarkable capacity of bacteria to adapt in response to selective pressures drives antimicrobial resistance. Pseudomonas aeruginosa illustrates this point, establishing chronic infections during which it evolves to survive antimicrobials and evade host defenses. Many adaptive changes occur on the P. aeruginosa cell surface, but methods to identify these are limited. Here we combine phage display with high-throughput DNA sequencing to create Phage-seq, a high-throughput, multiplexed technology for surveying bacterial cell surfaces. By applying phage display panning to hundreds of bacterial genotypes and analyzing the dynamics of the phage display selection process, we capture important biological information about cell surfaces. This approach also yields camelid single-domain antibodies that recognize key P. aeruginosa virulence factors on live cells. These antibodies have numerous potential applications in diagnostics and therapeutics. We propose that Phage-seq establishes a powerful paradigm for studying the bacterial cell surface by identifying and profiling many surface features in parallel.
Affiliation(s)
- Casey N Grun
  - Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT, USA
- Ruchi Jain
  - Department of Medicine, Section of Infectious Diseases, Yale University School of Medicine, New Haven, CT, USA
  - Piton Therapeutics, Watertown, MA, USA
- Maren Schniederberend
  - Department of Medicine, Section of Infectious Diseases, Yale University School of Medicine, New Haven, CT, USA
- Charles B Shoemaker
  - Department of Infectious Disease and Global Health, Tufts Cummings School of Veterinary Medicine, North Grafton, MA, USA
- Bryce Nelson
  - Department of Pharmacology, Yale University School of Medicine, New Haven, CT, USA
  - Orion Corporation, Turku, Finland
- Barbara I Kazmierczak
  - Department of Microbial Pathogenesis, Yale University School of Medicine, New Haven, CT, USA
  - Department of Medicine, Section of Infectious Diseases, Yale University School of Medicine, New Haven, CT, USA
|
21
|
Srivastava P, Benegas Coll M, Götz S, Nueda MJ, Conesa A. scMaSigPro: Differential Expression Analysis along Single-Cell Trajectories. Bioinformatics 2024; 40:btae443. [PMID: 38976653 PMCID: PMC11269465 DOI: 10.1093/bioinformatics/btae443] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/02/2024] [Revised: 06/27/2024] [Accepted: 07/04/2024] [Indexed: 07/10/2024] Open
Abstract
MOTIVATION Understanding the dynamics of gene expression across different cellular states is crucial for discerning the mechanisms underneath cellular differentiation. Genes that exhibit variation in mean expression as a function of Pseudotime and between branching trajectories are expected to govern cell fate decisions. We introduce scMaSigPro, a method for the identification of differential gene expression patterns along Pseudotime and branching paths simultaneously. RESULTS We assessed the performance of scMaSigPro using synthetic and public datasets. Our evaluation shows that scMaSigPro outperforms existing methods in controlling the False Positive Rate and is computationally efficient. AVAILABILITY AND IMPLEMENTATION scMaSigPro is available as a free R package (version 4.0 or higher) under the GPL(≥2) license on GitHub at 'github.com/BioBam/scMaSigPro' and archived with version 0.03 on Zenodo at 'zenodo.org/records/12568922'.
Affiliation(s)
- Priyansh Srivastava
  - BioBam Bioinformatics S.L., Valencia, 46024, Spain
  - Department of Computer Science, University of Valencia, Valencia, 46100, Spain
- Stefan Götz
  - BioBam Bioinformatics S.L., Valencia, 46024, Spain
- María José Nueda
  - Mathematics Department, University of Alicante, Alicante, 03690, Spain
- Ana Conesa
  - Institute for Integrative Systems Biology (I2SysBio), Consejo Superior de Investigaciones Científicas (CSIC), Paterna, 46980, Spain
|
22
|
Li H, Zhu B, Jiang X, Guo L, Xie Y, Xu L, Li Q. An interpretable Bayesian clustering approach with feature selection for analyzing spatially resolved transcriptomics data. Biometrics 2024; 80:ujae066. [PMID: 39073775 DOI: 10.1093/biomtc/ujae066] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2023] [Revised: 05/13/2024] [Accepted: 07/07/2024] [Indexed: 07/30/2024]
Abstract
Recent breakthroughs in spatially resolved transcriptomics (SRT) technologies have enabled comprehensive molecular characterization at the spot or cellular level while preserving spatial information. Cells are the fundamental building blocks of tissues, organized into distinct yet connected components. Although many non-spatial and spatial clustering approaches have been used to partition the entire region into mutually exclusive spatial domains based on the SRT high-dimensional molecular profile, most require an ad hoc selection of less interpretable dimensional-reduction techniques. To overcome this challenge, we propose a zero-inflated negative binomial mixture model to cluster spots or cells based on their molecular profiles. To increase interpretability, we employ a feature selection mechanism to provide a low-dimensional summary of the SRT molecular profile in terms of discriminating genes that shed light on the clustering result. We further incorporate the SRT geospatial profile via a Markov random field prior. We demonstrate how this joint modeling strategy improves clustering accuracy, compared with alternative state-of-the-art approaches, through simulation studies and 3 real data applications.
Affiliation(s)
- Huimin Li
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States
- Bencong Zhu
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States
  - Department of Statistics, The Chinese University of Hong Kong, Hong Kong SAR, China
- Xi Jiang
  - Department of Statistics and Data Science, Southern Methodist University, Dallas, TX 75205, United States
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Lei Guo
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Yang Xie
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Lin Xu
  - Quantitative Biomedical Research Center, Peter O'Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Qiwei Li
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States
|
23
|
Liu Y, Fachrul M, Inouye M, Méric G. Harnessing human microbiomes for disease prediction. Trends Microbiol 2024; 32:707-719. [PMID: 38246848 DOI: 10.1016/j.tim.2023.12.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Revised: 12/12/2023] [Accepted: 12/12/2023] [Indexed: 01/23/2024]
Abstract
The human microbiome has been increasingly recognized as having potential use for disease prediction. Predicting the risk, progression, and severity of diseases holds promise to transform clinical practice, empower patient decisions, and reduce the burden of various common diseases, as has been demonstrated for cardiovascular disease or breast cancer. Combining multiple modifiable and non-modifiable risk factors, including high-dimensional genomic data, has been traditionally favored, but few studies have incorporated the human microbiome into models for predicting the prospective risk of disease. Here, we review research into the use of the human microbiome for disease prediction with a particular focus on prospective studies as well as the modulation and engineering of the microbiome as a therapeutic strategy.
Affiliation(s)
- Yang Liu
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia; Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK
| | - Muhamad Fachrul
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Clinical Pathology, Melbourne Medical School, The University of Melbourne, Melbourne, Victoria, Australia; Human Genomics and Evolution Unit, St Vincent's Institute of Medical Research, Victoria, Australia; Melbourne Integrative Genomics, University of Melbourne, Parkville, Victoria, Australia; School of BioSciences, University of Melbourne, Parkville, Victoria, Australia
| | - Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, UK; British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge, UK; Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge, UK; British Heart Foundation Cambridge Centre of Research Excellence, School of Clinical Medicine, University of Cambridge, Cambridge, UK
| | - Guillaume Méric
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, Victoria, Australia; Department of Cardiometabolic Health, University of Melbourne, Melbourne, Victoria, Australia; Central Clinical School, Monash University, Melbourne, Victoria, Australia; Department of Medical Science, Molecular Epidemiology, Uppsala University, Uppsala, Sweden; Department of Cardiovascular Research, Translation, and Implementation, La Trobe University, Melbourne, Victoria, Australia.
|
24
|
Abegaz F, Abedini D, White F, Guerrieri A, Zancarini A, Dong L, Westerhuis JA, van Eeuwijk F, Bouwmeester H, Smilde AK. A strategy for differential abundance analysis of sparse microbiome data with group-wise structured zeros. Sci Rep 2024; 14:12433. [PMID: 38816496 PMCID: PMC11139916 DOI: 10.1038/s41598-024-62437-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/28/2023] [Accepted: 05/16/2024] [Indexed: 06/01/2024] Open
Abstract
Comparing the abundance of microbial communities between different groups or obtained under different experimental conditions using count sequence data is a challenging task due to various issues such as inflated zero counts, overdispersion, and non-normality. Several methods and procedures based on counts, their transformation and compositionality have been proposed in the literature to detect differentially abundant species in datasets containing hundreds to thousands of microbial species. Despite efforts to address the large numbers of zeros present in microbiome datasets, even after careful data preprocessing, the performance of existing methods is impaired by the presence of inflated zero counts and group-wise structured zeros (i.e. all zero counts in a group). We propose, and validate using extensive simulations, an approach combining two differential abundance testing methods, namely DESeq2-ZINBWaVE and DESeq2, to address the issues of zero-inflation and group-wise structured zeros, respectively. This combined approach was subsequently applied successfully to two plant microbiome datasets, revealing a number of taxa as interesting candidates for further experimental validation.
Affiliation(s)
- Fentaw Abegaz
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands.
- Biometris, Wageningen University & Research, 6708 PB, Wageningen, The Netherlands.
| | - Davar Abedini
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Fred White
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Alessandra Guerrieri
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Anouk Zancarini
- IGEPP, INRAE, Institut Agro, Univ Rennes, 35653, Le Rheu, France
| | - Lemeng Dong
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Johan A Westerhuis
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Fred van Eeuwijk
- Biometris, Wageningen University & Research, 6708 PB, Wageningen, The Netherlands
| | - Harro Bouwmeester
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
| | - Age K Smilde
- Swammerdam Institute for Life Sciences, University of Amsterdam, 1098 XH, Amsterdam, The Netherlands
|
25
|
Chi J, Ye J, Zhou Y. A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data. Front Microbiol 2024; 15:1394204. [PMID: 38873138 PMCID: PMC11173601 DOI: 10.3389/fmicb.2024.1394204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 05/20/2024] [Indexed: 06/15/2024] Open
Abstract
Motivation: High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis. Results: We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.
Affiliation(s)
- Jinling Chi
- School of Mathematics and Statistics, Xidian University, Xi'an, China
| | - Jimin Ye
- School of Mathematics and Statistics, Xidian University, Xi'an, China
| | - Ying Zhou
- School of Mathematical Sciences, Heilongjiang University, Harbin, China
|
26
|
Algavi YM, Borenstein E. Relative dispersion ratios following fecal microbiota transplant elucidate principles governing microbial migration dynamics. Nat Commun 2024; 15:4447. [PMID: 38789466 PMCID: PMC11126695 DOI: 10.1038/s41467-024-48717-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/12/2023] [Accepted: 05/08/2024] [Indexed: 05/26/2024] Open
Abstract
Microorganisms frequently migrate from one ecosystem to another. Yet, despite the potential importance of this process in modulating the environment and the microbial ecosystem, our understanding of the fundamental forces that govern microbial dispersion is still lacking. Moreover, while theoretical models and in-vitro experiments have highlighted the contribution of species interactions to community assembly, identifying such interactions in vivo, specifically in communities as complex as the human gut, remains challenging. To address this gap, here we introduce a robust and rigorous computational framework, termed Relative Dispersion Ratio (RDR) analysis, and leverage data from well-characterized fecal microbiota transplant trials to pinpoint dependencies between taxa during the colonization of the human gastrointestinal tract. Our analysis identifies numerous pairwise dependencies between co-colonizing microbes during migration between gastrointestinal environments. We further demonstrate that identified dependencies agree with previously reported findings from in-vitro experiments and population-wide distribution patterns. Finally, we explore metabolic dependencies between these taxa and characterize the functional properties that facilitate effective dispersion. Collectively, our findings provide insights into the principles and determinants of community dynamics following ecological translocation, informing potential opportunities for precise community design.
Affiliation(s)
- Yadid M Algavi
- Faculty of Medical & Health Sciences, Tel Aviv University, Tel Aviv, Israel
| | - Elhanan Borenstein
- Faculty of Medical & Health Sciences, Tel Aviv University, Tel Aviv, Israel.
- The Blavatnik School of Computer Science, Tel Aviv University, Tel Aviv, Israel.
- Santa Fe Institute, Santa Fe, NM, USA.
|
27
|
Wang M, Fontaine S, Jiang H, Li G. ADAPT: Analysis of Microbiome Differential Abundance by Pooling Tobit Models. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.14.594186. [PMID: 38798558 PMCID: PMC11118451 DOI: 10.1101/2024.05.14.594186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/29/2024]
Abstract
Microbiome differential abundance analysis remains a challenging problem despite multiple methods proposed in the literature. The excessive zeros and compositionality of metagenomics data are two main challenges for differential abundance analysis. We propose a novel method called "analysis of differential abundance by pooling Tobit models" (ADAPT) to overcome these two challenges. ADAPT uniquely treats zero counts as left-censored observations to facilitate computation and enhance interpretation. ADAPT also encompasses a theoretically justified way of selecting non-differentially abundant microbiome taxa as a reference for hypothesis testing. We generate synthetic data using independent simulation frameworks to show that ADAPT has more consistent false discovery rate control and higher statistical power than competitors. We use ADAPT to analyze 16S rRNA sequencing of saliva samples and shotgun metagenomics sequencing of plaque samples collected from infants in the COHRA2 study. The results provide novel insights into the association between the oral microbiome and early childhood dental caries.
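The idea of treating zero counts as left-censored observations can be illustrated with a generic Tobit log-likelihood. The sketch below is a minimal illustration under stated assumptions, not the ADAPT implementation: the function names, the normal error model, and the single shared mean `mu` are all hypothetical choices made for the example. A censored observation contributes the probability mass below the detection limit, while an observed value contributes the usual normal density.

```python
import math


def normal_logpdf(x, mu, sigma):
    """Log density of a normal distribution at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def normal_logcdf(x, mu, sigma):
    """Log of P(X <= x) for a normal distribution (simple, not tail-stable)."""
    z = (x - mu) / sigma
    return math.log(0.5 * (1.0 + math.erf(z / math.sqrt(2.0))))


def tobit_loglik(values, censored, mu, sigma, limit):
    """Left-censored (Tobit) log-likelihood.

    values   -- observed values (ignored where censored is True)
    censored -- flags marking observations known only to lie below `limit`
    limit    -- hypothetical detection limit used for censored observations
    """
    total = 0.0
    for y, is_censored in zip(values, censored):
        if is_censored:
            # A censored zero only tells us the latent value is below the limit.
            total += normal_logcdf(limit, mu, sigma)
        else:
            total += normal_logpdf(y, mu, sigma)
    return total
```

In practice the mean would depend on covariates and the likelihood would be maximized numerically; the sketch only shows how censored and observed entries enter the likelihood differently.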
Affiliation(s)
- Mukai Wang
- Department of Biostatistics, University of Michigan, Ann Arbor, 48109, MI, USA
| | - Simon Fontaine
- Department of Statistics, University of Michigan, Ann Arbor, 48109, MI, USA
| | - Hui Jiang
- Department of Biostatistics, University of Michigan, Ann Arbor, 48109, MI, USA
| | - Gen Li
- Department of Biostatistics, University of Michigan, Ann Arbor, 48109, MI, USA
|
28
|
Cuevas-Diaz Duran R, Wei H, Wu J. Data normalization for addressing the challenges in the analysis of single-cell transcriptomic datasets. BMC Genomics 2024; 25:444. [PMID: 38711017 PMCID: PMC11073985 DOI: 10.1186/s12864-024-10364-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2023] [Accepted: 04/29/2024] [Indexed: 05/08/2024] Open
Abstract
BACKGROUND: Normalization is a critical step in the analysis of single-cell RNA-sequencing (scRNA-seq) datasets. Its main goal is to make gene counts comparable within and between cells. To do so, normalization methods must account for technical and biological variability. Numerous normalization methods have been developed addressing different sources of dispersion and making specific assumptions about the count data. MAIN BODY: The selection of a normalization method has a direct impact on downstream analysis, for example differential gene expression and cluster identification. Thus, the objective of this review is to guide the reader in making an informed decision on the most appropriate normalization method to use. To this aim, we first give an overview of the different single cell sequencing platforms and methods commonly used including isolation and library preparation protocols. Next, we discuss the inherent sources of variability of scRNA-seq datasets. We describe the categories of normalization methods and include examples of each. We also delineate imputation and batch-effect correction methods. Furthermore, we describe data-driven metrics commonly used to evaluate the performance of normalization methods. We also discuss common scRNA-seq methods and toolkits used for integrated data analysis. CONCLUSIONS: According to the correction performed, normalization methods can be broadly classified as within-sample and between-sample algorithms. Moreover, with respect to the mathematical model used, normalization methods can further be classified into global scaling methods, generalized linear models, mixed methods, and machine learning-based methods. Each of these methods has pros and cons and makes different statistical assumptions. However, no single normalization method performs best. Instead, metrics such as silhouette width, K-nearest neighbor batch-effect test, or Highly Variable Genes are recommended to assess the performance of normalization methods.
Affiliation(s)
- Raquel Cuevas-Diaz Duran
- Tecnologico de Monterrey, Escuela de Medicina y Ciencias de la Salud, Monterrey, Nuevo Leon, 64710, Mexico.
| | - Haichao Wei
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA
| | - Jiaqian Wu
- The Vivian L. Smith Department of Neurosurgery, McGovern Medical School, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
- Center for Stem Cell and Regenerative Medicine, UT Brown Foundation Institute of Molecular Medicine, Houston, TX, 77030, USA.
- MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX, 77030, USA.
|
29
|
Brooks TG, Lahens NF, Mrčela A, Grant GR. Challenges and best practices in omics benchmarking. Nat Rev Genet 2024; 25:326-339. [PMID: 38216661 DOI: 10.1038/s41576-023-00679-6] [Citation(s) in RCA: 9] [Impact Index Per Article: 9.0] [Accepted: 11/14/2023] [Indexed: 01/14/2024]
Abstract
Technological advances enabling massively parallel measurement of biological features - such as microarrays, high-throughput sequencing and mass spectrometry - have ushered in the omics era, now in its third decade. The resulting complex landscape of analytical methods has naturally fostered the growth of an omics benchmarking industry. Benchmarking refers to the process of objectively comparing and evaluating the performance of different computational or analytical techniques when processing and analysing large-scale biological data sets, such as transcriptomics, proteomics and metabolomics. With thousands of omics benchmarking studies published over the past 25 years, the field has matured to the point where the foundations of benchmarking have been established and well described. However, generating meaningful benchmarking data and properly evaluating performance in this complex domain remains challenging. In this Review, we highlight some common oversights and pitfalls in omics benchmarking. We also establish a methodology to bring the issues that can be addressed into focus and to be transparent about those that cannot: this takes the form of a spreadsheet template of guidelines for comprehensive reporting, intended to accompany publications. In addition, a survey of recent developments in benchmarking is provided as well as specific guidance for commonly encountered difficulties.
Affiliation(s)
- Thomas G Brooks
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Nicholas F Lahens
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Antonijo Mrčela
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA
| | - Gregory R Grant
- Institute for Translational Medicine and Therapeutics, University of Pennsylvania, Philadelphia, PA, USA.
- Department of Genetics, University of Pennsylvania, Philadelphia, PA, USA.
|
30
|
Zong Y, Zhao H, Wang T. mbDecoda: a debiased approach to compositional data analysis for microbiome surveys. Brief Bioinform 2024; 25:bbae205. [PMID: 38701410 PMCID: PMC11066923 DOI: 10.1093/bib/bbae205] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/18/2023] [Revised: 04/05/2024] [Accepted: 04/15/2024] [Indexed: 05/05/2024] Open
Abstract
Potentially pathogenic or probiotic microbes can be identified by comparing their abundance levels between healthy and diseased populations, or more broadly, by linking microbiome composition with clinical phenotypes or environmental factors. However, in microbiome studies, feature tables provide relative rather than absolute abundance of each feature in each sample, as the microbial loads of the samples and the ratios of sequencing depth to microbial load are both unknown and subject to considerable variation. Moreover, microbiome abundance data are count-valued, often over-dispersed and contain a substantial proportion of zeros. To carry out differential abundance analysis while addressing these challenges, we introduce mbDecoda, a model-based approach for debiased analysis of sparse compositions of microbiomes. mbDecoda employs a zero-inflated negative binomial model, linking mean abundance to the variable of interest through a log link function, and it accommodates the adjustment for confounding factors. To efficiently obtain maximum likelihood estimates of model parameters, an Expectation Maximization algorithm is developed. A minimum coverage interval approach is then proposed to rectify compositional bias, enabling accurate and reliable absolute abundance analysis. Through extensive simulation studies and analysis of real-world microbiome datasets, we demonstrate that mbDecoda compares favorably with state-of-the-art methods in terms of effectiveness, robustness and reproducibility.
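A zero-inflated negative binomial model with a log link, as named in the abstract, can be written down generically. The following is a minimal sketch under stated assumptions and is not the mbDecoda code: the parameter names (`size` for the dispersion, `pi_zero` for the excess-zero probability, `intercept` and `slope` for the linear predictor) are hypothetical, and the mean/dispersion parameterization is one common convention.

```python
import math


def nb_logpmf(k, mu, size):
    """Negative binomial log-pmf with mean `mu` and dispersion `size`."""
    return (math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
            + size * math.log(size / (size + mu))
            + k * math.log(mu / (size + mu)))


def zinb_pmf(k, mu, size, pi_zero):
    """Zero-inflated NB: with probability `pi_zero` the count is a structural zero."""
    nb = math.exp(nb_logpmf(k, mu, size))
    if k == 0:
        return pi_zero + (1.0 - pi_zero) * nb
    return (1.0 - pi_zero) * nb


def mean_from_log_link(intercept, slope, x):
    """Log link: log(mu) = intercept + slope * x, so mu = exp(linear predictor)."""
    return math.exp(intercept + slope * x)


# Example: mean abundance for a sample with covariate value x = 1.
mu = mean_from_log_link(1.0, 0.5, 1.0)
p_zero = zinb_pmf(0, mu, size=2.0, pi_zero=0.3)
```

Zero inflation raises the probability of observing a zero above what the negative binomial alone would give, which is why `p_zero` always exceeds `pi_zero` here.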
Affiliation(s)
- Yuxuan Zong
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
- SJTU-Yale Joint Center of Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Hongyu Zhao
- SJTU-Yale Joint Center of Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Department of Biostatistics, Yale University, New Haven, CT
| | - Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
- SJTU-Yale Joint Center of Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Department of Statistics, Shanghai Jiao Tong University, Shanghai, China
|
31
|
Islam MT, Liu Y, Hassan MM, Abraham PE, Merlet J, Townsend A, Jacobson D, Buell CR, Tuskan GA, Yang X. Advances in the Application of Single-Cell Transcriptomics in Plant Systems and Synthetic Biology. BIODESIGN RESEARCH 2024; 6:0029. [PMID: 38435807 PMCID: PMC10905259 DOI: 10.34133/bdr.0029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2023] [Accepted: 01/28/2024] [Indexed: 03/05/2024] Open
Abstract
Plants are complex systems hierarchically organized and composed of various cell types. To understand the molecular underpinnings of complex plant systems, single-cell RNA sequencing (scRNA-seq) has emerged as a powerful tool for revealing high resolution of gene expression patterns at the cellular level and investigating the cell-type heterogeneity. Furthermore, scRNA-seq analysis of plant biosystems has great potential for generating new knowledge to inform plant biosystems design and synthetic biology, which aims to modify plants genetically/epigenetically through genome editing, engineering, or re-writing based on rational design for increasing crop yield and quality, promoting the bioeconomy and enhancing environmental sustainability. In particular, data from scRNA-seq studies can be utilized to facilitate the development of high-precision Build-Design-Test-Learn capabilities for maximizing the targeted performance of engineered plant biosystems while minimizing unintended side effects. To date, scRNA-seq has been demonstrated in a limited number of plant species, including model plants (e.g., Arabidopsis thaliana), agricultural crops (e.g., Oryza sativa), and bioenergy crops (e.g., Populus spp.). It is expected that future technical advancements will reduce the cost of scRNA-seq and consequently accelerate the application of this emerging technology in plants. In this review, we summarize current technical advancements in plant scRNA-seq, including sample preparation, sequencing, and data analysis, to provide guidance on how to choose the appropriate scRNA-seq methods for different types of plant samples. We then highlight various applications of scRNA-seq in both plant systems biology and plant synthetic biology research. Finally, we discuss the challenges and opportunities for the application of scRNA-seq in plants.
Affiliation(s)
- Md Torikul Islam
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Yang Liu
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Md Mahmudul Hassan
- Department of Genetics and Plant Breeding, Patuakhali Science and Technology University, Dumki, Patuakhali 8602, Bangladesh
| | - Paul E. Abraham
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Jean Merlet
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN 37996, USA
| | - Alice Townsend
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Bredesen Center for Interdisciplinary Research and Graduate Education, University of Tennessee Knoxville, Knoxville, TN 37996, USA
| | - Daniel Jacobson
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - C. Robin Buell
- Center for Applied Genetic Technologies, University of Georgia, Athens, GA 30602, USA
- Department of Crop and Soil Sciences, University of Georgia, Athens, GA 30602, USA
- Institute of Plant Breeding, Genetics, and Genomics, University of Georgia, Athens, GA 30602, USA
| | - Gerald A. Tuskan
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
| | - Xiaohan Yang
- Biosciences Division, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- The Center for Bioenergy Innovation, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
|
32
|
Brochu HN, Smith E, Jeong S, Carlson M, Hansen SG, Tisoncik-Go J, Law L, Picker LJ, Gale M, Peng X. Pre-challenge gut microbial signature predicts RhCMV/SIV vaccine efficacy in rhesus macaques. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.02.27.582186. [PMID: 38464179 PMCID: PMC10925241 DOI: 10.1101/2024.02.27.582186] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/12/2024]
Abstract
Background: RhCMV/SIV vaccines protect ∼59% of vaccinated rhesus macaques against repeated limiting-dose intra-rectal exposure with highly pathogenic SIVmac239M, but the exact mechanism responsible for the vaccine efficacy is not known. It is becoming evident that complex interactions exist between gut microbiota and the host immune system. Here we aimed to investigate if the rhesus gut microbiome impacts RhCMV/SIV vaccine-induced protection. Methods: Three groups of 15 rhesus macaques naturally pre-exposed to RhCMV were vaccinated with RhCMV/SIV vaccines. Rectal swabs were collected longitudinally both before SIV challenge (after vaccination) and post challenge and were profiled using 16S rRNA based microbiome analysis. Results: We identified ∼2,400 16S rRNA amplicon sequence variants (ASVs), representing potential bacterial species/strains. Global gut microbial profiles were strongly associated with each of the three vaccination groups, and all animals tended to maintain consistent profiles throughout the pre-challenge phase. Despite vaccination group differences, using newly developed compositional data analysis techniques we identified a common gut microbial signature predictive of vaccine protection outcome across the three vaccination groups. Part of this microbial signature persisted even after SIV challenge. We also observed a strong correlation between this microbial signature and an early signature derived from whole blood transcriptomes in the same animals. Conclusions: Our findings indicate that changes in gut microbiomes are associated with RhCMV/SIV vaccine-induced protection and early host response to vaccination in rhesus macaques.
|
33
|
Kumar B, Lorusso E, Fosso B, Pesole G. A comprehensive overview of microbiome data in the light of machine learning applications: categorization, accessibility, and future directions. Front Microbiol 2024; 15:1343572. [PMID: 38419630 PMCID: PMC10900530 DOI: 10.3389/fmicb.2024.1343572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Accepted: 01/29/2024] [Indexed: 03/02/2024] Open
Abstract
Metagenomics, metabolomics, and metaproteomics have significantly advanced our knowledge of microbial communities by providing culture-independent insights into their composition and functional potential. However, a critical challenge in this field is the lack of standard and comprehensive metadata associated with raw data, hindering the ability to perform robust data stratifications and consider confounding factors. In this comprehensive review, we categorize publicly available microbiome data into five types: shotgun sequencing, amplicon sequencing, metatranscriptomic, metabolomic, and metaproteomic data. We explore the importance of metadata for data reuse and address the challenges in collecting standardized metadata. We also assess the limitations in metadata collection of existing public repositories that host metagenomic data. This review emphasizes the vital role of metadata in interpreting and comparing datasets and highlights the need for standardized metadata protocols to fully leverage metagenomic data's potential. Furthermore, we explore future directions for implementing Machine Learning (ML) in metadata retrieval, offering promising avenues for a deeper understanding of microbial communities and their ecological roles. Leveraging these tools will enhance our insights into microbial functional capabilities and ecological dynamics in diverse ecosystems. Finally, we emphasize the crucial role of metadata in ML model development.
Affiliation(s)
- Bablu Kumar
- Università degli Studi di Milano, Milan, Italy
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
| | - Erika Lorusso
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
| | - Bruno Fosso
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
| | - Graziano Pesole
- Department of Biosciences, Biotechnology and Environment, University of Bari A. Moro, Bari, Italy
- National Research Council, Institute of Biomembranes, Bioenergetics and Molecular Biotechnologies, Bari, Italy
|
34
|
Church SH, Mah JL, Wagner G, Dunn CW. Normalizing need not be the norm: count-based math for analyzing single-cell data. Theory Biosci 2024; 143:45-62. [PMID: 37947999 DOI: 10.1007/s12064-023-00408-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/05/2023] [Accepted: 10/13/2023] [Indexed: 11/12/2023]
Abstract
Counting mRNA transcripts is a key method of observation in modern biology. With advances in counting transcripts in single cells (single-cell RNA sequencing or scRNA-seq), these data are routinely used to identify cells by their transcriptional profile, and to identify genes with differential cellular expression. Because the total number of transcripts counted per cell can vary for technical reasons, the first step of many commonly used scRNA-seq workflows is to normalize by sequencing depth, transforming counts into proportional abundances. The primary objective of this step is to reshape the data such that cells with similar biological proportions of transcripts end up with similar transformed measurements. But there is growing concern that normalization and other transformations result in unintended distortions that hinder both analyses and the interpretation of results. This has led to an intense focus on optimizing methods for normalization and transformation of scRNA-seq data. Here, we take an alternative approach, by avoiding normalization and transformation altogether. We abandon the use of distances to compare cells, and instead use a restricted algebra, motivated by measurement theory and abstract algebra, that preserves the count nature of the data. We demonstrate that this restricted algebra is sufficient to draw meaningful and practical comparisons of gene expression through the use of the dot product and other elementary operations. This approach sidesteps many of the problems with common transformations, and has the added benefit of being simpler and more intuitive. We implement our approach in the package countland, available in python and R.
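The contrast between depth normalization and count-preserving comparison can be shown on toy data. This is a minimal sketch with invented counts and hypothetical helper names; it does not use or reproduce the countland API. Dividing by sequencing depth makes two cells with the same proportions indistinguishable, whereas the dot product operates on the raw integer counts directly.

```python
def to_proportions(counts):
    """Normalize by sequencing depth: divide each count by the cell's total."""
    total = sum(counts)
    return [c / total for c in counts]


def count_dot(a, b):
    """Dot product of two raw count vectors; the result stays an integer."""
    return sum(x * y for x, y in zip(a, b))


# Toy gene counts for three cells (invented values).
cell_a = [10, 0, 5]
cell_b = [20, 0, 10]   # same proportions as cell_a, twice the depth
cell_c = [0, 7, 1]     # mostly expresses a different gene

same_after_normalizing = to_proportions(cell_a) == to_proportions(cell_b)  # True
similar_pair = count_dot(cell_a, cell_b)      # 250
dissimilar_pair = count_dot(cell_a, cell_c)   # 5
```

The dot product is large when two cells have many counts in the same genes and small otherwise, so it supports comparisons without first converting counts to fractions.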
Affiliation(s)
- Samuel H Church
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Jasmine L Mah
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
- Günter Wagner
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
  - Yale Systems Biology Institute, Yale University, New Haven, CT, USA
  - Department of Obstetrics, Gynecology and Reproductive Sciences, Yale Medical School, New Haven, CT, USA
  - Department of Obstetrics and Gynecology, Wayne State University, Detroit, MI, USA
- Casey W Dunn
  - Department of Ecology and Evolutionary Biology, Yale University, New Haven, CT, USA
|
35
|
Lee TW, Hunter FW, Tsai P, Print CG, Wilson WR, Jamieson SMF. Clonal dynamics limits detection of selection in tumour xenograft CRISPR/Cas9 screens. Cancer Gene Ther 2023; 30:1610-1623. [PMID: 37684549 PMCID: PMC10721547 DOI: 10.1038/s41417-023-00664-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/31/2023] [Revised: 08/08/2023] [Accepted: 08/29/2023] [Indexed: 09/10/2023]
Abstract
Transplantable in vivo CRISPR/Cas9 knockout screens, in which cells are edited in vitro and inoculated into mice to form tumours, allow evaluation of gene function in a cancer model that incorporates the multicellular interactions of the tumour microenvironment. To improve our understanding of the key parameters for success with this method, we investigated the choice of cell line, mouse host, tumour harvesting timepoint and guide RNA (gRNA) library size. We found that high gRNA representation (80-95%) was maintained in an HCT116 subline transduced with the GeCKOv2 whole-genome gRNA library and transplanted into NSG mice when tumours were harvested at early (14 d) but not late (38-43 d) time points. The decreased representation in older tumours was accompanied by large increases in variance in gRNA read counts, with notable expansion of a small number of random clones in each sample. The variable clonal dynamics resulted in a high level of 'noise' that limited the detection of gRNA-based selection. Using simulated datasets derived from our experimental data, we show that considerable reductions in count variance would be achieved with smaller library sizes. Based on our findings, we suggest a pathway to rationally design adequately powered in vivo CRISPR screens for successful evaluation of gene function.
Affiliation(s)
- Tet Woo Lee
  - Auckland Cancer Society Research Centre, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
- Francis W Hunter
  - Auckland Cancer Society Research Centre, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
  - Oncology Therapeutic Area, Janssen Research and Development, Spring House, PA, USA
- Peter Tsai
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
  - Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand
- Cristin G Print
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
  - Department of Molecular Medicine and Pathology, University of Auckland, Auckland, New Zealand
- William R Wilson
  - Auckland Cancer Society Research Centre, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
- Stephen M F Jamieson
  - Auckland Cancer Society Research Centre, University of Auckland, Auckland, New Zealand
  - Maurice Wilkins Centre for Molecular Biodiscovery, University of Auckland, Auckland, New Zealand
  - Department of Pharmacology and Clinical Pharmacology, University of Auckland, Auckland, New Zealand
|
36
|
Liehrmann A, Delannoy E, Launay-Avon A, Gilbault E, Loudet O, Castandet B, Rigaill G. DiffSegR: an RNA-seq data driven method for differential expression analysis using changepoint detection. NAR Genom Bioinform 2023; 5:lqad098. [PMID: 37954572 PMCID: PMC10632193 DOI: 10.1093/nargab/lqad098] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2023] [Revised: 09/27/2023] [Accepted: 10/23/2023] [Indexed: 11/14/2023] Open
Abstract
To fully understand gene regulation, it is necessary to have a thorough understanding of both the transcriptome and the enzymatic and RNA-binding activities that shape it. While many RNA-Seq-based tools have been developed to analyze the transcriptome, most only consider the abundance of sequencing reads along annotated patterns (such as genes). These annotations are typically incomplete, leading to errors in the differential expression analysis. To address this issue, we present DiffSegR, an R package that enables the discovery of transcriptome-wide expression differences between two biological conditions using RNA-Seq data. DiffSegR does not require prior annotation and uses a multiple-changepoint detection algorithm to identify the boundaries of differentially expressed regions in the per-base log2 fold change. In a few minutes of computation, DiffSegR correctly predicted the role of chloroplast ribonuclease Mini-III in rRNA maturation and of chloroplast ribonuclease PNPase in (3'/5')-degradation of rRNA, mRNA and tRNA precursors as well as intron accumulation. We believe DiffSegR will benefit biologists working on transcriptomics as it allows access to information from a layer of the transcriptome overlooked by the classical differential expression analysis pipelines widely used today. DiffSegR is available at https://aliehrmann.github.io/DiffSegR/index.html.
Affiliation(s)
- Arnaud Liehrmann
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
  - Laboratoire de Mathématiques et de Modélisation d’Evry (LaMME), Université d’Evry-Val-d’Essonne, UMR CNRS 8071, ENSIIE, USC INRAE, Evry, 91037, France
- Etienne Delannoy
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
- Alexandra Launay-Avon
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
- Elodie Gilbault
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), 78000, Versailles, France
- Olivier Loudet
  - Université Paris-Saclay, INRAE, AgroParisTech, Institut Jean-Pierre Bourgin (IJPB), 78000, Versailles, France
- Benoît Castandet
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
- Guillem Rigaill
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris-Saclay, CNRS, INRAE, Université Evry, Gif sur Yvette, 91190, France
  - Institute of Plant Sciences Paris-Saclay (IPS2), Université Paris Cité, CNRS, INRAE, Gif sur Yvette, 91190, France
  - Laboratoire de Mathématiques et de Modélisation d’Evry (LaMME), Université d’Evry-Val-d’Essonne, UMR CNRS 8071, ENSIIE, USC INRAE, Evry, 91037, France
|
37
|
Roth C, Venu V, Job V, Lubbers N, Sanbonmatsu KY, Steadman CR, Starkenburg SR. Improved quality metrics for association and reproducibility in chromatin accessibility data using mutual information. BMC Bioinformatics 2023; 24:441. [PMID: 37990143 PMCID: PMC10664258 DOI: 10.1186/s12859-023-05553-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 06/13/2023] [Accepted: 10/30/2023] [Indexed: 11/23/2023] Open
Abstract
BACKGROUND: Correlation metrics are widely utilized in genomics analysis and often implemented with little regard to assumptions of normality, homoscedasticity, and independence of values. This is especially true when comparing values between replicated sequencing experiments that probe chromatin accessibility, such as assays for transposase-accessible chromatin via sequencing (ATAC-seq). Such data can possess several regions across the human genome with little to no sequencing depth and are thus non-normal with a large portion of zero values. Despite their widespread use in the epigenomics field, few studies have evaluated and benchmarked how correlation and association statistics behave across ATAC-seq experiments with known differences or the effects of removing specific outliers from the data. Here, we developed a computational simulation of ATAC-seq data to elucidate the behavior of correlation statistics and to compare their accuracy under set conditions of reproducibility. RESULTS: Using these simulations, we monitored the behavior of several correlation statistics, including Pearson's R and Spearman's $\rho$ coefficients as well as Kendall's $\tau$ and Top-Down correlation. We also tested the behavior of association measures, including the coefficient of determination $R^2$, Kendall's W, and normalized mutual information. Our experiments reveal an insensitivity of most statistics, including Spearman's $\rho$, Kendall's $\tau$, and Kendall's W, to increasing differences between simulated ATAC-seq replicates. The removal of co-zeros (regions lacking mapped sequenced reads) between simulated experiments greatly improves the estimates of correlation and association. After removing co-zeros, the $R^2$ coefficient and normalized mutual information display the best performance, having a closer one-to-one relationship with the known portion of shared, enhanced loci between simulated replicates. When comparing values between experimental ATAC-seq data using a random forest model, mutual information best predicts ATAC-seq replicate relationships. CONCLUSIONS: Collectively, this study demonstrates how measures of correlation and association can behave in epigenomics experiments. We provide improved strategies for quantifying relationships in these increasingly prevalent and important chromatin accessibility assays.
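The co-zero removal step described in this abstract can be sketched as follows. The replicate vectors are invented toy data, and $R^2$ is taken here to be the squared Pearson correlation, which is an assumption about the authors' definition rather than something the abstract states.

```python
def remove_co_zeros(x, y):
    """Drop positions where both replicates are zero (co-zeros)."""
    kept = [(a, b) for a, b in zip(x, y) if not (a == 0 and b == 0)]
    return [a for a, _ in kept], [b for _, b in kept]

def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)

# Hypothetical read counts per region for two replicates (illustrative only).
rep1 = [0, 0, 0, 0, 4, 2, 0, 6]
rep2 = [0, 0, 0, 0, 2, 4, 1, 5]

r2_raw = r_squared(rep1, rep2)
filtered1, filtered2 = remove_co_zeros(rep1, rep2)
r2_filtered = r_squared(filtered1, filtered2)

print(round(r2_raw, 3), round(r2_filtered, 3))
```

In this toy case the shared zeros inflate the apparent agreement between replicates, so the statistic drops once they are removed. Regions where only one replicate is zero are kept, since they carry information about disagreement.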
Affiliation(s)
- Cullen Roth
  - Los Alamos National Laboratory, Genomics and Bioanalytics, Los Alamos, NM, USA
- Vrinda Venu
  - Los Alamos National Laboratory, Climate, Ecosystems, and Environmental Science, Los Alamos, NM, USA
- Vanessa Job
  - Los Alamos National Laboratory, High Performance Computing and Design, Los Alamos, NM, USA
- Nicholas Lubbers
  - Los Alamos National Laboratory, Information Sciences, Los Alamos, NM, USA
- Karissa Y Sanbonmatsu
  - Los Alamos National Laboratory, Theoretical Biology and Biophysics, Los Alamos, NM, USA
- Christina R Steadman
  - Los Alamos National Laboratory, Climate, Ecosystems, and Environmental Science, Los Alamos, NM, USA
- Shawn R Starkenburg
  - Los Alamos National Laboratory, Genomics and Bioanalytics, Los Alamos, NM, USA
|
38
|
Paas-Oliveros E, Hernández-Lemus E, de Anda-Jáuregui G. Computational single cell oncology: state of the art. Front Genet 2023; 14:1256991. [PMID: 38028624 PMCID: PMC10663273 DOI: 10.3389/fgene.2023.1256991] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 07/11/2023] [Accepted: 10/24/2023] [Indexed: 12/01/2023] Open
Abstract
Single cell computational analysis has emerged as a powerful tool in the field of oncology, enabling researchers to decipher the complex cellular heterogeneity that characterizes cancer. By leveraging computational algorithms and bioinformatics approaches, this methodology provides insights into the underlying genetic, epigenetic and transcriptomic variations among individual cancer cells. In this paper, we present a comprehensive overview of single cell computational analysis in oncology, discussing the key computational techniques employed for data processing, analysis, and interpretation. We explore the challenges associated with single cell data, including data quality control, normalization, dimensionality reduction, clustering, and trajectory inference. Furthermore, we highlight the applications of single cell computational analysis, including the identification of novel cell states, the characterization of tumor subtypes, the discovery of biomarkers, and the prediction of therapy response. Finally, we address the future directions and potential advancements in the field, including the development of machine learning and deep learning approaches for single cell analysis. Overall, this paper aims to provide a roadmap for researchers interested in leveraging computational methods to unlock the full potential of single cell analysis in understanding cancer biology with the goal of advancing precision oncology. For this purpose, we also include a notebook that instructs on how to apply the recommended tools in the Preprocessing and Quality Control section.
Affiliation(s)
- Ernesto Paas-Oliveros
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
- Enrique Hernández-Lemus
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
  - Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
- Guillermo de Anda-Jáuregui
  - Computational Genomics Division, National Institute of Genomic Medicine, Mexico City, Mexico
  - Center for Complexity Sciences, Universidad Nacional Autónoma de México, Mexico City, Mexico
  - Investigadores por Mexico, Conahcyt, Mexico City, Mexico
|
39
|
Ibrahimi E, Lopes MB, Dhamo X, Simeon A, Shigdel R, Hron K, Stres B, D’Elia D, Berland M, Marcos-Zambrano LJ. Overview of data preprocessing for machine learning applications in human microbiome research. Front Microbiol 2023; 14:1250909. [PMID: 37869650 PMCID: PMC10588656 DOI: 10.3389/fmicb.2023.1250909] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/22/2023] [Indexed: 10/24/2023] Open
Abstract
Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not specifically account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.
Affiliation(s)
- Eliana Ibrahimi
  - Department of Biology, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Marta B. Lopes
  - Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
  - UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Xhilda Dhamo
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
  - BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, Olomouc, Czechia
- Blaž Stres
  - Department of Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, Institute of Sanitary Engineering, Ljubljana, Slovenia
  - Department of Automation, Biocybernetics and Robotics, Jožef Stefan Institute, Ljubljana, Slovenia
  - Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Domenica D’Elia
  - Department of Biomedical Sciences, National Research Council, Institute for Biomedical Technologies, Bari, Italy
- Magali Berland
  - INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
|
40
|
Salvador AC, Huda MN, Arends D, Elsaadi AM, Gacasan CA, Brockmann GA, Valdar W, Bennett BJ, Threadgill DW. Analysis of strain, sex, and diet-dependent modulation of gut microbiota reveals candidate keystone organisms driving microbial diversity in response to American and ketogenic diets. MICROBIOME 2023; 11:220. [PMID: 37784178 PMCID: PMC10546677 DOI: 10.1186/s40168-023-01588-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 12/06/2022] [Accepted: 06/01/2023] [Indexed: 10/04/2023]
Abstract
BACKGROUND: The gut microbiota is modulated by a combination of diet, host genetics, and sex effects. The magnitude of these effects and interactions among them is important to understanding inter-individual variability in gut microbiota. In a previous study, mouse strain-specific responses to American and ketogenic diets were observed along with several QTLs for metabolic traits. In the current study, we searched for genetic variants underlying differences in the gut microbiota in response to American and ketogenic diets, which are high in fat and vary in carbohydrate composition, between C57BL/6J (B6) and FVB/NJ (FVB) mouse strains. RESULTS: Genetic mapping of microbial features revealed 18 loci under the QTL model (i.e., marginal effects that are not specific to diet or sex), 12 loci under the QTL by diet model, and 1 locus under the QTL by sex model. Multiple metabolic and microbial features map to the distal part of Chr 1 and Chr 16 along with eigenvectors extracted from principal coordinate analysis of measures of β-diversity. Bilophila, Ruminiclostridium 9, and Rikenella (Chr 1) were identified as sex- and diet-independent QTL candidate keystone organisms, and Parabacteroides (Chr 16) was identified as a diet-specific candidate keystone organism in confirmatory factor analyses of traits mapping to these regions. For many microbial features, irrespective of which QTL model was used, diet or the interaction between diet and genotype was the strongest predictor of the abundance of each microbial trait. Sex, while important to the analyses, was not as strong a predictor of microbial abundances. CONCLUSIONS: These results demonstrate that sex, diet, and genetic background have different magnitudes of effects on inter-individual differences in gut microbiota. Therefore, precision nutrition through the integration of genetic variation, microbiota, and sex affecting microbiota variation will be important to predict response to diets varying in carbohydrate composition.
Affiliation(s)
- Anna C Salvador
  - Department of Molecular and Cellular Medicine, Texas A&M Health Science Center, College Station, TX, 77843, USA
  - Department of Nutrition, Texas A&M University, College Station, TX, 77843, USA
- M Nazmul Huda
  - Department of Nutrition, University of California Davis, Sacramento, CA, 95616, USA
  - Obesity and Metabolism Unit, Western Human Nutrition Research Center, USDA-ARS, Davis, CA, 95616, USA
- Danny Arends
  - Albrecht Daniel Thaer-Institut, 10115, Berlin, Germany
  - Department of Applied Sciences, Northumbria University, Newcastle Upon Tyne, UK
- Ahmed M Elsaadi
  - Department of Molecular and Cellular Medicine, Texas A&M Health Science Center, College Station, TX, 77843, USA
- C Anthony Gacasan
  - Department of Molecular and Cellular Medicine, Texas A&M Health Science Center, College Station, TX, 77843, USA
- William Valdar
  - Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
  - Lineberger Comprehensive Cancer Center, University of North Carolina at Chapel Hill, Chapel Hill, NC, 27599, USA
- Brian J Bennett
  - Department of Nutrition, University of California Davis, Sacramento, CA, 95616, USA
  - Obesity and Metabolism Unit, Western Human Nutrition Research Center, USDA-ARS, Davis, CA, 95616, USA
- David W Threadgill
  - Department of Molecular and Cellular Medicine, Texas A&M Health Science Center, College Station, TX, 77843, USA
  - Department of Nutrition, Texas A&M University, College Station, TX, 77843, USA
  - Department of Biochemistry & Biophysics, Texas A&M University, College Station, TX, 77843, USA
|
41
|
Zyla J, Papiez A, Zhao J, Qu R, Li X, Kluger Y, Polanska J, Hatzis C, Pusztai L, Marczyk M. Evaluation of zero counts to better understand the discrepancies between bulk and single-cell RNA-Seq platforms. Comput Struct Biotechnol J 2023; 21:4663-4674. [PMID: 37841335 PMCID: PMC10568495 DOI: 10.1016/j.csbj.2023.09.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/19/2023] [Revised: 09/26/2023] [Accepted: 09/27/2023] [Indexed: 10/17/2023] Open
Abstract
Recent advances in sample preparation and sequencing technology have made it possible to profile the transcriptomes of individual cells using single-cell RNA sequencing (scRNA-Seq). Compared to bulk RNA-Seq data, single-cell data often contain a higher percentage of zero reads, mainly due to lower sequencing depth per cell, which affects mostly measurements of low-expression genes. However, discrepancies between platforms are observed regardless of expression level. Using four paired datasets with multiple samples each, we investigated technical and biological factors that can contribute to this expression shift. Using two separate machine learning models we found that, in addition to expression level, RNA integrity, gene or UTR3 length, and the number of transcripts potentially also influence the occurrence of zeros. These findings could enable the development of novel analytical methods for cross-platform expression shift correction. We also identified genes and biological pathways in our diverse datasets that consistently showed differences when assessed at the single cell versus bulk level to assist in interpreting analysis across transcriptomic platforms. At the gene level, 25 genes (0.12%) were found in all datasets as discordant, but at the pathway level, 7 pathways (2.02%) showed shared enrichment in discordant genes.
Affiliation(s)
- Joanna Zyla
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice 44-100, Poland
- Anna Papiez
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice 44-100, Poland
- Jun Zhao
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT 06510, USA
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
- Rihao Qu
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT 06510, USA
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
- Xiaotong Li
  - Breast Medical Oncology, Yale Cancer Center, Yale School of Medicine, New Haven, CT 06520, USA
- Yuval Kluger
  - Computational Biology and Bioinformatics Program, Yale University, New Haven, CT 06510, USA
  - Department of Pathology, Yale School of Medicine, Yale University, New Haven, CT 06510, USA
  - Applied Mathematics Program, Yale University, New Haven, CT, USA
- Joanna Polanska
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice 44-100, Poland
- Christos Hatzis
  - Breast Medical Oncology, Yale Cancer Center, Yale School of Medicine, New Haven, CT 06520, USA
- Lajos Pusztai
  - Breast Medical Oncology, Yale Cancer Center, Yale School of Medicine, New Haven, CT 06520, USA
- Michal Marczyk
  - Department of Data Science and Engineering, Silesian University of Technology, Gliwice 44-100, Poland
  - Breast Medical Oncology, Yale Cancer Center, Yale School of Medicine, New Haven, CT 06520, USA
|
42
|
Lazzardi S, Valle F, Mazzolini A, Scialdone A, Caselle M, Osella M. Emergent statistical laws in single-cell transcriptomic data. Phys Rev E 2023; 107:044403. [PMID: 37198814 DOI: 10.1103/physreve.107.044403] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 10/21/2022] [Accepted: 03/24/2023] [Indexed: 05/19/2023]
Abstract
Large-scale data on single-cell gene expression have the potential to unravel the specific transcriptional programs of different cell types. The structure of these expression datasets suggests a similarity with several other complex systems that can be analogously described through the statistics of their basic building blocks. Transcriptomes of single cells are collections of messenger RNA abundances transcribed from a common set of genes just as books are different collections of words from a shared vocabulary, genomes of different species are specific compositions of genes belonging to evolutionary families, and ecological niches can be described by their species abundances. Following this analogy, we identify several emergent statistical laws in single-cell transcriptomic data closely similar to regularities found in linguistics, ecology, or genomics. A simple mathematical framework can be used to analyze the relations between different laws and the possible mechanisms behind their ubiquity. Importantly, treatable statistical models can be useful tools in transcriptomics to disentangle the actual biological variability from general statistical effects present in most component systems and from the consequences of the sampling process inherent to the experimental technique.
Affiliation(s)
- Silvia Lazzardi
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Filippo Valle
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Andrea Mazzolini
  - Laboratoire de Physique de l'École Normale Supérieure (PSL University), CNRS, Sorbonne Université and Université de Paris, 75005 Paris, France
- Antonio Scialdone
  - Institute of Epigenetics and Stem Cells, Helmholtz Zentrum München, Feodor-Lynen-Straße 21, 81377 München, Germany and Institute of Functional Epigenetics and Institute of Computational Biology, Helmholtz Zentrum München, Ingolstädter Landstraße 1, 85764 Neuherberg, Germany
- Michele Caselle
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
- Matteo Osella
  - Department of Physics, University of Turin and INFN, via P. Giuria 1, 10125 Turin, Italy
|
43
|
Aldirawi H, Morales FG. Univariate and Multivariate Statistical Analysis of Microbiome Data: An Overview. Appl Microbiol 2023. [DOI: 10.3390/applmicrobiol3020023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
Microbiome data are high-dimensional, sparse, compositional, and over-dispersed. Modeling such data is therefore very challenging and remains an active research area. Microbiome analysis has become a growing area of research, as microorganisms constitute a large part of life. Since many methods of microbiome data analysis have been presented, this review summarizes the challenges, the methods used, and the advantages and disadvantages of those methods, to serve as an updated guide for those in the field. This review also compares different methods of analysis to support the development of newer methods.
|
44
|
Hajihosseini M, Amini P, Saidi-Mehrabad A, Dinu I. Infants' gut microbiome data: A Bayesian Marginal Zero-inflated Negative Binomial regression model for multivariate analyses of count data. Comput Struct Biotechnol J 2023; 21:1621-1629. [PMID: 36860341 PMCID: PMC9969297 DOI: 10.1016/j.csbj.2023.02.027] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/02/2022] [Revised: 02/13/2023] [Accepted: 02/14/2023] [Indexed: 02/17/2023] Open
Abstract
The infants' gut microbiome is dynamic in nature. Literature has shown high inter-individual variability of gut microbial composition in the early years of infancy compared to adulthood. Although next-generation sequencing technologies are rapidly evolving, several statistical analysis aspects need to be addressed to capture the variability and dynamic nature of the infants' gut microbiome. In this study, we proposed a Bayesian Marginal Zero-inflated Negative Binomial (BAMZINB) model, addressing complexities associated with zero-inflation and multivariate structure of the infants' gut microbiome data. Here, we simulated 32 scenarios to compare the performance of BAMZINB with glmFit and BhGLM as the two other widely similar methods in the literature in handling zero-inflation, over-dispersion, and multivariate structure of the infants' gut microbiome. Then, we showed the performance of the BAMZINB approach on a real dataset using SKOT cohort (I and II) studies. Our simulation results showed that the BAMZINB model performed as well as those two methods in estimating the average abundance difference and had a better fit for almost all scenarios when the signal and sample size were large. Applying BAMZINB on SKOT cohorts showed remarkable changes in the average absolute abundance of specific bacteria from 9 to 18 months for infants of healthy and obese mothers. In conclusion, we recommend using the BAMZINB approach for infants' gut microbiome data taking zero-inflation and over-dispersion properties into account in multivariate analysis when comparing the average abundance difference.
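For readers unfamiliar with the distribution named in this abstract, the sketch below evaluates a generic zero-inflated negative binomial probability mass function. It is not the BAMZINB model itself, which is a Bayesian marginal regression; the parameter names (`pi` for the zero-inflation probability, `mu` for the mean, `size` for the dispersion) and the values used are assumptions chosen for illustration.

```python
import math

def nb_pmf(k, mu, size):
    """Negative binomial pmf with mean mu and dispersion parameter size."""
    log_p = (
        math.lgamma(k + size) - math.lgamma(size) - math.lgamma(k + 1)
        + size * math.log(size / (size + mu))
        + k * math.log(mu / (size + mu))
    )
    return math.exp(log_p)

def zinb_pmf(k, pi, mu, size):
    """Zero-inflated NB: a structural zero with probability pi, else an NB draw."""
    nb = nb_pmf(k, mu, size)
    return pi + (1 - pi) * nb if k == 0 else (1 - pi) * nb

# Illustrative parameter values (not taken from the paper).
pi, mu, size = 0.3, 2.0, 1.5

total = sum(zinb_pmf(k, pi, mu, size) for k in range(500))
mean = sum(k * zinb_pmf(k, pi, mu, size) for k in range(500))
print(total, mean, zinb_pmf(0, pi, mu, size), nb_pmf(0, mu, size))
```

The zero-inflation term raises the probability of observing a zero above what the negative binomial alone would give, and it scales the mean to `(1 - pi) * mu`. The small `size` value produces the over-dispersion typical of microbiome counts.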
Affiliation(s)
- Morteza Hajihosseini
- Stanford Department of Urology, Center for Academic Medicine, Palo Alto, CA 94304
- Payam Amini
- Department of Biostatistics, School of Public Health, Iran University of Medical Sciences, Tehran, Iran
- Irina Dinu
- School of Public Health, University of Alberta, Edmonton, Alberta, Canada. Correspondence to: School of Public Health, University of Alberta, 3-278 Edmonton Clinic Health Academy, 11405 - 87 Ave NW, Edmonton, Alberta T6G 1C9, Canada.
|
45
|
Degnan DJ, Stratton KG, Richardson R, Claborne D, Martin EA, Johnson NA, Leach D, Webb-Robertson BJM, Bramer LM. pmartR 2.0: A Quality Control, Visualization, and Statistics Pipeline for Multiple Omics Datatypes. J Proteome Res 2023; 22:570-576. [PMID: 36622218 DOI: 10.1021/acs.jproteome.2c00610] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Indexed: 01/10/2023]
Abstract
The pmartR (https://github.com/pmartR/pmartR) package was designed for the quality control (QC) and analysis of mass spectrometry data, tailored to specific characteristics of proteomic (isobaric or labeled), metabolomic, and lipidomic data sets. Since its initial release, the tool has been expanded to address the needs of its growing userbase and now includes QC and statistics for nuclear magnetic resonance metabolomic data, and leverages the DESeq2, edgeR, and limma-voom R packages for transcriptomic data analyses. These improvements have made progress toward a unified omics processing pipeline for ease of reporting and streamlined statistical purposes. The package's statistics and visualization capabilities have also been expanded by adding support for paired data and by integrating pmartR with the trelliscopejs R package for the quick creation of trellis displays (https://github.com/hafen/trelliscopejs). Here, we present relevant examples of each of these enhancements to pmartR and highlight how each new feature benefits the omics community.
|
46
|
Salvador AC, Huda MN, Arends D, Elsaadi AM, Gacasan AC, Brockmann GA, Valdar W, Bennett BJ, Threadgill DW. Analysis of strain, sex, and diet-dependent modulation of gut microbiota reveals candidate keystone organisms driving microbial diversity in response to American and ketogenic diets. Research Square 2023:rs.3.rs-2540322. [PMID: 36778219 PMCID: PMC9915790 DOI: 10.21203/rs.3.rs-2540322/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/05/2023]
Abstract
Background: The gut microbiota is modulated by a combination of diet, host genetics, and sex effects. The magnitude of these effects and interactions among them is important to understanding inter-individual variability in gut microbiota. In a previous study, mouse strain-specific responses to American and ketogenic diets were observed along with several QTL for metabolic traits. In the current study, we searched for genetic variants underlying differences in the gut microbiota in response to American and ketogenic diets, which are high in fat and vary in carbohydrate composition, between C57BL/6J (B6) and FVB/NJ (FVB) mouse strains. Results: Genetic mapping of microbial features revealed 18 loci under the QTL model (i.e., marginal effects that are not specific to diet or sex), 12 loci under the QTL by diet model, and 1 locus under the QTL by sex model. Multiple metabolic and microbial features map to the distal part of Chr 1 and Chr 16 along with eigenvectors extracted from principal coordinate analysis of measures of β-diversity. Bilophila, Ruminiclostridium 9, and Rikenella (Chr 1) were identified as sex- and diet-independent QTL candidate keystone organisms, and Rikenellaceae RC9 Gut Group (Chr 16) was identified as a diet-specific candidate keystone organism in confirmatory factor analyses of traits mapping to these regions. For many microbial features, irrespective of which QTL model was used, diet or the interaction between diet and a genotype were the strongest predictors of the abundance of each microbial trait. Sex, while important to the analyses, was not as strong a predictor of microbial abundances. Conclusions: These results demonstrate that sex, diet, and genetic background have different magnitudes of effects on inter-individual differences in gut microbiota. Therefore, Precision Nutrition through the integration of genetic variation, microbiota, and sex affecting microbiota variation will be important to predict response to diets varying in carbohydrate composition.
|
47
|
Shelton AO, Gold ZJ, Jensen AJ, D'Agnese E, Andruszkiewicz Allan E, Van Cise A, Gallego R, Ramón-Laca A, Garber-Yonts M, Parsons K, Kelly RP. Toward quantitative metabarcoding. Ecology 2023; 104:e3906. [PMID: 36320096 DOI: 10.1002/ecy.3906] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Received: 05/19/2022] [Revised: 07/07/2022] [Accepted: 08/23/2022] [Indexed: 12/24/2022]
Abstract
Amplicon-sequence data from environmental DNA (eDNA) and microbiome studies provide important information for ecology, conservation, management, and health. At present, amplicon-sequencing studies (known also as metabarcoding studies, in which the primary data consist of targeted, amplified fragments of DNA sequenced from many taxa in a mixture) struggle to link genetic observations to the underlying biology in a quantitative way, but many applications require quantitative information about the taxa or systems under scrutiny. As metabarcoding studies proliferate in ecology, it becomes more important to develop ways to make them quantitative to ensure that their conclusions are adequately supported. Here we link previously disparate sets of techniques for making such data quantitative, showing that the underlying polymerase chain reaction mechanism explains the observed patterns of amplicon data in a general way. By modeling the process through which amplicon-sequence data arise, rather than transforming the data post hoc, we show how to estimate the starting DNA proportions from a mixture of many taxa. We illustrate how to calibrate the model using mock communities and apply the approach to simulated data and a series of empirical examples. Our approach opens the door to improve the use of metabarcoding data in a wide range of applications in ecology, public health, and related fields.
Affiliation(s)
- Andrew Olaf Shelton
- Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, Washington, USA
- Zachary J Gold
- Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, Washington, USA; CICOES, University of Washington and Northwest Fisheries Science Center, National Marine Fisheries Service, Seattle, Washington, USA
- Alexander J Jensen
- CICOES, University of Washington and Northwest Fisheries Science Center, National Marine Fisheries Service, Seattle, Washington, USA; School of Marine and Environmental Affairs, University of Washington, Seattle, Washington, USA
- Erin D'Agnese
- School of Marine and Environmental Affairs, University of Washington, Seattle, Washington, USA
- Amy Van Cise
- North Gulf Oceanic Society, Visiting Scientist at Northwest Fisheries Science Center, National Oceanic and Atmospheric Administration, Seattle, Washington, USA
- Ramón Gallego
- School of Marine and Environmental Affairs, University of Washington, Seattle, Washington, USA; Departamento de Biologia, Universidad Autonoma de Madrid, Unidad de Genetica, Madrid, Spain
- Ana Ramón-Laca
- CICOES, University of Washington and Northwest Fisheries Science Center, National Marine Fisheries Service, Seattle, Washington, USA; School of Marine and Environmental Affairs, University of Washington, Seattle, Washington, USA
- Maya Garber-Yonts
- School of Marine and Environmental Affairs, University of Washington, Seattle, Washington, USA
- Kim Parsons
- Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, Washington, USA
- Ryan P Kelly
- School of Marine and Environmental Affairs, University of Washington, Seattle, Washington, USA
|
48
|
Yang L, Chen J. Benchmarking differential abundance analysis methods for correlated microbiome sequencing data. Brief Bioinform 2023; 24:bbac607. [PMID: 36617187 PMCID: PMC9851339 DOI: 10.1093/bib/bbac607] [Citation(s) in RCA: 11] [Impact Index Per Article: 5.5] [Received: 09/13/2022] [Revised: 11/16/2022] [Accepted: 12/10/2022] [Indexed: 01/09/2023] Open
Abstract
Differential abundance analysis (DAA) is one central statistical task in microbiome data analysis. A robust and powerful DAA tool can help identify highly confident microbial candidates for further biological validation. Current microbiome studies frequently generate correlated samples from different microbiome sampling schemes such as spatial and temporal sampling. In the past decade, a number of DAA tools for correlated microbiome data (DAA-c) have been proposed. Disturbingly, different DAA-c tools could sometimes produce quite discordant results. To recommend the best practice to the field, we performed the first comprehensive evaluation of existing DAA-c tools using real data-based simulations. Overall, the linear model-based methods LinDA, MaAsLin2 and LDM are more robust than methods based on generalized linear models. The LinDA method is the only method that maintains reasonable performance in the presence of strong compositional effects.
Affiliation(s)
- Lu Yang
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55901, USA
- Jun Chen
- Division of Computational Biology, Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN 55901, USA
|
49
|
Buyukozkan M, Benedetti E, Krumsiek J. rox: A Statistical Model for Regression with Missing Values. Metabolites 2023; 13:metabo13010127. [PMID: 36677052 PMCID: PMC9861384 DOI: 10.3390/metabo13010127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/26/2022] [Revised: 11/15/2022] [Accepted: 11/17/2022] [Indexed: 01/18/2023] Open
Abstract
High-dimensional omics datasets frequently contain missing data points, which typically occur due to concentrations below the limit of detection (LOD) of the profiling platform. The presence of such missing values significantly limits downstream statistical analysis and result interpretation. Two common techniques to deal with this issue include the removal of samples with missing values and imputation approaches that substitute the missing measurements with reasonable estimates. Both approaches, however, suffer from various shortcomings and pitfalls. In this paper, we present "rox", a novel statistical model for the analysis of omics data with missing values without the need for imputation. The model directly incorporates missing values as "low" concentrations into the calculation. We show the superiority of rox over common approaches on simulated data and on six metabolomics datasets. Fully leveraging the information contained in LOD-based missing values, rox provides a powerful tool for the statistical analysis of omics data.
|
50
|
Gan D, Li J. SCIBER: a simple method for removing batch effects from single-cell RNA-sequencing data. Bioinformatics 2023; 39:6957084. [PMID: 36548380 PMCID: PMC9848058 DOI: 10.1093/bioinformatics/btac819] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 11/27/2022] [Accepted: 12/21/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION: Integrative analysis of multiple single-cell RNA-sequencing datasets allows for more comprehensive characterizations of cell types, but systematic technical differences between datasets, known as 'batch effects', need to be removed before integration to avoid misleading interpretation of the data. Although many batch-effect-removal methods have been developed, there is still considerable room for improvement: most existing methods give only dimension-reduced data instead of expression data of individual genes, are based on computationally demanding models, and are black-box models that are difficult to interpret or tune. RESULTS: Here, we present a new batch-effect-removal method called SCIBER (Single-Cell Integrator and Batch Effect Remover) and study its performance on real datasets. SCIBER matches cell clusters across batches according to the overlap of their differentially expressed genes. As a simple algorithm that scales better to data with a large number of cells and is easy to tune, SCIBER shows comparable and sometimes better accuracy in removing batch effects on real datasets compared to the state-of-the-art methods, which are much more complicated. Moreover, SCIBER outputs expression data in the original space, that is, the expression of individual genes, which can be used directly for downstream analyses. Additionally, SCIBER is a reference-based method, which assigns one of the batches as the reference batch and keeps it untouched during the process, making it especially suitable for integrating user-generated datasets with standard reference data such as the Human Cell Atlas. AVAILABILITY AND IMPLEMENTATION: SCIBER is publicly available as an R package on CRAN: https://cran.r-project.org/web/packages/SCIBER/. A vignette is included in the CRAN R package. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dailin Gan
- Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN 46556, USA
- Jun Li
- To whom correspondence should be addressed.
|