1. Koslovsky MD. Analyzing microbiome data with taxonomic misclassification using a zero-inflated Dirichlet-multinomial model. BMC Bioinformatics 2025; 26:69. [PMID: 40016656; PMCID: PMC11869466; DOI: 10.1186/s12859-025-06078-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/08/2024] [Accepted: 02/10/2025] [Indexed: 03/01/2025] Open
Abstract
The human microbiome is the collection of microorganisms living on and inside of our bodies. A major aim of microbiome research is understanding the role microbial communities play in human health with the goal of designing personalized interventions that modulate the microbiome to treat or prevent disease. Microbiome data are challenging to analyze due to their high-dimensionality, overdispersion, and zero-inflation. Analysis is further complicated by the steps taken to collect and process microbiome samples. For example, sequencing instruments have a fixed capacity for the total number of reads delivered. It is therefore essential to treat microbial samples as compositional. Another complicating factor of modeling microbiome data is that taxa counts are subject to measurement error introduced at various stages of the measurement protocol. Advances in sequencing technology and preprocessing pipelines coupled with our growing knowledge of the human microbiome have reduced, but not eliminated, measurement error. Ignoring measurement error during analysis, though common in practice, can then lead to biased inference and curb reproducibility. We propose a Dirichlet-multinomial modeling framework for microbiome data with excess zeros and potential taxonomic misclassification. We demonstrate how accommodating taxonomic misclassification improves estimation performance and investigate differences in gut microbial composition between healthy and obese children.
2. Lutz KC, Neugent ML, Bedi T, De Nisco NJ, Li Q. A Generalized Bayesian Stochastic Block Model for Microbiome Community Detection. Stat Med 2025; 44:e10291. [PMID: 39853798; PMCID: PMC11760646; DOI: 10.1002/sim.10291] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/27/2023] [Revised: 10/02/2024] [Accepted: 11/11/2024] [Indexed: 01/26/2025]
Abstract
Advances in next-generation sequencing technology have enabled the high-throughput profiling of metagenomes and accelerated microbiome studies. Recently, there has been a rise in quantitative studies that aim to decipher the microbiome co-occurrence network and its underlying community structure based on metagenomic sequence data. Uncovering the complex microbiome community structure is essential to understanding the role of the microbiome in disease progression and susceptibility. Taxonomic abundance data generated from metagenomic sequencing technologies are high-dimensional and compositional, suffering from uneven sampling depth, over-dispersion, and zero-inflation. These characteristics often challenge the reliability of the current methods for microbiome community detection. To study the microbiome co-occurrence network and perform community detection, we propose a generalized Bayesian stochastic block model that is tailored for microbiome data analysis where the data are transformed using the recently developed modified centered-log ratio transformation. Our model also allows us to leverage taxonomic tree information using a Markov random field prior. The model parameters are jointly inferred by using Markov chain Monte Carlo sampling techniques. Our simulation study showed that the proposed approach performs better than competing methods even when taxonomic tree information is non-informative. We applied our approach to a real urinary microbiome dataset from postmenopausal women. To the best of our knowledge, this is the first time the urinary microbiome co-occurrence network structure in postmenopausal women has been studied. In summary, this statistical methodology provides a new tool for facilitating advanced microbiome studies.
Affiliation(s)
- Kevin C. Lutz
  - Peter O'Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, Texas
- Michael L. Neugent
  - Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas
- Tejasv Bedi
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, Texas
- Nicole J. De Nisco
  - Department of Biological Sciences, The University of Texas at Dallas, Richardson, Texas
  - Department of Urology, The University of Texas Southwestern Medical Center, Dallas, Texas
- Qiwei Li
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, Texas
3. Guo Y, Yu L, Guo L, Xu L, Li Q. A regularized Bayesian Dirichlet-multinomial regression model for integrating single-cell-level omics and patient-level clinical study data. Biometrics 2025; 81:ujaf005. [PMID: 39887052; PMCID: PMC11783250; DOI: 10.1093/biomtc/ujaf005] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/04/2024] [Revised: 11/02/2024] [Accepted: 01/21/2025] [Indexed: 02/01/2025]
Abstract
The abundance of various cell types can vary significantly among patients with varying phenotypes and even those with the same phenotype. Recent scientific advancements provide mounting evidence that other clinical variables, such as age, gender, and lifestyle habits, can also influence the abundance of certain cell types. However, current methods for integrating single-cell-level omics data with clinical variables are inadequate. In this study, we propose a regularized Bayesian Dirichlet-multinomial regression framework to investigate the relationship between single-cell RNA sequencing data and patient-level clinical data. Additionally, the model employs a novel hierarchical tree structure to identify such relationships at different cell-type levels. Our model successfully uncovers significant associations between specific cell types and clinical variables across three distinct diseases: pulmonary fibrosis, COVID-19, and non-small cell lung cancer. This integrative analysis provides biological insights and could potentially inform clinical interventions for various diseases.
Affiliation(s)
- Yanghong Guo
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States
- Lei Yu
  - Quantitative Biomedical Research Center, Peter O’Donnell Jr School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Lei Guo
  - Quantitative Biomedical Research Center, Peter O’Donnell Jr School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Lin Xu
  - Quantitative Biomedical Research Center, Peter O’Donnell Jr School of Public Health, The University of Texas Southwestern Medical Center, Dallas, TX 75390, United States
- Qiwei Li
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX 75080, United States
4. Huang J, Lu Y, Tian F, Ni Y. Association of body index with fecal microbiome in children cohorts with ethnic-geographic factor interaction: accurately using a Bayesian zero-inflated negative binomial regression model. mSystems 2024; 9:e0134524. [PMID: 39570024; DOI: 10.1128/msystems.01345-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2024] [Accepted: 10/24/2024] [Indexed: 11/22/2024] Open
Abstract
The exponential growth of high-throughput sequencing (HTS) data on microbial communities presents researchers with an unparalleled opportunity to delve deeper into the association of microorganisms with host phenotype. However, this growth also poses a challenge, as microbial data are complex, sparse, discrete, and prone to zero inflation. Herein, by utilizing 10 distinct count models for analyzing simulated data, we proposed an innovative Bayesian zero-inflated negative binomial (ZINB) regression model that is capable of identifying differentially abundant taxa associated with distinctive host phenotypes and quantifying the effects of covariates on these taxa. Our proposed model exhibits excellent accuracy compared with conventional Hurdle and INLA models, especially in scenarios characterized by inflation and overdispersion. Moreover, we confirm that dispersion parameters significantly affect the accuracy of model results, with defects gradually alleviating as the number of analyzed samples increases. Subsequently applying our model to amplicon data from a real multi-ethnic cohort of children, we found that only a subset of taxa were identified as having zero inflation in real data, suggesting that the prevailing understanding and processing of microbial count data in most previous microbiome studies were overly dogmatic. In practice, our pipeline of integrating bacterial differential abundance in microbiome data and relevant covariates is effective and feasible. Taken together, our method is expected to be extended to the microbiota studies of various multi-cohort populations.
IMPORTANCE: The microbiome is closely associated with physical indicators of the body, such as height, weight, age and BMI, which can be used as measures of human health. Accurately identifying which taxa in the microbiome are closely related to indicators of physical development is valuable, as these taxa can serve as microbial markers of regional child growth trajectories. The zero-inflated negative binomial (ZINB) model, a type of Bayesian generalized linear model, can be effectively applied to complex biological systems. We present an innovative ZINB regression model that is capable of identifying differentially abundant taxa associated with distinctive host phenotypes and quantifying the effects of covariates on these taxa, and demonstrate that its accuracy is superior to traditional Hurdle and INLA models. Our pipeline of integrating bacterial differential abundance in microbiome data and relevant covariates is effective and feasible.
Affiliation(s)
- Jian Huang
  - School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
  - Key Laboratory of Xinjiang Special Probiotics and Dairy Technology, Shihezi University, Shihezi, Xinjiang, China
- Yanzhuan Lu
  - School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
  - Key Laboratory of Xinjiang Special Probiotics and Dairy Technology, Shihezi University, Shihezi, Xinjiang, China
- Fengwei Tian
  - State Key Laboratory of Food Science and Resources, Jiangnan University, Wuxi, Jiangsu, China
  - School of Food Science and Technology, Jiangnan University, Wuxi, Jiangsu, China
- Yongqing Ni
  - School of Food Science and Technology, Shihezi University, Shihezi, Xinjiang, China
  - Key Laboratory of Xinjiang Special Probiotics and Dairy Technology, Shihezi University, Shihezi, Xinjiang, China
5. Sankaran K, Kodikara S, Li JJ, Cao KAL. Semisynthetic simulation for microbiome data analysis. Brief Bioinform 2024; 26:bbaf051. [PMID: 39927858; PMCID: PMC11808806; DOI: 10.1093/bib/bbaf051] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/15/2024] [Revised: 12/19/2024] [Accepted: 01/23/2025] [Indexed: 02/11/2025] Open
Abstract
High-throughput sequencing data lie at the heart of modern microbiome research. Effective analysis of these data requires careful preprocessing, modeling, and interpretation to detect subtle signals and avoid spurious associations. In this review, we discuss how simulation can serve as a sandbox to test candidate approaches, creating a setting that mimics real data while providing ground truth. This is particularly valuable for power analysis, methods benchmarking, and reliability analysis. We explain the probability, multivariate analysis, and regression concepts behind modern simulators and how different implementations make trade-offs between generality, faithfulness, and controllability. Recognizing that all simulators only approximate reality, we review methods to evaluate how accurately they reflect key properties. We also present case studies demonstrating the value of simulation in differential abundance testing, dimensionality reduction, network analysis, and data integration. Code for these examples is available in an online tutorial (https://go.wisc.edu/8994yz) that can be easily adapted to new problem settings.
Affiliation(s)
- Kris Sankaran
  - Department of Statistics, University of Wisconsin-Madison, 1300 University Ave, Madison, WI 53703, United States
- Saritha Kodikara
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Building 184/30 Royal Parade, Melbourne, VIC 3052, Australia
- Jingyi Jessica Li
  - Department of Statistics and Data Science, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, CA 90095, United States
  - Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, United States
  - Department of Biostatistics, University of California, Los Angeles, 650 Charles E. Young Dr S, Los Angeles, CA 90095, United States
- Kim-Anh Lê Cao
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Building 184/30 Royal Parade, Melbourne, VIC 3052, Australia
6. Deng L, Tang Y, Zhang X, Chen J. Structure-adaptive canonical correlation analysis for microbiome multi-omics data. Front Genet 2024; 15:1489694. [PMID: 39655222; PMCID: PMC11626081; DOI: 10.3389/fgene.2024.1489694] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2024] [Accepted: 10/31/2024] [Indexed: 12/12/2024] Open
Abstract
Sparse canonical correlation analysis (sCCA) has been a useful approach for integrating different high-dimensional datasets by finding a subset of correlated features that explain the most correlation in the data. In the context of microbiome studies, investigators are always interested in knowing how the microbiome interacts with the host at different molecular levels such as genome, methylome, transcriptome, metabolome and proteome. sCCA provides a simple approach for exploiting the correlation structure among multiple omics data and finding a set of correlated omics features, which could contribute to understanding the host-microbiome interaction. However, existing sCCA methods do not address compositionality, and their application to microbiome data is thus not optimal. This paper proposes a new sCCA framework for integrating microbiome data with other high-dimensional omics data, accounting for the compositional nature of microbiome sequencing data. It also allows integrating prior structure information such as the grouping structure among bacterial taxa by imposing a "soft" constraint on the coefficients through varying penalization strength. As a result, the method provides significant improvement when the structure is informative while maintaining robustness against a misspecified structure. Through extensive simulation studies and real data analysis, we demonstrate the superiority of the proposed framework over the state-of-the-art approaches.
Affiliation(s)
- Linsui Deng
  - School of Data Science, The Chinese University of Hong Kong, Shenzhen, China
- Yanlin Tang
  - School of Statistics, East China Normal University, Shanghai, China
- Xianyang Zhang
  - Department of Statistics, Texas A&M University, College Station, TX, United States
- Jun Chen
  - Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, United States
7. Guo Y, Yu L, Guo L, Xu L, Li Q. A Regularized Bayesian Dirichlet-multinomial Regression Model for Integrating Single-cell-level Omics and Patient-level Clinical Study Data. bioRxiv 2024:2024.06.04.597391. [PMID: 38895417; PMCID: PMC11185671; DOI: 10.1101/2024.06.04.597391] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/21/2024]
Abstract
The abundance of various cell types can vary significantly among patients with varying phenotypes and even those with the same phenotype. Recent scientific advancements provide mounting evidence that other clinical variables, such as age, gender, and lifestyle habits, can also influence the abundance of certain cell types. However, current methods for integrating single-cell-level omics data with clinical variables are inadequate. In this study, we propose a regularized Bayesian Dirichlet-multinomial regression framework to investigate the relationship between single-cell RNA sequencing data and patient-level clinical data. Additionally, the model employs a novel hierarchical tree structure to identify such relationships at different cell-type levels. Our model successfully uncovers significant associations between specific cell types and clinical variables across three distinct diseases: pulmonary fibrosis, COVID-19, and non-small cell lung cancer. This integrative analysis provides biological insights and could potentially inform clinical interventions for various diseases.
Affiliation(s)
- Yanghong Guo
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, Texas, U.S.A
- Lei Yu
  - Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, Texas, U.S.A
- Lei Guo
  - Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, Texas, U.S.A
- Lin Xu
  - Quantitative Biomedical Research Center, Peter O’Donnell Jr. School of Public Health, The University of Texas Southwestern Medical Center, Dallas, Texas, U.S.A
- Qiwei Li
  - Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, Texas, U.S.A
8. Chi J, Ye J, Zhou Y. A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data. Front Microbiol 2024; 15:1394204. [PMID: 38873138; PMCID: PMC11173601; DOI: 10.3389/fmicb.2024.1394204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 05/20/2024] [Indexed: 06/15/2024] Open
Abstract
Motivation: High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis.
Results: We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.
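To make the ZIGP building block named in this abstract concrete, here is a minimal sketch of a zero-inflated generalized Poisson probability mass function. It assumes Consul's parameterization of the generalized Poisson (rate `theta` > 0, dispersion `lam` in [0, 1)) and a constant excess-zero probability `pi`. The function names are illustrative, and the sketch does not reproduce the paper's factor model or its link between the rate and the zero probability.

```python
import math


def gp_log_pmf(y, theta, lam):
    """Log pmf of the generalized Poisson (assumed Consul form):
    P(Y=y) = theta * (theta + lam*y)**(y-1) * exp(-theta - lam*y) / y!
    """
    if y < 0 or theta <= 0 or not (0 <= lam < 1):
        raise ValueError("need y >= 0, theta > 0 and 0 <= lam < 1")
    return (math.log(theta)
            + (y - 1) * math.log(theta + lam * y)
            - theta - lam * y
            - math.lgamma(y + 1))


def zigp_pmf(y, theta, lam, pi):
    """Zero-inflated generalized Poisson pmf with excess-zero probability pi."""
    base = (1.0 - pi) * math.exp(gp_log_pmf(y, theta, lam))
    # Zeros come from two sources: the point mass and the count component.
    return pi + base if y == 0 else base
```

Setting `lam = 0` and `pi = 0` recovers the ordinary Poisson pmf, which is a quick sanity check on the parameterization; `lam > 0` inflates the variance relative to the mean, and `pi > 0` adds zeros beyond what the count component produces.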
Affiliation(s)
- Jinling Chi
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
  - School of Mathematical Sciences, Heilongjiang University, Harbin, China
9. Zhang S, Fang H, Hu T. fastCCLasso: a fast and efficient algorithm for estimating correlation matrix from compositional data. Bioinformatics 2024; 40:btae314. [PMID: 38730540; PMCID: PMC11127107; DOI: 10.1093/bioinformatics/btae314] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/11/2024] [Revised: 04/21/2024] [Accepted: 05/09/2024] [Indexed: 05/13/2024] Open
Abstract
MOTIVATION: The composition and structure of microbial communities on the body surface are closely related to human health. The interaction relationship among microbes can help us understand the formation of the microecological environment and the biological mechanism by which microorganisms influence host health. With the help of high-throughput sequencing technologies, microbial abundances in a natural environment can be directly measured without the isolation of microorganisms in culture. Sequencing experiments in microbiome studies can measure the relative abundance of microbes, which is called compositional data. Although there are already many methods for correlation analysis for compositional data, the computation time or accuracy still needs to be improved for current microbiome studies.
RESULTS: We develop a fast and efficient algorithm, called fastCCLasso, based on a penalized weighted least squares for inferring the correlation structure of microbes from compositional data in microbiome studies. We perform a large number of numerical experiments and the simulation results show that fastCCLasso outperforms its competitors in edge detection for inferring the correlation network. We also apply fastCCLasso for estimating microbial networks in microbiome studies and fastCCLasso provides a conservative network with comparable false discovery counts that are derived from shuffled data.
AVAILABILITY AND IMPLEMENTATION: FastCCLasso is open source and freely available from https://github.com/ShenZhang-Statistics/fastCCLasso under GNU LGPL v3.
Affiliation(s)
- Shen Zhang
  - School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
- Huaying Fang
  - Beijing Advanced Innovation Center for Imaging Theory and Technology, Capital Normal University, Beijing 100048, China
  - Academy for Multidisciplinary Studies, Capital Normal University, Beijing 100048, China
- Tao Hu
  - School of Mathematical Sciences, Capital Normal University, Beijing 100048, China
10. Chi J, Ye J, Zhou Y. Mapping QTL controlling count traits with excess zeros and ones using a zero-and-one-inflated generalized Poisson regression model. Biom J 2024; 66:e2200342. [PMID: 38616336; DOI: 10.1002/bimj.202200342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 11/26/2023] [Accepted: 12/08/2023] [Indexed: 04/16/2024]
Abstract
Research on quantitative trait locus (QTL) mapping of count data has attracted wide attention. Problems that frequently arise in applied research, including overdispersion and excess zeros and ones, limit the application of the conventional Poisson model in the analysis of count phenotypes. In this article, a novel model, that is, the zero-and-one-inflated generalized Poisson (ZOIGP) model, is proposed to deal with these problems. Based on the proposed model, a score test is performed for the inflation parameter, in which the ZOIGP model with a constant proportion of excess zeros and ones is compared with a standard generalized Poisson model. To illustrate the practicability of the ZOIGP model, we extend it to the QTL interval mapping application that underpins count phenotype with excess zeros and excess ones. The genetic effects are estimated utilizing the expectation-maximization algorithm embedded with the Newton-Raphson algorithm, and the genome-wide scan and likelihood ratio test are performed to map and test the potential QTLs. The statistical properties exhibited by the proposed method are investigated through simulation. Finally, a real data analysis example is used to illustrate the utility of the proposed method for QTL mapping.
Affiliation(s)
- Jinling Chi
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
  - School of Mathematical Sciences, Heilongjiang University, Harbin, China
11. Koslovsky MD. A Bayesian zero-inflated Dirichlet-multinomial regression model for multivariate compositional count data. Biometrics 2023; 79:3239-3251. [PMID: 36896642; DOI: 10.1111/biom.13853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 02/23/2023] [Indexed: 03/11/2023]
Abstract
The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome dataset are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other datasets.
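As a rough illustration of the kind of data this abstract describes, the sketch below simulates compositional counts with excess zeros by normalizing independent gamma draws, some of which are forced to zero, and then drawing multinomial counts. This is an assumed, simplified construction for intuition only: the function name `sample_zidm`, its parameters, and the rule for handling a draw in which every taxon is absent are hypothetical, and it is not the paper's R package or its exact model.

```python
import numpy as np


def sample_zidm(n_samples, depth, alpha, zero_prob, rng=None):
    """Simulate counts from a simple zero-inflated Dirichlet-multinomial.

    alpha     : positive concentration parameters, one per taxon
    zero_prob : probability that a taxon is structurally absent in a sample
    depth     : total number of reads per sample (fixed sequencing depth)
    """
    rng = np.random.default_rng(rng)
    alpha = np.asarray(alpha, dtype=float)
    counts = np.zeros((n_samples, alpha.size), dtype=np.int64)
    for i in range(n_samples):
        present = rng.random(alpha.size) >= zero_prob
        if not present.any():
            # Assumption: keep one randomly chosen taxon so probabilities are defined.
            present[rng.integers(alpha.size)] = True
        # Normalized gammas give a Dirichlet draw over the taxa that are present.
        gammas = np.maximum(rng.gamma(shape=alpha, scale=1.0), 1e-300)
        gammas = np.where(present, gammas, 0.0)
        probs = gammas / gammas.sum()
        counts[i] = rng.multinomial(depth, probs)
    return counts
```

For example, `sample_zidm(50, 1000, [1, 2, 3, 4], 0.3, rng=0)` returns a 50-by-4 integer array whose rows each sum to 1000. Absent taxa always receive zero counts, while present taxa can still show sampling zeros at low depth, which is the distinction a zero-inflated model tries to separate.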
Affiliation(s)
- Matthew D Koslovsky
  - Department of Statistics, Colorado State University, Fort Collins, Colorado, USA
12. Ibrahimi E, Lopes MB, Dhamo X, Simeon A, Shigdel R, Hron K, Stres B, D’Elia D, Berland M, Marcos-Zambrano LJ. Overview of data preprocessing for machine learning applications in human microbiome research. Front Microbiol 2023; 14:1250909. [PMID: 37869650; PMCID: PMC10588656; DOI: 10.3389/fmicb.2023.1250909] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/22/2023] [Indexed: 10/24/2023] Open
Abstract
Although metagenomic sequencing is now the preferred technique to study microbiome-host interactions, analyzing and interpreting microbiome sequencing data presents challenges primarily attributed to the statistical specificities of the data (e.g., sparse, over-dispersed, compositional, inter-variable dependency). This mini review explores preprocessing and transformation methods applied in recent human microbiome studies to address microbiome data analysis challenges. Our results indicate a limited adoption of transformation methods targeting the statistical characteristics of microbiome sequencing data. Instead, there is a prevalent usage of relative and normalization-based transformations that do not specifically account for the specific attributes of microbiome data. The information on preprocessing and transformations applied to the data before analysis was incomplete or missing in many publications, leading to reproducibility concerns, comparability issues, and questionable results. We hope this mini review will provide researchers and newcomers to the field of human microbiome research with an up-to-date point of reference for various data transformation tools and assist them in choosing the most suitable transformation method based on their research questions, objectives, and data characteristics.
Affiliation(s)
- Eliana Ibrahimi
  - Department of Biology, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Marta B. Lopes
  - Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
  - UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal
- Xhilda Dhamo
  - Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
- Andrea Simeon
  - BioSense Institute, University of Novi Sad, Novi Sad, Serbia
- Rajesh Shigdel
  - Department of Clinical Science, University of Bergen, Bergen, Norway
- Karel Hron
  - Department of Mathematical Analysis and Applications of Mathematics, Faculty of Science, Palacký University Olomouc, Olomouc, Czechia
- Blaž Stres
  - Department of Catalysis and Chemical Reaction Engineering, National Institute of Chemistry, Ljubljana, Slovenia
  - Faculty of Civil and Geodetic Engineering, Institute of Sanitary Engineering, Ljubljana, Slovenia
  - Department of Automation, Biocybernetics and Robotics, Jožef Stefan Institute, Ljubljana, Slovenia
  - Department of Animal Science, Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Domenica D’Elia
  - Department of Biomedical Sciences, National Research Council, Institute for Biomedical Technologies, Bari, Italy
- Magali Berland
  - INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France
- Laura Judith Marcos-Zambrano
  - Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
13. Fu J, Koslovsky MD, Neophytou AM, Vannucci M. A Bayesian joint model for compositional mediation effect selection in microbiome data. Stat Med 2023. [PMID: 37173609; DOI: 10.1002/sim.9764] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/22/2022] [Revised: 04/17/2023] [Accepted: 04/26/2023] [Indexed: 05/15/2023]
Abstract
Analyzing multivariate count data generated by high-throughput sequencing technology in microbiome research studies is challenging due to the high-dimensional and compositional structure of the data and overdispersion. In practice, researchers are often interested in investigating how the microbiome may mediate the relation between an assigned treatment and an observed phenotypic response. Existing approaches designed for compositional mediation analysis are unable to simultaneously determine the presence of direct effects, relative indirect effects, and overall indirect effects, while quantifying their uncertainty. We propose a formulation of a Bayesian joint model for compositional data that allows for the identification, estimation, and uncertainty quantification of various causal estimands in high-dimensional mediation analysis. We conduct simulation studies and compare our method's mediation effects selection performance with existing methods. Finally, we apply our method to a benchmark data set investigating the sub-therapeutic antibiotic treatment effect on body weight in early-life mice.
Affiliation(s)
- Jingyan Fu
- Department of Statistics, Rice University, Houston, Texas, USA
| | - Matthew D Koslovsky
- Department of Statistics, Colorado State University, Fort Collins, Colorado, USA
| | - Andreas M Neophytou
- Department of Environmental & Radiological Health Sciences, Colorado State University, Fort Collins, Colorado, USA
| | - Marina Vannucci
- Department of Statistics, Rice University, Houston, Texas, USA
|
14
|
Wrobel J, Harris C, Vandekar S. Statistical Analysis of Multiplex Immunofluorescence and Immunohistochemistry Imaging Data. Methods Mol Biol 2023; 2629:141-168. [PMID: 36929077 DOI: 10.1007/978-1-0716-2986-4_8] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Indexed: 03/18/2023]
Abstract
Advances in multiplexed single-cell immunofluorescence (mIF) and multiplex immunohistochemistry (mIHC) imaging technologies have enabled the analysis of cell-to-cell spatial relationships that promise to revolutionize our understanding of tissue-based diseases and autoimmune disorders. Multiplex images are collected as multichannel TIFF files; then denoised, segmented to identify cells and nuclei, normalized across slides with protein markers to correct for batch effects, and phenotyped; and then tissue composition and spatial context at the cellular level are analyzed. This chapter discusses methods and software infrastructure for image processing and statistical analysis of mIF/mIHC data.
Affiliation(s)
- Julia Wrobel
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Coleman Harris
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
| | - Simon Vandekar
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
|
15
|
Jiang R, Zhan X, Wang T. A Flexible Zero-Inflated Poisson-Gamma Model with Application to Microbiome Sequence Count Data. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2151447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Affiliation(s)
- Roulan Jiang
- Center for Statistical Science and Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
| | - Xiang Zhan
- Department of Biostatistics, School of Public Health, Beijing International Center for Mathematical Research and Center for Statistical Science, Peking University, Beijing 100871, China
| | - Tianying Wang
- Center for Statistical Science and Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
|
16
|
Ye P, Qiao X, Tang W, Wang C, He H. Testing latent class of subjects with structural zeros in negative binomial models with applications to gut microbiome data. Stat Methods Med Res 2022; 31:2237-2254. [PMID: 35899309 DOI: 10.1177/09622802221115881] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2022]
Abstract
Human microbiome research has become a hotspot in health and medical research in the past decade due to the rapid development of modern high-throughput sequencing technologies. Typical data in a microbiome study, consisting of operational taxonomic unit counts, may exhibit over-dispersion and/or structural zeros. In such cases, negative binomial models can be applied to address over-dispersion, while zero-inflated negative binomial models can be applied to address both issues. In practice, it is essential to know whether the data are zero-inflated before choosing between the two, because zero-inflated negative binomial models may be unnecessarily complex and difficult to interpret, or may even suffer from convergence issues, if there is no zero-inflation in the data. On the other hand, negative binomial models may yield invalid inferences if the data do exhibit excess zeros. In this paper, we develop a new test for detecting zero-inflation resulting from a latent class of subjects with structural zeros in a negative binomial regression model by directly comparing the number of observed zeros with the number expected under the negative binomial regression model. A closed form of the test statistic and its asymptotic properties are derived based on estimating equations. Intensive simulation studies are conducted to investigate the performance of the new test and compare it with the classical Wald, likelihood ratio, and score tests. The tests are also applied to human gut microbiome data to test for a latent class in microbial genera.
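The idea of comparing observed zeros with the number expected under a negative binomial model can be illustrated with a minimal sketch. This is not the authors' test statistic: it assumes an intercept-only model, a crude method-of-moments dispersion estimate, and no standard error, and the function names and toy counts are hypothetical.

```python
import math
from statistics import mean, variance


def nb_zero_probability(mu, size):
    """P(Y = 0) for a negative binomial with mean `mu` and dispersion `size`."""
    if mu == 0:
        return 1.0
    if math.isinf(size):
        # Poisson limit when there is no overdispersion.
        return math.exp(-mu)
    return (size / (size + mu)) ** size


def moment_size(counts):
    """Method-of-moments dispersion from Var(Y) = mu + mu^2 / size (assumption)."""
    m = mean(counts)
    v = variance(counts)
    if v <= m:
        return math.inf  # no evidence of overdispersion
    return m * m / (v - m)


def zero_count_check(counts):
    """Return (observed zeros, zeros expected under the fitted NB)."""
    mu = mean(counts)
    size = moment_size(counts)
    observed = sum(1 for y in counts if y == 0)
    expected = len(counts) * nb_zero_probability(mu, size)
    return observed, expected


# Toy counts for a single taxon (made up for illustration).
counts = [0, 0, 0, 0, 0, 0, 4, 6, 5, 5]
observed, expected = zero_count_check(counts)
print(f"observed zeros = {observed}, expected under NB = {expected:.2f}")
```

A large gap between the two numbers only hints at zero-inflation; a formal conclusion would require the sampling distribution of the difference, which the paper derives and this sketch does not.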
Affiliation(s)
- Peng Ye
- School of Statistics, University of International Business and Economics, Beijing, China
- Department of Epidemiology, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, USA
| | - Xinhui Qiao
- School of Statistics, University of International Business and Economics, Beijing, China
| | - Wan Tang
- Department of Biostatistics and Data Science, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, USA
| | - Chunyi Wang
- Ruijin Hospital, School of Medicine, Shanghai Jiao Tong University, Shanghai, China
| | - Hua He
- Department of Epidemiology, School of Public Health and Tropical Medicine, Tulane University, New Orleans, LA, USA
|
17
|
Lutz KC, Jiang S, Neugent ML, De Nisco NJ, Zhan X, Li Q. A Survey of Statistical Methods for Microbiome Data Analysis. Frontiers in Applied Mathematics and Statistics 2022; 8:884810. [PMID: 39575140 PMCID: PMC11581570 DOI: 10.3389/fams.2022.884810] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/24/2024]
Abstract
In the last decade, numerous statistical methods have been developed for analyzing microbiome data generated from high-throughput next-generation sequencing technology. Microbiome data are typically characterized by zero inflation, overdispersion, high dimensionality, and sample heterogeneity. Three popular areas of interest in microbiome research requiring statistical methods that can account for the characterizations of microbiome data include detecting differentially abundant taxa across phenotype groups, identifying associations between the microbiome and covariates, and constructing microbiome networks to characterize ecological associations of microbes. These three areas are referred to as differential abundance analysis, integrative analysis, and network analysis, respectively. In this review, we highlight available statistical methods for differential abundance analysis, integrative analysis, and network analysis that have greatly advanced microbiome research. In addition, we discuss each method's motivation, modeling framework, and application.
Affiliation(s)
- Kevin C. Lutz
- Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, United States
| | - Shuang Jiang
- Department of Statistical Science, Southern Methodist University, Dallas, TX, United States
- Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Michael L. Neugent
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, TX, United States
| | - Nicole J. De Nisco
- Department of Biological Sciences, The University of Texas at Dallas, Richardson, TX, United States
| | - Xiaowei Zhan
- Department of Population and Data Sciences, The University of Texas Southwestern Medical Center, Dallas, TX, United States
| | - Qiwei Li
- Department of Mathematical Sciences, The University of Texas at Dallas, Richardson, TX, United States
|
18
|
An Overview of Modern Applications of Negative Binomial Modelling in Ecology and Biodiversity. Diversity 2022. [DOI: 10.3390/d14050320] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022]
Abstract
Negative binomial modelling is one of the most commonly used statistical tools for analysing count data in ecology and biodiversity research. This is not surprising given the prevalence of overdispersion (i.e., evidence that the variance is greater than the mean) in many biological and ecological studies. Indeed, overdispersion is often indicative of some form of biological aggregation process (e.g., when species or communities cluster in groups). If overdispersion is ignored, the precision of model parameters can be severely overestimated and can result in misleading statistical inference. In this article, we offer some insight as to why the negative binomial distribution is becoming, and arguably should become, the default starting distribution (as opposed to assuming Poisson counts) for analysing count data in ecology and biodiversity research. We begin with an overview of traditional uses of negative binomial modelling, before examining several modern applications and opportunities in modern ecology/biodiversity where negative binomial modelling is playing a critical role, from generalisations based on exploiting its Poisson-gamma mixture formulation in species distribution models and occurrence data analysis, to estimating animal abundance in negative binomial N-mixture models, and biodiversity measures via rank abundance distributions. Comparisons to other common models for handling overdispersion on real data are provided. We also address the important issue of software, and conclude with a discussion of future directions for analysing ecological and biological data with negative binomial models. In summary, we hope this overview will stimulate the use of negative binomial modelling as a starting point for the analysis of count data in ecology and biodiversity studies.
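The Poisson-gamma mixture formulation mentioned in this abstract can be checked numerically. The sketch below is illustrative only and is not taken from the article: the parameter values, function names, and the choice of Knuth's multiplication method for Poisson sampling are assumptions. It draws a gamma-distributed rate with mean `mu` and shape `size`, then a Poisson count given that rate, so the counts should show variance close to `mu + mu**2 / size` rather than `mu`.

```python
import math
import random
from statistics import mean, variance


def sample_poisson(lam, rng):
    """Knuth's multiplication method; adequate for the modest rates used here."""
    threshold = math.exp(-lam)
    k, prod = 0, rng.random()
    while prod > threshold:
        k += 1
        prod *= rng.random()
    return k


def sample_poisson_gamma(mu, size, n, seed=0):
    """Draw n counts from a Poisson whose rate is Gamma(shape=size, scale=mu/size)."""
    rng = random.Random(seed)
    scale = mu / size  # gamma mean = shape * scale = mu
    return [sample_poisson(rng.gammavariate(size, scale), rng) for _ in range(n)]


mu, size = 3.0, 0.5  # hypothetical mean and dispersion
draws = sample_poisson_gamma(mu, size, n=5000)
print("sample mean     :", round(mean(draws), 2))
print("sample variance :", round(variance(draws), 2))
print("NB variance     :", mu + mu ** 2 / size)  # mu + mu^2 / size
```

A sample variance well above the sample mean is the overdispersion signature the abstract describes; a Poisson model fitted to such counts would understate the uncertainty.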
|
19
|
Liu T, Xu P, Du Y, Lu H, Zhao H, Wang T. MZINBVA: variational approximation for multilevel zero-inflated negative-binomial models for association analysis in microbiome surveys. Brief Bioinform 2021; 23:6409694. [PMID: 34718406 DOI: 10.1093/bib/bbab443] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/21/2021] [Revised: 09/11/2021] [Accepted: 09/28/2021] [Indexed: 01/02/2023] Open
Abstract
As our understanding of the microbiome has expanded, so has the recognition of its critical role in human health and disease, thereby emphasizing the importance of testing whether microbes are associated with environmental factors or clinical outcomes. However, many of the fundamental challenges that concern microbiome surveys arise from statistical and experimental design issues, such as the sparse and overdispersed nature of microbiome count data and the complex correlation structure among samples. For example, in the Human Microbiome Project (HMP) dataset, the repeated observations across time points (level 1) are nested within body sites (level 2), which are further nested within subjects (level 3). Therefore, there is a great need for the development of specialized and sophisticated statistical tests. In this paper, we propose multilevel zero-inflated negative-binomial models for association analysis in microbiome surveys. We develop a variational approximation method for maximum likelihood estimation and inference. It uses optimization, rather than sampling, to approximate the log-likelihood and compute parameter estimates, provides a robust estimate of the covariance of the parameter estimates, and constructs a Wald-type test statistic for association testing. We evaluate and demonstrate the performance of our method using extensive simulation studies and an application to the HMP dataset. We have developed an R package, MZINBVA, to implement the proposed method, which is available from the GitHub repository https://github.com/liudoubletian/MZINBVA.
Affiliation(s)
- Tiantian Liu
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
| | - Peirong Xu
- Department of Breast Surgery, Renji Hospital, School of Medicine, Shanghai Jiao Tong University, 200127, Shanghai, China
| | - Yueyao Du
- Department of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA
- MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
| | - Hui Lu
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, 60 College Street, New Haven, CT 06520, USA
| | - Tao Wang
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan RD, 200240, Shanghai, China
|
20
|
Challenges and Opportunities in the Statistical Analysis of Multiplex Immunofluorescence Data. Cancers (Basel) 2021; 13:cancers13123031. [PMID: 34204319 PMCID: PMC8233801 DOI: 10.3390/cancers13123031] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 05/14/2021] [Revised: 06/11/2021] [Accepted: 06/14/2021] [Indexed: 12/21/2022] Open
Abstract
Simple Summary: Immune modulation is considered a hallmark of cancer initiation and progression, and has offered promising opportunities for therapeutic manipulation. Multiplex immunofluorescence (mIF) technology has enabled the tumor immune microenvironment (TIME) to be studied at an increased scale, in terms of both the number of markers and the number of samples. Another benefit of mIF technology is the ability to measure not only the abundance but also the spatial location of multiple cell types within a tissue sample simultaneously, allowing for assessment of the co-localization of different types of immune markers. Thus, the use of mIF technologies has enabled researchers to characterize patient, clinical, and tumor characteristics in the hope of identifying patients who might benefit from immunotherapy treatments. In this review we outline some of the challenges and opportunities in the statistical analyses of mIF data to study the TIME.
Abstract: Immune modulation is considered a hallmark of cancer initiation and progression. The recent development of immunotherapies has ushered in a new era of cancer treatment. These therapeutics have led to revolutionary breakthroughs; however, the efficacy of immunotherapy has been modest and is often restricted to a subset of patients. Hence, identification of which cancer patients will benefit from immunotherapy is essential. Multiplex immunofluorescence (mIF) microscopy allows for the assessment and visualization of the tumor immune microenvironment (TIME). The data output following image and machine learning analyses for cell segmenting and phenotyping consists of the following information for each tumor sample: the number of positive cells for each marker and phenotype(s) of interest, number of total cells, percent of positive cells for each marker, and spatial locations for all measured cells. There are many challenges in the analysis of mIF data, including many tissue samples with zero positive cells or “zero-inflated” data, repeated measurements from multiple TMA cores or tissue slides per subject, and spatial analyses to determine the level of clustering and co-localization between the cell types in the TIME. In this review paper, we will discuss the challenges in the statistical analysis of mIF data and opportunities for further research.
|
21
|
Rong R, Jiang S, Xu L, Xiao G, Xie Y, Liu DJ, Li Q, Zhan X. MB-GAN: Microbiome Simulation via Generative Adversarial Network. Gigascience 2021; 10:giab005. [PMID: 33543271 PMCID: PMC7931821 DOI: 10.1093/gigascience/giab005] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 09/22/2020] [Revised: 12/15/2020] [Accepted: 01/14/2021] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND: Trillions of microbes inhabit the human body and have a profound effect on human health. The recent development of metagenome-wide association studies and other quantitative analysis methods has accelerated the discovery of associations between the human microbiome and diseases. To assess the strengths and limitations of these analytical tools, simulating realistic microbiome datasets is critically important. However, simulating realistic microbiome data is challenging because it is difficult to model their correlation structure using explicit statistical models. RESULTS: To address the challenge of simulating realistic microbiome data, we designed a novel simulation framework termed MB-GAN, using a generative adversarial network (GAN) and methodological advances from the deep learning community. MB-GAN can automatically learn from given microbial abundances and compute simulated abundances that are indistinguishable from them. In practice, MB-GAN showed the following advantages. First, MB-GAN avoids explicit statistical modeling assumptions, and it only requires real datasets as inputs. Second, unlike traditional GANs, MB-GAN is easily applicable and can converge efficiently. CONCLUSIONS: By applying MB-GAN to a case-control gut microbiome study of 396 samples, we demonstrated that the simulated data and the original data had similar first-order and second-order properties, including sparsity, diversities, and taxa-taxa correlations. These advantages make MB-GAN suitable for further microbiome methodology development where high-fidelity microbiome data are needed.
Affiliation(s)
- Ruichen Rong
- University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
| | - Shuang Jiang
- University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
- Southern Methodist University, Department of Statistical Science, 3225 Daniel Ave, Dallas, TX 75275, USA
| | - Lin Xu
- University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
| | - Guanghua Xiao
- University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
| | - Yang Xie
- University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
| | - Dajiang J Liu
- Pennsylvania State University, Department of Public Health Sciences, 700 HMC Crescent Road, Hershey, PA 17033, USA
| | - Qiwei Li
- University of Texas at Dallas, Department of Mathematical Sciences, FN32 800 West Campbell Road, Richardson, TX 75080, USA
| | - Xiaowei Zhan
- University of Texas Southwestern Medical Center, Quantitative Biomedical Research Center, Department of Population and Data Sciences, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
- University of Texas Southwestern Medical Center, Center for the Genetics of Host Defense, 5323 Harry Hines Blvd, Dallas, TX 75390, USA
|