1
|
Koslovsky MD. A Bayesian zero-inflated Dirichlet-multinomial regression model for multivariate compositional count data. Biometrics 2023; 79:3239-3251. [PMID: 36896642 DOI: 10.1111/biom.13853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/25/2022] [Accepted: 02/23/2023] [Indexed: 03/11/2023]
Abstract
The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome dataset are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other datasets.
Collapse
Affiliation(s)
- Matthew D Koslovsky
- Department of Statistics, Colorado State University, Fort Collins, Colorado, USA
| |
Collapse
|
2
|
Wassan JT, Wang H, Zheng H. Developing a New Phylogeny-Driven Random Forest Model for Functional Metagenomics. IEEE Trans Nanobioscience 2023; 22:763-770. [PMID: 37279136 DOI: 10.1109/tnb.2023.3283462] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Metagenomics is an unobtrusive science linking microbial genes to biological functions or environmental states. Classifying microbial genes into their functional repertoire is an important task in the downstream analysis of Metagenomic studies. The task involves Machine Learning (ML) based supervised methods to achieve good classification performance. Random Forest (RF) has been applied rigorously to microbial gene abundance profiles, mapping them to functional phenotypes. The current research targets tuning RF by the evolutionary ancestry of microbial phylogeny, developing a Phylogeny-RF model for functional classification of metagenomes. This method facilitates capturing the effects of phylogenetic relatedness in an ML classifier itself rather than just applying a supervised classifier over the raw abundances of microbial genes. The idea is rooted in the fact that closely related microbes by phylogeny are highly correlated and tend to have similar genetic and phenotypic traits. Such microbes behave similarly; and hence tend to be selected together, or one of these could be dropped from the analysis, to improve the ML process. The proposed Phylogeny-RF algorithm has been compared with state-of-the-art classification methods including RF and the phylogeny-aware methods of MetaPhyl and PhILR, using three real-world 16S rRNA metagenomic datasets. It has been observed that the proposed method not only achieved significantly better performance than the traditional RF model but also performed better than the other phylogeny-driven benchmarks (p < 0.05). For example, Phylogeny-RF attained a highest AUC of 0.949 and Kappa of 0.891 over soil microbiomes in comparison to other benchmarks.
Collapse
|
3
|
LeBlanc P, Ma L. Microbiome subcommunity learning with logistic-tree normal latent Dirichlet allocation. Biometrics 2023; 79:2321-2332. [PMID: 36222326 PMCID: PMC10090221 DOI: 10.1111/biom.13772] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2021] [Accepted: 09/26/2022] [Indexed: 11/28/2022]
Abstract
Mixed-membership (MM) models such as latent Dirichlet allocation (LDA) have been applied to microbiome compositional data to identify latent subcommunities of microbial species. These subcommunities are informative for understanding the biological interplay of microbes and for predicting health outcomes. However, microbiome compositions typically display substantial cross-sample heterogeneities in subcommunity compositions-that is, the variability in the proportions of microbes in shared subcommunities across samples-which is not accounted for in prior analyses. As a result, LDA can produce inference, which is highly sensitive to the specification of the number of subcommunities and often divides a single subcommunity into multiple artificial ones. To address this limitation, we incorporate the logistic-tree normal (LTN) model into LDA to form a new MM model. This model allows cross-sample variation in the composition of each subcommunity around some "centroid" composition that defines the subcommunity. Incorporation of auxiliary Pólya-Gamma variables enables a computationally efficient collapsed blocked Gibbs sampler to carry out Bayesian inference under this model. By accounting for such heterogeneity, our new model restores the robustness of the inference in the specification of the number of subcommunities and allows meaningful subcommunities to be identified.
Collapse
Affiliation(s)
- Patrick LeBlanc
- Department of Statistical Sciences, Duke University, Durham, North Carolina, USA
| | - Li Ma
- Department of Statistical Sciences, Duke University, Durham, North Carolina, USA
- Department of Biostatistics and Bioinformatics, Duke University Medical School, Durham, North Carolina, USA
| |
Collapse
|
4
|
Clarke TH, Greco C, Brinkac L, Nelson KE, Singh H. MPrESS: An R-Package for Accurately Predicting Power for Comparisons of 16S rRNA Microbiome Taxa Distributions including Simulation by Dirichlet Mixture Modeling. Microorganisms 2023; 11:1166. [PMID: 37317139 DOI: 10.3390/microorganisms11051166] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/27/2023] [Revised: 04/24/2023] [Accepted: 04/27/2023] [Indexed: 06/16/2023] Open
Abstract
Deep sequencing has revealed that the 16S rRNA gene composition of the human microbiome can vary between populations. However, when existing data are insufficient to address the desired study questions due to limited sample sizes, Dirichlet mixture modeling (DMM) can simulate 16S rRNA gene predictions from experimental microbiome data. We examined the extent to which simulated 16S rRNA gene microbiome data can accurately reflect the diversity within that identified from experimental data and calculate the power. Even when experimental and simulated datasets differed by less than 10%, simulation by DMM consistently overestimates power, except when using only highly discriminating taxa. Admixtures of DMM with experimental data performed poorly compared to pure simulation and did not show the same correlation with experimental data p-value and power values. While multiple replications of random sampling remain the favored method of determining the power, when the estimated sample size required to achieve a certain power exceeds the sample number, then simulated samples based on DMM can be used. We introduce an R-Package, MPrESS, to assist in power calculation and sample size estimation for a 16S rRNA gene microbiome dataset to detect a difference between populations. MPrESS can be downloaded from GitHub.
Collapse
Affiliation(s)
- Thomas H Clarke
- J. Craig Venter Institute, 9605 Medical Center Drive, Suite #150, Rockville, MD 20850, USA
| | - Chris Greco
- J. Craig Venter Institute, 9605 Medical Center Drive, Suite #150, Rockville, MD 20850, USA
| | - Lauren Brinkac
- J. Craig Venter Institute, 9605 Medical Center Drive, Suite #150, Rockville, MD 20850, USA
- Noblis, Reston, VA 20191, USA
| | - Karen E Nelson
- J. Craig Venter Institute, 9605 Medical Center Drive, Suite #150, Rockville, MD 20850, USA
| | - Harinder Singh
- J. Craig Venter Institute, 9605 Medical Center Drive, Suite #150, Rockville, MD 20850, USA
| |
Collapse
|
5
|
Hong Q, Chen G, Tang ZZ. PhyloMed: a phylogeny-based test of mediation effect in microbiome. Genome Biol 2023; 24:72. [PMID: 37041566 PMCID: PMC10088256 DOI: 10.1186/s13059-023-02902-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/04/2022] [Accepted: 03/15/2023] [Indexed: 04/13/2023] Open
Abstract
Microbiome data from sequencing experiments contain the relative abundance of a large number of microbial taxa with their evolutionary relationships represented by a phylogenetic tree. The compositional and high-dimensional nature of the microbiome mediator challenges the validity of standard mediation analyses. We propose a phylogeny-based mediation analysis method called PhyloMed to address this challenge. Unlike existing methods that directly identify individual mediating taxa, PhyloMed discovers mediation signals by analyzing subcompositions defined on the phylogenic tree. PhyloMed produces well-calibrated mediation test p-values and yields substantially higher discovery power than existing methods.
Collapse
Affiliation(s)
- Qilin Hong
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA
| | - Guanhua Chen
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA
| | - Zheng-Zheng Tang
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, 53715, USA.
| |
Collapse
|
6
|
Shi Y, Zhang L, Do KA, Jenq R, Peterson CB. Sparse tree-based clustering of microbiome data to characterize microbiome heterogeneity in pancreatic cancer. J R Stat Soc Ser C Appl Stat 2023; 72:20-36. [PMID: 37034187 PMCID: PMC10077950 DOI: 10.1093/jrsssc/qlac002] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/16/2023]
Abstract
There is a keen interest in characterizing variation in the microbiome across cancer patients, given increasing evidence of its important role in determining treatment outcomes. Here our goal is to discover subgroups of patients with similar microbiome profiles. We propose a novel unsupervised clustering approach in the Bayesian framework that innovates over existing model-based clustering approaches, such as the Dirichlet multinomial mixture model, in three key respects: we incorporate feature selection, learn the appropriate number of clusters from the data, and integrate information on the tree structure relating the observed features. We compare the performance of our proposed method to existing methods on simulated data designed to mimic real microbiome data. We then illustrate results obtained for our motivating data set, a clinical study aimed at characterizing the tumor microbiome of pancreatic cancer patients.
Collapse
Affiliation(s)
- Yushu Shi
- Department of Statistics, University of Missouri, Columbia, Columbia, MO, USA
| | - Liangliang Zhang
- Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, OH, USA
| | - Kim-Anh Do
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Robert Jenq
- Department of Genomic Medicine, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| | - Christine B Peterson
- Department of Biostatistics, The University of Texas MD Anderson Cancer Center, Houston, TX, USA
| |
Collapse
|
7
|
Liu M, Zhang Q, Ma S. A tree-based gene-environment interaction analysis with rare features. Stat Anal Data Min 2022; 15:648-674. [PMID: 38046814 PMCID: PMC10691867 DOI: 10.1002/sam.11578] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2021] [Accepted: 02/14/2022] [Indexed: 01/20/2023]
Abstract
Gene-environment (G-E) interaction analysis plays a critical role in understanding and modeling complex diseases. Compared to main-effect-only analysis, it is more seriously challenged by higher dimensionality, weaker signals, and the unique "main effects, interactions" variable selection hierarchy. In joint G-E interaction analysis under which a large number of G factors are analysed in a single model, effort tailored to rare features (e.g., SNPs with low minor allele frequencies) has been limited. Existing investigations on rare features have been mostly focused on marginal analysis, where various data aggregation techniques have been developed, and hypothesis testings have been conducted to identify significant aggregated features. However, such techniques cannot be extended to joint G-E interaction analysis. In this study, building on a very recent tree-based data aggregation technique, which has been developed for main-effect-only analysis, we develop a new G-E interaction analysis approach tailored to rare features. The adopted data aggregation technique allows for more efficient information borrowing from neighboring rare features. Similar to some existing state-of-the-art ones, the proposed approach adopts penalization for variable selection, regularized estimation, and respect of the variable selection hierarchy. Simulation shows that it has more accurate identification of important interactions and main effects than several competing alternatives. In the analysis of NFBC1966 study, the proposed approach leads to findings different from the alternatives and with satisfactory prediction and stability performance.
Collapse
Affiliation(s)
- Mengque Liu
- School of Journalism and New Media, Xi’an Jiaotong Universit0y, Shanxi Xi’an, China
| | - Qingzhao Zhang
- Department of Statistics and Data Science, School of Economics, Wang Yanan Institute for Studies in Economics, and Fujian Key Lab of Statistics, Xiamen University, Fujian Xiamen, China
| | - Shuangge Ma
- Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
| |
Collapse
|
8
|
Mao J, Ma L. Dirichlet-tree multinomial mixtures for clustering microbiome compositions. Ann Appl Stat 2022; 16:1476-1499. [DOI: 10.1214/21-aoas1552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
Affiliation(s)
- Jialiang Mao
- Department of Statistical Science, Duke University
| | - Li Ma
- Department of Statistical Science, Duke University
| |
Collapse
|
9
|
Siddiqui NY, Ma L, Brubaker L, Mao J, Hoffman C, Dahl EM, Wang Z, Karstens L. Updating Urinary Microbiome Analyses to Enhance Biologic Interpretation. Front Cell Infect Microbiol 2022; 12:789439. [PMID: 35899056 PMCID: PMC9309214 DOI: 10.3389/fcimb.2022.789439] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2021] [Accepted: 06/13/2022] [Indexed: 11/13/2022] Open
Abstract
ObjectiveAn approach for assessing the urinary microbiome is 16S rRNA gene sequencing, where analysis methods are rapidly evolving. This re-analysis of an existing dataset aimed to determine whether updated bioinformatic and statistical techniques affect clinical inferences.MethodsA prior study compared the urinary microbiome in 123 women with mixed urinary incontinence (MUI) and 84 controls. We obtained unprocessed sequencing data from multiple variable regions, processed operational taxonomic unit (OTU) tables from the original analysis, and de-identified clinical data. We re-processed sequencing data with DADA2 to generate amplicon sequence variant (ASV) tables. Taxa from ASV tables were compared to the original OTU tables; taxa from different variable regions after updated processing were also compared. Bayesian graphical compositional regression (BGCR) was used to test for associations between microbial compositions and clinical phenotypes (e.g., MUI versus control) while adjusting for clinical covariates. Several techniques were used to cluster samples into microbial communities. Multivariable regression was used to test for associations between microbial communities and MUI, again while adjusting for potentially confounding variables.ResultsOf taxa identified through updated bioinformatic processing, only 40% were identified originally, though taxa identified through both methods represented >99% of the sequencing data in terms of relative abundance. Different 16S rRNA gene regions resulted in different recovered taxa. With BGCR analysis, there was a low (33.7%) probability of an association between overall microbial compositions and clinical phenotype. However, when microbial data are clustered into bacterial communities, we confirmed that bacterial communities are associated with MUI. Contrary to the originally published analysis, we did not identify different associations by age group, which may be due to the incorporation of different covariates in statistical models.ConclusionsUpdated bioinformatic processing techniques recover different taxa compared to earlier techniques, though most of these differences exist in low abundance taxa that occupy a small proportion of the overall microbiome. While overall microbial compositions are not associated with MUI, we confirmed associations between certain communities of bacteria and MUI. Incorporation of several covariates that are associated with the urinary microbiome improved inferences when assessing for associations between bacterial communities and MUI in multivariable models.
Collapse
Affiliation(s)
- Nazema Y. Siddiqui
- Division of Urogynecology & Reconstructive Pelvic Surgery, Division of Reproductive Sciences, Department of Obstetrics & Gynecology, Duke University Medical Center, Durham, NC, United States
- *Correspondence: Nazema Y. Siddiqui,
| | - Li Ma
- Department of Statistical Science, Duke University, Durham, NC, United States
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, NC, United States
| | - Linda Brubaker
- Division of Female Pelvic Medicine and Reconstructive Surgery, Department of Obstetrics, Gynecology and Reproductive Sciences, University of California, San Diego, San Diego, CA, United States
| | - Jialiang Mao
- Department of Statistical Science, Duke University, Durham, NC, United States
| | - Carter Hoffman
- Division of Bioinformatics and Computational Biomedicine, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
| | - Erin M. Dahl
- Division of Bioinformatics and Computational Biomedicine, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
| | - Zhuoqun Wang
- Department of Statistical Science, Duke University, Durham, NC, United States
| | - Lisa Karstens
- Division of Bioinformatics and Computational Biomedicine, Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
- Division of Urogynecology, Department of Obstetrics and Gynecology, Oregon Health & Science University, Portland, OR, United States
| |
Collapse
|
10
|
Raiho AM, Paciorek CJ, Dawson A, Jackson ST, Mladenoff DJ, Williams JW, McLachlan JS. 8000-year doubling of Midwestern forest biomass driven by population- and biome-scale processes. Science 2022. [DOI: 10.1126/science.abk3126] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/02/2022]
Abstract
Changes in woody biomass over centuries to millennia are poorly known, leaving unclear the magnitude of terrestrial carbon fluxes before industrial-era disturbance. Here, we statistically reconstructed changes in woody biomass across the upper Midwestern region of the United States over the past 10,000 years using a Bayesian model calibrated to preindustrial forest biomass estimates and fossil pollen records. After an initial postglacial decline, woody biomass nearly doubled during the past 8000 years, sequestering 1800 teragrams. This steady accumulation of carbon was driven by two separate ecological responses to regionally changing climate: the spread of forested biomes and the population expansion of high-biomass tree species within forests. What took millennia to accumulate took less than two centuries to remove: Industrial-era logging and agriculture have erased this carbon accumulation.
Collapse
Affiliation(s)
- A. M. Raiho
- Department of Biological Sciences, University of Notre Dame, South Bend, IN, USA
- Earth System Science Interdisciplinary Center, University of Maryland, College Park, College Park, MD, USA
| | - C. J. Paciorek
- Department of Statistics, University of California, Berkeley, Berkeley, CA, USA
| | - A. Dawson
- Department of General Education, Mount Royal University, Calgary, Alberta, Canada
- Department of Biological Sciences, University of Calgary, Calgary, Alberta, Canada
| | - S. T. Jackson
- US Geological Survey, Southwest and South Central Climate Adaptation Centers, Tucson, AZ, USA
- Department of Geosciences, University of Arizona, Tucson, AZ, USA
| | - D. J. Mladenoff
- Department of Forest and Wildlife Ecology, University of Wisconsin–Madison, Madison, WI, USA
| | - J. W. Williams
- Department of Geography, University of Wisconsin–Madison, Madison, WI, USA
- Center for Climatic Research, University of Wisconsin–Madison, Madison, WI, USA
| | - J. S. McLachlan
- Department of Biological Sciences, University of Notre Dame, South Bend, IN, USA
| |
Collapse
|
11
|
Liu T, Zhou C, Wang H, Zhao H, Wang T. phyloMDA: an R package for phylogeny-aware microbiome data analysis. BMC Bioinformatics 2022; 23:213. [PMID: 35668363 PMCID: PMC9169257 DOI: 10.1186/s12859-022-04744-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/04/2022] [Accepted: 05/23/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Modern sequencing technologies have generated low-cost microbiome survey datasets, across sample sites, conditions, and treatments, on an unprecedented scale and throughput. These datasets often come with a phylogenetic tree that provides a unique opportunity to examine how shared evolutionary history affects the different patterns in host-associated microbial communities. RESULTS In this paper, we describe an R package, phyloMDA, for phylogeny-aware microbiome data analysis. It includes the Dirichlet-tree multinomial model for multivariate abundance data, tree-guided empirical Bayes estimation of microbial compositions, and tree-based multiscale regression methods with relative abundances as predictors. CONCLUSION phyloMDA is a versatile and user-friendly tool to analyze microbiome data while incorporating the phylogenetic information and addressing some of the challenges posed by the data.
Collapse
Affiliation(s)
- Tiantian Liu
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Chao Zhou
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Huimin Wang
- Department of Biostatistics and Bioinformatics, Duke University School of Medicine, Durham, North Carolina, USA
| | - Hongyu Zhao
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China.,Department of Biostatistics, Yale University, New Haven, Connecticut, USA
| | - Tao Wang
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China. .,Department of Statistics, School of Mathematical Sciences, Shanghai Jiao Tong University, Shanghai, China. .,MoE Key Lab of Artificial Intelligence, AI Institute, Shanghai Jiao Tong University, Shanghai, China.
| |
Collapse
|
12
|
Zeng Y, Pang D, Zhao H, Wang T. A Zero-inflated Logistic Normal Multinomial Model for Extracting Microbial Compositions. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2044827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Affiliation(s)
- Yanyan Zeng
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
| | - Daolin Pang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
| | - Hongyu Zhao
- Department of Biostatistics, Yale University
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University
| | - Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University
- Department of Statistics, Shanghai Jiao Tong University
| |
Collapse
|
13
|
Rudra P, Baxter R, Hsieh EWY, Ghosh D. Compositional Data Analysis using Kernels in mass cytometry data. BIOINFORMATICS ADVANCES 2022; 2:vbac003. [PMID: 35224501 PMCID: PMC8867823 DOI: 10.1093/bioadv/vbac003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 08/23/2021] [Revised: 12/06/2021] [Accepted: 01/12/2022] [Indexed: 01/27/2023]
Abstract
MOTIVATION Cell-type abundance data arising from mass cytometry experiments are compositional in nature. Classical association tests do not apply to the compositional data due to their non-Euclidean nature. Existing methods for analysis of cell type abundance data suffer from several limitations for high-dimensional mass cytometry data, especially when the sample size is small. RESULTS We proposed a new multivariate statistical learning methodology, Compositional Data Analysis using Kernels (CODAK), based on the kernel distance covariance (KDC) framework to test the association of the cell type compositions with important predictors (categorical or continuous) such as disease status. CODAK scales well for high-dimensional data and provides satisfactory performance for small sample sizes (n < 25). We conducted simulation studies to compare the performance of the method with existing methods of analyzing cell type abundance data from mass cytometry studies. The method is also applied to a high-dimensional dataset containing different subgroups of populations including Systemic Lupus Erythematosus (SLE) patients and healthy control subjects. AVAILABILITY AND IMPLEMENTATION CODAK is implemented using R. The codes and the data used in this manuscript are available on the web at http://github.com/GhoshLab/CODAK/. CONTACT prudra@okstate.edu. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Pratyaydipta Rudra
- Department of Statistics, Oklahoms State University, Stillwater, OK 74078, USA
- To whom correspondence should be addressed.
| | - Ryan Baxter
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Elena W Y Hsieh
- Department of Immunology and Microbiology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
- Department of Pediatrics, Section of Allergy and Immunology, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| | - Debashis Ghosh
- Department of Biostatistics and Informatics, Colorado School of Public Health, University of Colorado Anschutz Medical Campus, Aurora, CO 80045, USA
| |
Collapse
|
14
|
Ostner J, Carcy S, Müller CL. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data. Front Genet 2021; 12:766405. [PMID: 34950190 PMCID: PMC8689185 DOI: 10.3389/fgene.2021.766405] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/29/2021] [Accepted: 11/01/2021] [Indexed: 12/11/2022] Open
Abstract
Accurate generative statistical modeling of count data is of critical relevance for the analysis of biological datasets from high-throughput sequencing technologies. Important instances include the modeling of microbiome compositions from amplicon sequencing surveys and the analysis of cell type compositions derived from single-cell RNA sequencing. Microbial and cell type abundance data share remarkably similar statistical features, including their inherent compositionality and a natural hierarchical ordering of the individual components from taxonomic or cell lineage tree information, respectively. To this end, we introduce a Bayesian model for tree-aggregated amplicon and single-cell compositional data analysis (tascCODA) that seamlessly integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data. By combining latent parameters based on the tree structure with spike-and-slab Lasso penalization, tascCODA can determine covariate effects across different levels of the population hierarchy in a data-driven parsimonious way. In the context of differential abundance testing, we validate tascCODA's excellent performance on a comprehensive set of synthetic benchmark scenarios. Our analyses on human single-cell RNA-seq data from ulcerative colitis patients and amplicon data from patients with irritable bowel syndrome, respectively, identified aggregated cell type and taxon compositional changes that were more predictive and parsimonious than those proposed by other schemes. We posit that tascCODA constitutes a valuable addition to the growing statistical toolbox for generative modeling and analysis of compositional changes in microbial or cell population data.
Collapse
Affiliation(s)
- Johannes Ostner
- Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
| | - Salomé Carcy
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Department of Biology, École Normale Supérieure, PSL University, Paris, France
| | - Christian L. Müller
- Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Center for Computational Mathematics, Flatiron Institute, New York, NY, United States
| |
Collapse
|
15
|
Zhou C, Zhao H, Wang T. Transformation and differential abundance analysis of microbiome data incorporating phylogeny. Bioinformatics 2021; 37:4652-4660. [PMID: 34302462 DOI: 10.1093/bioinformatics/btab543] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2021] [Revised: 05/31/2021] [Accepted: 07/22/2021] [Indexed: 12/13/2022] Open
Abstract
MOTIVATION Microbiome data have proven extremely useful for understanding microbial communities and their impacts in health and disease. Although microbiome analysis methods and standards are evolving rapidly, obtaining meaningful and interpretable results from microbiome studies still requires careful statistical treatment. In particular, many existing and emerging methods for differential abundance analysis fail to account for the fact that microbiome data are high-dimensional and sparse, compositional, negatively and positively correlated, and phylogenetically structured. To better describe microbiome data and improve the power of differential abundance testing, there is still a great need for the continued development of appropriate statistical methodology. RESULTS In this paper, we propose a model-based approach for microbiome data transformation, and a phylogenetically informed procedure for differential abundance (DA) testing based on the transformed data. First, we extend the Dirichlet-tree multinomial (DTM) to zero-inflated DTM (ZIDTM) for multivariate modeling of microbial counts, addressing data sparsity, and correlation and phylogeny among bacterial taxa. Then, within this framework and using a Bayesian formulation, we introduce posterior mean transformation to convert raw counts into nonzero relative abundances that sum to one, accounting for the compositionality nature of microbiome data. Second, using the transformed data, we propose adaptive analysis of composition of microbiomes (adaANCOM) for DA testing by constructing log-ratios adaptively on the tree for each taxon, greatly reducing the computational complexity of ANCOM in high dimensions. Finally, we present extensive simulation studies, an analysis of HMP data across 18 body sites and 2 visits, and an application to a gut microbiome and malnutrition study, to investigate the performance of posterior mean transformation and adaANCOM. Comparisons with ANCOM and other DA testing procedures show that adaANCOM controls the false discovery rate well, allows for easy interpretation of the results, and is computationally efficient for high-dimensional problems. AVAILABILITY The developed R package is available at https://github.com/ZRChao/adaANCOM. For replicability purposes, scripts for our simulations and data analysis are available at https://github.com/ZRChao/Papers_supplementary. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Chao Zhou
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China.,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, New Haven, Connecticut, U.S.A.,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China.,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China.,MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China
| |
Collapse
|
16
|
Affiliation(s)
| | - Jacob Bien
- Data Sciences and Operations, USC Marshall, Los Angeles, CA
| |
Collapse
|
17
|
Koslovsky MD, Hoffman KL, Daniel CR, Vannucci M. A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1354] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
18
|
Koslovsky MD, Vannucci M. MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection - an R package. BMC Bioinformatics 2020; 21:301. [PMID: 32660471 PMCID: PMC7359232 DOI: 10.1186/s12859-020-03640-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2019] [Accepted: 07/02/2020] [Indexed: 11/29/2022] Open
Abstract
Background Understanding the relation between the human microbiome and modulating factors, such as diet, may help researchers design intervention strategies that promote and maintain healthy microbial communities. Numerous analytical tools are available to help identify these relations, oftentimes via automated variable selection methods. However, available tools frequently ignore evolutionary relations among microbial taxa, potential relations between modulating factors, as well as model selection uncertainty. Results We present MicroBVS, an R package for Dirichlet-tree multinomial models with Bayesian variable selection, for the identification of covariates associated with microbial taxa abundance data. The underlying Bayesian model accommodates phylogenetic structure in the abundance data and various parameterizations of covariates’ prior probabilities of inclusion. Conclusion While developed to study the human microbiome, our software can be employed in various research applications, where the aim is to generate insights into the relations between a set of covariates and compositional data with or without a known tree-like structure.
Collapse
|
19
|
Liu T, Zhao H, Wang T. An empirical Bayes approach to normalization and differential abundance testing for microbiome data. BMC Bioinformatics 2020; 21:225. [PMID: 32493208 PMCID: PMC7268703 DOI: 10.1186/s12859-020-03552-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2019] [Accepted: 05/18/2020] [Indexed: 12/14/2022] Open
Abstract
Background Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional carrying only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances, and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data. Results Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a tree-based extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogeny-aware detection procedure. Conclusions Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonly-used methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).
Collapse
Affiliation(s)
- Tiantian Liu
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China.,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, 300 George Street, New Haven, 06511, USA.,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
| | - Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China. .,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China. .,MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China.
| |
Collapse
|
20
|
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. PROGRESS IN MOLECULAR BIOLOGY AND TRANSLATIONAL SCIENCE 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are one of the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approaches. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Collapse
Affiliation(s)
- Yinglin Xia
- Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States.
| |
Collapse
|
21
|
Singh SP, Staicu AM, Dunn RR, Fierer N, Reich BJ. A nonparametric spatial test to identify factors that shape a microbiome. Ann Appl Stat 2019. [DOI: 10.1214/19-aoas1262] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
22
|
Song Y, Zhao H, Wang T. An adaptive independence test for microbiome community data. Biometrics 2019; 76:414-426. [DOI: 10.1111/biom.13154] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2018] [Accepted: 09/16/2019] [Indexed: 11/29/2022]
Affiliation(s)
- Yaru Song
- Department of Bioinformatics and BiostatisticsShanghai Jiao Tong University Shanghai China
- SJTU‐Yale Joint Center for Biostatistics and Data ScienceShanghai Jiao Tong University Shanghai China
| | - Hongyu Zhao
- Department of BiostatisticsYale University New Haven Connecticut
- SJTU‐Yale Joint Center for Biostatistics and Data ScienceShanghai Jiao Tong University Shanghai China
| | - Tao Wang
- Department of Bioinformatics and BiostatisticsShanghai Jiao Tong University Shanghai China
- SJTU‐Yale Joint Center for Biostatistics and Data ScienceShanghai Jiao Tong University Shanghai China
- MoE Key Lab of Artificial IntelligenceShanghai Jiao Tong University Shanghai China
| |
Collapse
|
23
|
Tang ZZ, Chen G. Zero-inflated generalized Dirichlet multinomial regression model for microbiome compositional data analysis. Biostatistics 2019; 20:698-713. [PMID: 29939212 PMCID: PMC7410344 DOI: 10.1093/biostatistics/kxy025] [Citation(s) in RCA: 38] [Impact Index Per Article: 7.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2017] [Revised: 04/26/2018] [Accepted: 05/06/2018] [Indexed: 12/19/2022] Open
Abstract
There is heightened interest in using high-throughput sequencing technologies to quantify abundances of microbial taxa and linking the abundance to human diseases and traits. Proper modeling of multivariate taxon counts is essential to the power of detecting this association. Existing models are limited in handling excessive zero observations in taxon counts and in flexibly accommodating complex correlation structures and dispersion patterns among taxa. In this article, we develop a new probability distribution, zero-inflated generalized Dirichlet multinomial (ZIGDM), that overcomes these limitations in modeling multivariate taxon counts. Based on this distribution, we propose a ZIGDM regression model to link microbial abundances to covariates (e.g. disease status) and develop a fast expectation-maximization algorithm to efficiently estimate parameters in the model. The derived tests enable us to reveal rich patterns of variation in microbial compositions including differential mean and dispersion. The advantages of the proposed methods are demonstrated through simulation studies and an analysis of a gut microbiome dataset.
Collapse
Affiliation(s)
- Zheng-Zheng Tang
- Department of Biostatistics and Medical Informatics, University of
Wisconsin-Madison, Madison, WI, USA and Wisconsin Institute for
Discovery, Madison, WI, USA
| | - Guanhua Chen
- Department of Biostatistics and Medical Informatics, University of
Wisconsin-Madison, Madison, WI, USA
| |
Collapse
|
24
|
Affiliation(s)
- Jialiang Mao
- Department of Statistical Science, Duke University, Durham, NC
| | - Yuhan Chen
- Department of Statistical Science, Duke University, Durham, NC
| | - Li Ma
- Department of Statistical Science, Duke University, Durham, NC
| |
Collapse
|
25
|
Wang T, Yang C, Zhao H. Prediction analysis for microbiome sequencing data. Biometrics 2019; 75:875-884. [PMID: 30994187 DOI: 10.1111/biom.13061] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/25/2017] [Revised: 03/08/2019] [Accepted: 03/13/2019] [Indexed: 01/22/2023]
Abstract
One goal of human microbiome studies is to relate host traits with human microbiome compositions. The analysis of microbial community sequencing data presents great statistical challenges, especially when the samples have different library sizes and the data are overdispersed with many zeros. To address these challenges, we introduce a new statistical framework, called predictive analysis in metagenomics via inverse regression (PAMIR), to analyze microbiome sequencing data. Within this framework, an inverse regression model is developed for overdispersed microbiota counts given the trait, and then a prediction rule is constructed by taking advantage of the dimension-reduction structure in the model. An efficient Monte Carlo expectation-maximization algorithm is proposed for maximum likelihood estimation. The method is further generalized to accommodate other types of covariates. We demonstrate the advantages of PAMIR through simulations and two real data examples.
Collapse
Affiliation(s)
- Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China.,MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China.,SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
| | - Can Yang
- Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong
| | - Hongyu Zhao
- SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China.,Department of Biostatistics, Yale University, New Haven, Connecticut
| |
Collapse
|
26
|
Tang Y, Ma L, Nicolae DL. A phylogenetic scan test on a Dirichlet-tree multinomial model for microbiome data. Ann Appl Stat 2018. [DOI: 10.1214/17-aoas1086] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
27
|
Xia Y, Sun J, Chen DG. Introductory Overview of Statistical Analysis of Microbiome Data. STATISTICAL ANALYSIS OF MICROBIOME DATA WITH R 2018. [DOI: 10.1007/978-981-13-1534-3_3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
|