1. Deek RA, Ma S, Lewis J, Li H. Statistical and computational methods for integrating microbiome, host genomics, and metabolomics data. eLife 2024; 13:e88956. [PMID: 38832759 PMCID: PMC11149933 DOI: 10.7554/elife.88956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/26/2023] [Accepted: 05/10/2024] [Indexed: 06/05/2024] Open
Abstract
Large-scale microbiome studies are progressively utilizing multiomics designs, which include the collection of microbiome samples together with host genomics and metabolomics data. Despite the increasing number of data sources, there remains a bottleneck in understanding the relationships between different data modalities due to the limited number of statistical and computational methods for analyzing such data. Furthermore, little is known about the portability of general methods to the metagenomic setting and few specialized techniques have been developed. In this review, we summarize and implement some of the commonly used methods. We apply these methods to real data sets where shotgun metagenomic sequencing and metabolomics data are available for microbiome multiomics data integration analysis. We compare results across methods, highlight strengths and limitations of each, and discuss areas where statistical and computational innovation is needed.
Affiliation(s)
- Rebecca A Deek
  - Department of Biostatistics, University of Pittsburgh, Pittsburgh, United States
- Siyuan Ma
  - Department of Biostatistics, Vanderbilt School of Medicine, Nashville, United States
- James Lewis
  - Division of Gastroenterology and Hepatology, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
- Hongzhe Li
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, United States
2. Chi J, Ye J, Zhou Y. A GLM-based zero-inflated generalized Poisson factor model for analyzing microbiome data. Front Microbiol 2024; 15:1394204. [PMID: 38873138 PMCID: PMC11173601 DOI: 10.3389/fmicb.2024.1394204] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/01/2024] [Accepted: 05/20/2024] [Indexed: 06/15/2024] Open
Abstract
Motivation: High-throughput sequencing technology facilitates the quantitative analysis of microbial communities, improving the capacity to investigate the associations between the human microbiome and diseases. Our primary motivating application is to explore the association between gut microbes and obesity. The complex characteristics of microbiome data, including high dimensionality, zero inflation, and over-dispersion, pose new statistical challenges for downstream analysis. Results: We propose a GLM-based zero-inflated generalized Poisson factor analysis (GZIGPFA) model to analyze microbiome data with complex characteristics. The GZIGPFA model is based on a zero-inflated generalized Poisson (ZIGP) distribution for modeling microbiome count data. A link function between the generalized Poisson rate and the probability of excess zeros is established within the generalized linear model (GLM) framework. The latent parameters of the GZIGPFA model constitute a low-rank matrix comprising a low-dimensional score matrix and a loading matrix. An alternating maximum likelihood algorithm is employed to estimate the unknown parameters, and cross-validation is utilized to determine the rank of the model in this study. The proposed GZIGPFA model demonstrates superior performance and advantages through comprehensive simulation studies and real data applications.
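To make the zero-inflated generalized Poisson (ZIGP) distribution named in this abstract concrete, here is a minimal Python sketch of its probability mass function. It assumes Consul's parameterization of the generalized Poisson (rate `theta`, dispersion `lam`) and a constant zero-inflation probability `pi`. These parameter names and the functions `generalized_poisson_pmf` and `zigp_pmf` are illustrative assumptions, not taken from the paper; the GZIGPFA model additionally links the rate and the excess-zero probability through a GLM and a low-rank factor structure, which this sketch omits.

```python
import math


def generalized_poisson_pmf(y, theta, lam):
    """Generalized Poisson pmf (Consul's form, assumed here).

    theta > 0 is the rate; 0 <= lam < 1 controls over-dispersion.
    With lam == 0 this reduces to the ordinary Poisson pmf.
    """
    if y < 0:
        return 0.0
    # Work on the log scale to avoid overflow for large counts.
    log_p = (
        math.log(theta)
        + (y - 1) * math.log(theta + lam * y)
        - theta
        - lam * y
        - math.lgamma(y + 1)
    )
    return math.exp(log_p)


def zigp_pmf(y, theta, lam, pi):
    """Zero-inflated generalized Poisson pmf.

    With probability pi the count is a structural zero; otherwise it is
    drawn from the generalized Poisson distribution.
    """
    base = (1.0 - pi) * generalized_poisson_pmf(y, theta, lam)
    return pi + base if y == 0 else base
```

Under these assumptions the probability of a zero is `pi + (1 - pi) * exp(-theta)`, which shows how the mixture adds excess zeros on top of those the count distribution already produces.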
Affiliation(s)
- Jinling Chi
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
  - School of Mathematical Sciences, Heilongjiang University, Harbin, China
3. Ozminkowski S, Solís‐Lemus C. Identifying microbial drivers in biological phenotypes with a Bayesian network regression model. Ecol Evol 2024; 14:e11039. [PMID: 38774136 PMCID: PMC11106058 DOI: 10.1002/ece3.11039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2023] [Revised: 01/29/2024] [Accepted: 02/03/2024] [Indexed: 05/24/2024] Open
Abstract
In Bayesian Network Regression models, networks are considered the predictors of continuous responses. These models have been successfully used in brain research to identify regions in the brain that are associated with specific human traits, yet their potential to elucidate microbial drivers in biological phenotypes for microbiome research remains unknown. In particular, microbial networks are challenging due to their high dimension and high sparsity compared to brain networks. Furthermore, unlike in brain connectome research, in microbiome research, it is usually expected that the presence of microbes has an effect on the response (main effects), not just the interactions. Here, we conduct the first thorough investigation of whether Bayesian Network Regression models are suitable for microbial datasets on a variety of synthetic and real data under diverse biological scenarios. We test whether the Bayesian Network Regression model that accounts only for interaction effects (edges in the network) is able to identify key drivers (microbes) in phenotypic variability. We show that this model is indeed able to identify influential nodes and edges in the microbial networks that drive changes in the phenotype for most biological settings, but we also identify scenarios where this method performs poorly, which allows us to provide practical advice for domain scientists aiming to apply these tools to their datasets. BNR models provide a framework for microbiome researchers to identify connections between microbes and measured phenotypes. We facilitate the use of this statistical model by providing an easy-to-use implementation as a publicly available Julia package at https://github.com/solislemuslab/BayesianNetworkRegression.jl.
Affiliation(s)
- Samuel Ozminkowski
  - Department of Statistics and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
- Claudia Solís‐Lemus
  - Department of Plant Pathology and Wisconsin Institute for Discovery, University of Wisconsin‐Madison, Madison, Wisconsin, USA
4. Chi J, Ye J, Zhou Y. Mapping QTL controlling count traits with excess zeros and ones using a zero-and-one-inflated generalized Poisson regression model. Biom J 2024; 66:e2200342. [PMID: 38616336 DOI: 10.1002/bimj.202200342] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2022] [Revised: 11/26/2023] [Accepted: 12/08/2023] [Indexed: 04/16/2024]
Abstract
The research on the quantitative trait locus (QTL) mapping of count data has aroused the wide attention of researchers. There are frequent problems in applied research that limit the application of the conventional Poisson model in the analysis of count phenotypes, which include the overdispersion and excess zeros and ones. In this article, a novel model, that is, the zero-and-one-inflated generalized Poisson (ZOIGP) model, is proposed to deal with these problems. Based on the proposed model, a score test is performed for the inflation parameter, in which the ZOIGP model with a constant proportion of excess zeros and ones is compared with a standard generalized Poisson model. To illustrate the practicability of the ZOIGP model, we extend it to the QTL interval mapping application that underpins count phenotype with excess zeros and excess ones. The genetic effects are estimated utilizing the expectation-maximization algorithm embedded with the Newton-Raphson algorithm, and the genome-wide scan and likelihood ratio test is performed to map and test the potential QTLs. The statistical properties exhibited by the proposed method are investigated through simulation. Finally, a real data analysis example is used to illustrate the utility of the proposed method for QTL mapping.
Affiliation(s)
- Jinling Chi
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Jimin Ye
  - School of Mathematics and Statistics, Xidian University, Xi'an, China
- Ying Zhou
  - School of Mathematical Sciences, Heilongjiang University, Harbin, China
5. Koslovsky MD. A Bayesian zero-inflated Dirichlet-multinomial regression model for multivariate compositional count data. Biometrics 2023; 79:3239-3251. [PMID: 36896642 DOI: 10.1111/biom.13853] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/25/2022] [Accepted: 02/23/2023] [Indexed: 03/11/2023]
Abstract
The Dirichlet-multinomial (DM) distribution plays a fundamental role in modern statistical methodology development and application. Recently, the DM distribution and its variants have been used extensively to model multivariate count data generated by high-throughput sequencing technology in omics research due to its ability to accommodate the compositional structure of the data as well as overdispersion. A major limitation of the DM distribution is that it is unable to handle excess zeros typically found in practice which may bias inference. To fill this gap, we propose a novel Bayesian zero-inflated DM model for multivariate compositional count data with excess zeros. We then extend our approach to regression settings and embed sparsity-inducing priors to perform variable selection for high-dimensional covariate spaces. Throughout, modeling decisions are made to boost scalability without sacrificing interpretability or imposing limiting assumptions. Extensive simulations and an application to a human gut microbiome dataset are presented to compare the performance of the proposed method to existing approaches. We provide an accompanying R package with a user-friendly vignette to apply our method to other datasets.
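The combination of compositional structure, overdispersion, and excess zeros described in this abstract can be illustrated with a small simulation. The Python sketch below draws one count vector from a simple zero-inflated Dirichlet-multinomial generative scheme: each taxon is a structural zero with probability `zero_prob`, the remaining taxa receive Dirichlet proportions (obtained by normalizing gamma draws), and reads are allocated multinomially. The function name `sample_zidm`, its parameters, and this particular construction are assumptions made for illustration; they are not necessarily the formulation used in the paper or its R package.

```python
import random


def sample_zidm(depth, alpha, zero_prob, rng):
    """Draw one sample from an illustrative zero-inflated DM scheme.

    depth: total number of reads in the sample
    alpha: positive concentration parameter for each taxon
    zero_prob: probability that a taxon is a structural zero
    rng: a random.Random instance (for reproducibility)
    Returns (counts, present), where present[j] is False for
    structural zeros.
    """
    present = [rng.random() >= zero_prob for _ in alpha]
    # Independent gamma draws, once normalized, are Dirichlet distributed;
    # structural zeros get weight 0 and so can never receive reads.
    weights = [
        rng.gammavariate(a, 1.0) if keep else 0.0
        for a, keep in zip(alpha, present)
    ]
    counts = [0] * len(alpha)
    if sum(weights) == 0.0:
        # Every taxon was a structural zero: return an all-zero vector.
        return counts, present
    # random.choices normalizes the weights, giving a multinomial draw.
    for taxon in rng.choices(range(len(alpha)), weights=weights, k=depth):
        counts[taxon] += 1
    return counts, present
```

A call such as `sample_zidm(100, [0.5, 1.0, 2.0, 4.0], 0.3, random.Random(0))` returns counts that sum to the sequencing depth whenever at least one taxon is present, with zeros arising both structurally and from sampling.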
Affiliation(s)
- Matthew D Koslovsky
  - Department of Statistics, Colorado State University, Fort Collins, Colorado, USA
6. Ribaud M, Gabriel E, Hughes J, Soubeyrand S. Identifying potential significant factors impacting zero-inflated proportion data. Stat Med 2023; 42:3467-3486. [PMID: 37290435 DOI: 10.1002/sim.9814] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/23/2022] [Revised: 04/03/2023] [Accepted: 05/19/2023] [Indexed: 06/10/2023]
Abstract
Classical supervised methods like linear regression and decision trees are not completely adapted for identifying impacting factors on a response variable corresponding to zero-inflated proportion data (ZIPD) that are dependent, continuous and bounded. In this article we propose a within-block permutation-based methodology to identify factors (discrete or continuous) that are significantly correlated with ZIPD, we propose a performance indicator quantifying the percentage of correlation explained by the subset of significant factors, and we show how to predict the ranks of the response variables conditionally on the observation of these factors. The methodology is illustrated on simulated data and on two real data sets dealing with epidemiology. In the first data set, ZIPD correspond to probabilities of transmission of Influenza between horses. In the second data set, ZIPD correspond to probabilities that geographic entities (eg, states and countries) have the same COVID-19 mortality dynamics.
Affiliation(s)
- Joseph Hughes
  - Centre for Virus Research, MRC-University of Glasgow, Glasgow, UK
7. Boshuizen HC, Te Beest DE. Pitfalls in the statistical analysis of microbiome amplicon sequencing data. Mol Ecol Resour 2023; 23:539-548. [PMID: 36330663 DOI: 10.1111/1755-0998.13730] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 02/16/2022] [Accepted: 10/27/2022] [Indexed: 11/06/2022]
Abstract
Microbiome data are characterized by several aspects that make them challenging to analyse statistically: they are compositional, high dimensional and rich in zeros. A large array of statistical methods exist to analyse these data. Some are borrowed from other fields, such as ecology or RNA-sequencing, while others are custom-made for microbiome data. The large and continuously expanding range of available methods means that researchers have to invest considerable effort in choosing what method(s) to apply. In this paper we list 14 statistical methods or approaches that we think should be generally avoided. In several cases this is because we believe the assumptions behind the method are unlikely to be met for microbiome data. In other cases we see methods that are used in ways they are not intended to be used. We believe researchers would be helped by more critical evaluations of existing methods, as not all methods in use are suitable or have been sufficiently reviewed. We hope this paper contributes to a critical discussion on what methods are appropriate to use in the analysis of microbiome data.
Affiliation(s)
- Dennis E Te Beest
  - Biometris, Wageningen University and Research, Wageningen, The Netherlands
8. Aldirawi H, Morales FG. Univariate and Multivariate Statistical Analysis of Microbiome Data: An Overview. Appl Microbiol 2023. [DOI: 10.3390/applmicrobiol3020023] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/30/2023]
Abstract
Microbiome data is high dimensional, sparse, compositional, and over-dispersed. Therefore, modeling microbiome data is very challenging and it is an active research area. Microbiome analysis has become a progressing area of research as microorganisms constitute a large part of life. Since many methods of microbiome data analysis have been presented, this review summarizes the challenges, methods used, and the advantages and disadvantages of those methods, to serve as an updated guide for those in the field. This review also compared different methods of analysis to progress the development of newer methods.
9. Zhao X, Zhang J, Lin W. Clustering multivariate count data via Dirichlet-multinomial network fusion. Comput Stat Data Anal 2023. [DOI: 10.1016/j.csda.2022.107634] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/05/2022]
10. Jiang R, Zhan X, Wang T. A Flexible Zero-Inflated Poisson-Gamma Model with Application to Microbiome Sequence Count Data. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2151447] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/03/2022]
Affiliation(s)
- Roulan Jiang
  - Center for Statistical Science and Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
- Xiang Zhan
  - Department of Biostatistics, School of Public Health, Beijing International Center for Mathematical Research and Center for Statistical Science, Peking University, Beijing 100871, China
- Tianying Wang
  - Center for Statistical Science and Department of Industrial Engineering, Tsinghua University, Beijing 100084, China
11. Li Z, Yu X, Guo H, Lee T, Hu J. A maximum-type microbial differential abundance test with application to high-dimensional microbiome data analyses. Front Cell Infect Microbiol 2022; 12:988717. [PMID: 36389165 PMCID: PMC9650337 DOI: 10.3389/fcimb.2022.988717] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/07/2022] [Accepted: 10/04/2022] [Indexed: 12/03/2022] Open
Abstract
Background: High-throughput metagenomic sequencing technologies have shown prominent advantages over traditional pathogen detection methods, bringing great potential in clinical pathogen diagnosis and treatment of infectious diseases. Nevertheless, how to accurately detect the difference in microbiome profiles between treatment or disease conditions remains computationally challenging. Results: In this study, we propose a novel test for identifying the difference between two high-dimensional microbiome abundance data matrices based on the centered log-ratio transformation of the microbiome compositions. The test p-value can be calculated directly with a closed-form solution from the derived asymptotic null distribution. We also investigate the asymptotic statistical power against sparse alternatives that are typically encountered in microbiome studies. The proposed test is maximum-type equal-covariance-assumption-free (MECAF), making it widely applicable to studies that compare microbiome compositions between conditions. Our simulation studies demonstrated that the proposed MECAF test achieves more desirable power than competing methods while having the type I error rate well controlled under various scenarios. The usefulness of the proposed test is further illustrated with two real microbiome data analyses. The source code of the proposed method is freely available at https://github.com/Jiyuan-NYU-Langone/MECAF. Conclusions: MECAF is a flexible differential abundance test and achieves statistical efficiency in analyzing high-throughput microbiome data. The proposed new method will allow us to efficiently discover shifts in microbiome abundances between disease and treatment conditions, broadening our understanding of the disease and ultimately improving clinical diagnosis and treatment.
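The centered log-ratio (CLR) transformation on which this test is built can be sketched in a few lines of Python. The version below converts one sample's counts to proportions and subtracts the mean log-proportion from each log-proportion. The function name `clr_transform` and the default `pseudocount` of 0.5 (added so that zero counts have a finite logarithm) are assumptions for illustration; the abstract does not say how zeros are handled, and the sketch does not implement the MECAF test statistic itself.

```python
import math


def clr_transform(counts, pseudocount=0.5):
    """Centered log-ratio transform of one sample's taxon counts.

    counts: non-negative counts, one per taxon
    pseudocount: small value added to every count so log(0) never occurs
                 (an assumed convention, not taken from the paper)
    """
    shifted = [c + pseudocount for c in counts]
    total = sum(shifted)
    # Log of each proportion in the (pseudocount-adjusted) composition.
    logs = [math.log(v / total) for v in shifted]
    # Subtracting the mean log is equivalent to dividing by the
    # geometric mean of the composition before taking logs.
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]
```

Each transformed sample sums to zero, so the values express how abundant a taxon is relative to the sample's geometric mean rather than in absolute terms.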
Affiliation(s)
- Zhengbang Li
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
- Xiaochen Yu
  - School of Mathematics and Statistics, Central China Normal University, Wuhan, China
- Hongping Guo
  - School of Mathematics and Statistics, Hubei Normal University, Huangshi, China
- TingFang Lee
  - Division of Biostatistics, Department of Population Health, New York University (NYU) Grossman School of Medicine, New York, NY, United States
- Jiyuan Hu
  - Division of Biostatistics, Department of Population Health, New York University (NYU) Grossman School of Medicine, New York, NY, United States
  - *Correspondence: Jiyuan Hu,
12. Love CJ, Gubert C, Kodikara S, Kong G, Lê Cao KA, Hannan AJ. Microbiota DNA isolation, 16S rRNA amplicon sequencing, and bioinformatic analysis for bacterial microbiome profiling of rodent fecal samples. STAR Protoc 2022; 3:101772. [PMID: 36313541 PMCID: PMC9597187 DOI: 10.1016/j.xpro.2022.101772] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Indexed: 11/27/2022] Open
Abstract
Fecal samples are frequently used to characterize bacterial populations of the gastrointestinal tract. A protocol is provided to profile gut bacterial populations using rodent fecal samples. We describe the optimal procedures for collecting rodent fecal samples, isolating genomic DNA, 16S rRNA gene V4 region sequencing, and bioinformatic analyses. This protocol includes detailed instructions and example outputs to ensure accurate, reproducible results and data visualization. Comprehensive troubleshooting and limitation sections address technical and statistical issues that may arise when profiling microbiota. For complete details on the use and execution of this protocol, please refer to Gubert et al. (2022).
Affiliation(s)
- Chloe J. Love
  - The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, VIC 3010, Australia
- Carolina Gubert
  - The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, VIC 3010, Australia
  - Corresponding author
- Saritha Kodikara
  - Department of Anatomy and Physiology, University of Melbourne, Parkville, VIC 3010, Australia
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Parkville, VIC 3010, Australia
- Geraldine Kong
  - The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, VIC 3010, Australia
- Kim-Anh Lê Cao
  - Department of Anatomy and Physiology, University of Melbourne, Parkville, VIC 3010, Australia
- Anthony J. Hannan
  - The Florey Institute of Neuroscience and Mental Health, University of Melbourne, Parkville, VIC 3010, Australia
  - Department of Anatomy and Physiology, University of Melbourne, Parkville, VIC 3010, Australia
  - Corresponding author
13. Identification of microbial features in multivariate regression under false discovery rate control. Comput Stat Data Anal 2022. [DOI: 10.1016/j.csda.2022.107621] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022]
14. Jensen AJ, Kelly RP, Anderson EC, Satterthwaite WH, Shelton AO, Ward EJ. Introducing zoid: A mixture model and R package for modeling proportional data with zeros and ones in ecology. Ecology 2022; 103:e3804. [PMID: 35804486 DOI: 10.1002/ecy.3804] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/24/2022] [Revised: 04/26/2022] [Accepted: 04/29/2022] [Indexed: 11/08/2022]
Abstract
Many ecological datasets are proportional, representing mixtures of constituent elements such as species, populations, or strains. Analyses of proportional data are challenged by categories with zero observations (zeros), all the observations (ones), and overdispersion. In lieu of ad-hoc data adjustments, we describe and evaluate a zero-and-one inflated Dirichlet regression model, with its corresponding R package (zoid), capable of handling observed data (x) consisting of three possible categories: zeros, proportions, or ones. Instead of fitting the model to observations of single biological units (e.g., individual organisms) within a sample, we sum proportional contributions across units and estimate mixture proportions using one aggregated observation per sample. Optional estimation of overdispersion and covariate influences expand model applications. We evaluate model performance, as implemented in Stan, using simulations and two ecological case studies. We show that zoid successfully estimates mixture proportions using simulated data with varying sample sizes and is robust to overdispersion and covariate structure. In the empirical case studies, we estimate composition of a mixed-stock Chinook salmon (Oncorhynchus tshawytscha) fishery and analyze stomach contents of Atlantic cod (Gadus morhua). Our implementation of the model as an R package facilitates its application to varied ecological datasets composed of proportional observations.
Affiliation(s)
- Alexander J Jensen
  - University of Washington, School of Marine and Environmental Affairs, Seattle, WA, USA
- Ryan P Kelly
  - University of Washington, School of Marine and Environmental Affairs, Seattle, WA, USA
- Eric C Anderson
  - Fisheries Ecology Division, Southwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic & Atmospheric Administration, Santa Cruz, CA, USA
- William H Satterthwaite
  - Fisheries Ecology Division, Southwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic & Atmospheric Administration, Santa Cruz, CA, USA
- Andrew Olaf Shelton
  - Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, WA, USA
- Eric J Ward
  - Conservation Biology Division, Northwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, Seattle, WA, USA
15. Verster A, Petronella N, Green J, Matias F, Brooks SPJ. A Bayesian method for identifying associations between response variables and bacterial community composition. PLoS Comput Biol 2022; 18:e1010108. [PMID: 35793382 PMCID: PMC9307184 DOI: 10.1371/journal.pcbi.1010108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2021] [Revised: 07/22/2022] [Accepted: 04/14/2022] [Indexed: 11/18/2022] Open
Abstract
Determining associations between intestinal bacteria and continuously measured physiological outcomes is important for understanding the bacteria-host relationship but is not straightforward since abundance data (compositional data) are not normally distributed. To address this issue, we developed a fully Bayesian linear regression model (BRACoD; Bayesian Regression Analysis of Compositional Data) with physiological measurements (continuous data) as a function of a matrix of relative bacterial abundances. Bacteria can be classified as operational taxonomic units or by taxonomy (genus, family, etc.). Bacteria associated with the physiological measurement were identified using a Bayesian variable selection method: Stochastic Search Variable Selection. The output is a list of inclusion probabilities ([Formula: see text]) and coefficients that indicate the strength of the association ([Formula: see text]) for each bacterial taxa. Tests with simulated communities showed that adopting a cut point value of [Formula: see text] ≥ 0.3 for identifying included bacteria optimized the true positive rate (TPR) while maintaining a false positive rate (FPR) of ≤ 5%. At this point, the chances of identifying non-contributing bacteria were low and all well-established contributors were included. Comparison with other methods showed that BRACoD (at [Formula: see text] ≥ 0.3) had higher precision and a higher TPR than a commonly used center log transformed LASSO procedure (clr-LASSO) as well as higher TPR than an off-the-shelf Spike and Slab method after center log transformation (clr-SS). BRACoD was also less likely to include non-contributing bacteria that merely correlate with contributing bacteria. Analysis of a rat microbiome experiment identified 47 operational taxonomic units that contributed to fecal butyrate levels. Of these, 31 were positively and 16 negatively associated with butyrate. 
Consistent with their known role in butyrate metabolism, most of these fell within the Lachnospiraceae and Ruminococcaceae. We conclude that BRACoD provides a more precise and accurate method for determining bacteria associated with a continuous physiological outcome compared to clr-LASSO. It is more sensitive than a generalized clr-SS algorithm, although it has a higher FPR. Its ability to distinguish genuine contributors from correlated bacteria makes it better suited to discriminating bacteria that directly contribute to an outcome. The algorithm corrects for the distortions arising from compositional data making it appropriate for analysis of microbiome data.
Affiliation(s)
- Adrian Verster
  - Bureau of Food Surveillance and Science Integration, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
- Nicholas Petronella
  - Bureau of Food Surveillance and Science Integration, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
- Judy Green
  - Bureau of Nutritional Sciences, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
- Fernando Matias
  - Bureau of Nutritional Sciences, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
- Stephen P. J. Brooks
  - Bureau of Nutritional Sciences, Food Directorate, Health Products and Food Branch, Health Canada, Ottawa, Canada
16. Wu Q, O’Malley J, Datta S, Gharaibeh RZ, Jobin C, Karagas MR, Coker MO, Hoen AG, Christensen BC, Madan JC, Li Z. MarZIC: A Marginal Mediation Model for Zero-Inflated Compositional Mediators with Applications to Microbiome Data. Genes (Basel) 2022; 13:1049. [PMID: 35741811 PMCID: PMC9223163 DOI: 10.3390/genes13061049] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/10/2022] [Revised: 06/06/2022] [Accepted: 06/07/2022] [Indexed: 12/15/2022] Open
Abstract
Background: The human microbiome can contribute to the pathogenesis of many complex diseases by mediating disease-leading causal pathways. However, standard mediation analysis methods are not adequate to analyze the microbiome as a mediator because of the excessive number of zero-valued sequencing reads in the data and the constraint that relative abundances sum to one. The two main challenges raised by the zero-inflated data structure are: (a) disentangling the mediation effect induced by the point mass at zero; and (b) identifying the observed zero-valued data points that are not zero (i.e., false zeros). Methods: We develop a novel marginal mediation analysis method under the potential-outcomes framework to address these issues. We also show that the marginal model can account for the compositional structure of microbiome data. Results: The mediation effect can be decomposed into two components that are inherent to the two-part nature of zero-inflated distributions. With probabilistic models to account for observing zeros, we also address the challenge of false zeros. A comprehensive simulation study and an application to a real microbiome study showcase our approach in comparison with existing approaches. Conclusions: When analyzing zero-inflated microbiome compositions as mediators, the MarZIC approach performs better than standard causal mediation analysis approaches and an existing competing approach.
Affiliation(s)
- Quran Wu
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
- James O’Malley
  - The Dartmouth Institute, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Susmita Datta
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
- Raad Z. Gharaibeh
  - Department of Medicine, University of Florida, Gainesville, FL 32611, USA
- Christian Jobin
  - Department of Medicine, University of Florida, Gainesville, FL 32611, USA
- Margaret R. Karagas
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Modupe O. Coker
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Anne G. Hoen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Brock C. Christensen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Juliette C. Madan
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH 03755, USA
- Zhigang Li
  - Department of Biostatistics, University of Florida, Gainesville, FL 32611, USA
17
|
Zeng Y, Pang D, Zhao H, Wang T. A Zero-inflated Logistic Normal Multinomial Model for Extracting Microbial Compositions. J Am Stat Assoc 2022. [DOI: 10.1080/01621459.2022.2044827] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Yanyan Zeng
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
- Daolin Pang
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
- Hongyu Zhao
  - Department of Biostatistics, Yale University
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University
- Tao Wang
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University
  - Department of Statistics, Shanghai Jiao Tong University
18
Alenazi A. A review of compositional data analysis and recent advances. Commun Stat Theory Methods 2021. [DOI: 10.1080/03610926.2021.2014890] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
Affiliation(s)
- Abdulaziz Alenazi
  - Department of Mathematics, College of Science, Northern Border University, Arar, Saudi Arabia
19
Tang M, Wu Q, Yang S, Tian G. Dirichlet composition distribution for compositional data with zero components: An application to fluorescence in situ hybridization (FISH) detection of chromosome. Biom J 2021; 64:714-732. [PMID: 34914842 PMCID: PMC9300144 DOI: 10.1002/bimj.202000334] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/09/2020] [Revised: 08/24/2021] [Accepted: 08/31/2021] [Indexed: 11/26/2022]
Abstract
Zeros in compositional data are very common and can be classified into rounded and essential zeros. The rounded zero refers to a small proportion or below detection limit value, while the essential zero refers to the complete absence of the component in the composition. In this article, we propose a new framework for analyzing compositional data with zero entries by introducing a stochastic representation. In particular, a new distribution, namely the Dirichlet composition distribution, is developed to accommodate the possible essential‐zero feature in compositional data. We derive its distributional properties (e.g., its moments). The calculation of maximum likelihood estimates via the Expectation‐Maximization (EM) algorithm will be proposed. The regression model based on the new Dirichlet composition distribution will be considered. Simulation studies are conducted to evaluate the performance of the proposed methodologies. Finally, our method is employed to analyze a dataset of fluorescence in situ hybridization (FISH) for chromosome detection.
Affiliation(s)
- Man‐Lai Tang
  - Department of Mathematics, College of Engineering, Design & Physical Sciences, Brunel University London, Uxbridge, United Kingdom
- Qin Wu
  - Department of Statistics, School of Mathematical Sciences, South China Normal University, Guangzhou City, Guangdong, P. R. China
- Sheng Yang
  - Zhongshan People's Hospital, Zhongshan, P. R. China
- Guo‐Liang Tian
  - Department of Statistics and Data Science, Southern University of Science and Technology, Shenzhen City, Guangdong, P. R. China
20
Ostner J, Carcy S, Müller CL. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data. Front Genet 2021; 12:766405. [PMID: 34950190 PMCID: PMC8689185 DOI: 10.3389/fgene.2021.766405] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/29/2021] [Accepted: 11/01/2021] [Indexed: 12/11/2022] [Open Access]
Abstract
Accurate generative statistical modeling of count data is of critical relevance for the analysis of biological datasets from high-throughput sequencing technologies. Important instances include the modeling of microbiome compositions from amplicon sequencing surveys and the analysis of cell type compositions derived from single-cell RNA sequencing. Microbial and cell type abundance data share remarkably similar statistical features, including their inherent compositionality and a natural hierarchical ordering of the individual components from taxonomic or cell lineage tree information, respectively. To this end, we introduce a Bayesian model for tree-aggregated amplicon and single-cell compositional data analysis (tascCODA) that seamlessly integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data. By combining latent parameters based on the tree structure with spike-and-slab Lasso penalization, tascCODA can determine covariate effects across different levels of the population hierarchy in a data-driven parsimonious way. In the context of differential abundance testing, we validate tascCODA's excellent performance on a comprehensive set of synthetic benchmark scenarios. Our analyses on human single-cell RNA-seq data from ulcerative colitis patients and amplicon data from patients with irritable bowel syndrome, respectively, identified aggregated cell type and taxon compositional changes that were more predictive and parsimonious than those proposed by other schemes. We posit that tascCODA constitutes a valuable addition to the growing statistical toolbox for generative modeling and analysis of compositional changes in microbial or cell population data.
Affiliation(s)
- Johannes Ostner
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Salomé Carcy
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Department of Biology, École Normale Supérieure, PSL University, Paris, France
- Christian L. Müller
  - Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
  - Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
  - Center for Computational Mathematics, Flatiron Institute, New York, NY, United States
21
Chen B, Xu W. Functional response regression model on correlated longitudinal microbiome sequencing data. Stat Methods Med Res 2021; 31:361-371. [PMID: 34866471 PMCID: PMC8829735 DOI: 10.1177/09622802211061634] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/16/2022]
Abstract
Functional regression has been widely used on longitudinal data, but it is not clear how to apply functional regression to microbiome sequencing data. We propose a novel functional response regression model for analyzing correlated longitudinal microbiome sequencing data, which extends the classic functional response regression model that works only for independent functional responses. We derive the theory of generalized least squares estimators for predictors' effects when functional responses are correlated, and develop a data transformation technique to solve the computational challenge of analyzing correlated functional response data using existing functional regression methods. We show by extensive simulations that our proposed method provides unbiased estimates of predictors' effects, and that our model has accurate type I error and power performance for correlated functional response data, compared with the classic functional response regression model. Finally, we apply our method to a real infant gut microbiome study to evaluate the relationship of clinical factors to predominant taxa over time.
Affiliation(s)
- Bo Chen
  - Department of Biostatistics, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
- Wei Xu
  - Department of Biostatistics, Princess Margaret Cancer Centre, University Health Network, Toronto, Ontario, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
22
Correcting the Estimation of Viral Taxa Distributions in Next-Generation Sequencing Data after Applying Artificial Neural Networks. Genes (Basel) 2021; 12:1755. [PMID: 34828361 PMCID: PMC8624964 DOI: 10.3390/genes12111755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2021] [Revised: 10/25/2021] [Accepted: 10/27/2021] [Indexed: 11/16/2022] [Open Access]
Abstract
Estimating the taxonomic composition of viral sequences in a biological sample processed by next-generation sequencing is an important step in comparative metagenomics. Mapping sequencing reads against a database of known viral reference genomes, however, fails to classify reads from novel viruses whose reference sequences are not yet available in public databases. Instead of a mapping approach, and in order to classify sequencing reads at least to a taxonomic level, the performance of artificial neural networks and other machine learning models was studied. Taxonomic and genomic data from the NCBI database were used to sample labelled sequencing reads as training data. The fitted neural network was applied to classify unlabelled reads of simulated and real-world test sets. Additional auxiliary test sets of labelled reads were used to estimate the conditional class probabilities, and to correct the prior estimation of the taxonomic distribution in the actual test set. Among the taxonomic levels, the biological order of viruses provided the most comprehensive database to generate training data. The prediction accuracy of the artificial neural network in classifying test reads to their viral order was considerably higher than that of a random classification. Posterior estimation of taxa frequencies could correct the primary classification results.
23
Srinivasan A, Xue L, Zhan X. Compositional knockoff filter for high-dimensional regression analysis of microbiome data. Biometrics 2021; 77:984-995. [PMID: 32683674 PMCID: PMC7831267 DOI: 10.1111/biom.13336] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 07/23/2019] [Revised: 06/29/2020] [Accepted: 07/09/2020] [Indexed: 01/10/2023]
Abstract
A critical task in microbiome data analysis is to explore the association between a scalar response of interest and a large number of microbial taxa that are summarized as compositional data at different taxonomic levels. Motivated by fine-mapping of the microbiome, we propose a two-step compositional knockoff filter to provide the effective finite-sample false discovery rate (FDR) control in high-dimensional linear log-contrast regression analysis of microbiome compositional data. In the first step, we propose a new compositional screening procedure to remove insignificant microbial taxa while retaining the essential sum-to-zero constraint. In the second step, we extend the knockoff filter to identify the significant microbial taxa in the sparse regression model for compositional data. Thereby, a subset of the microbes is selected from the high-dimensional microbial taxa as related to the response under a prespecified FDR threshold. We study the theoretical properties of the proposed two-step procedure, including both sure screening and effective false discovery control. We demonstrate these properties in numerical simulation studies to compare our methods to some existing ones and show power gain of the new method while controlling the nominal FDR. The potential usefulness of the proposed method is also illustrated with application to an inflammatory bowel disease data set to identify microbial taxa that influence host gene expressions.
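For orientation, the linear log-contrast regression mentioned in this abstract is usually written in the form below. The abstract itself gives no formula, so the notation here is an assumption: $y_i$ is the scalar response for sample $i$, $x_{ij}$ is the relative abundance of taxon $j$ in sample $i$, and $p$ is the number of taxa.

```latex
\[
  y_i \;=\; \sum_{j=1}^{p} \beta_j \log x_{ij} + \varepsilon_i ,
  \qquad \text{subject to} \qquad \sum_{j=1}^{p} \beta_j = 0 .
\]
```

The sum-to-zero constraint makes the fit invariant to rescaling a sample: multiplying every $x_{ij}$ by a constant $c$ adds $\log c \sum_j \beta_j = 0$ to the linear predictor. This is presumably why the screening step described above has to preserve the constraint.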
Affiliation(s)
- Arun Srinivasan
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Lingzhou Xue
  - Department of Statistics, Pennsylvania State University, University Park, PA 16802, U.S.A
- Xiang Zhan
  - Department of Public Health Sciences, Pennsylvania State University, Hershey, PA 17033, U.S.A
24
Stevens BR, Roesch L, Thiago P, Russell JT, Pepine CJ, Holbert RC, Raizada MK, Triplett EW. Depression phenotype identified by using single nucleotide exact amplicon sequence variants of the human gut microbiome. Mol Psychiatry 2021; 26:4277-4287. [PMID: 31988436 DOI: 10.1038/s41380-020-0652-5] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Received: 07/25/2019] [Revised: 01/13/2020] [Accepted: 01/16/2020] [Indexed: 12/15/2022]
Abstract
Single nucleotide exact amplicon sequence variants (ASV) of the human gut microbiome were used to evaluate whether individuals with a depression phenotype (DEPR) could be distinguished from healthy reference subjects (NODEP). Microbial DNA in stool samples obtained from 40 subjects was characterized using high throughput microbiome sequence data processed via DADA2 error correction combined with PIME machine-learning de-noising and taxa binning/parsing of prevalent ASVs at the single nucleotide level of resolution. We applied ALDEx2 differential abundance analysis with assessed effect sizes and stringent PICRUSt2 predicted metabolic pathways. This multivariate machine-learning approach significantly differentiated DEPR (n = 20) vs. NODEP (n = 20) (PERMANOVA P < 0.001) based on microbiome taxa clustering and neurocircuit-relevant metabolic pathway network analysis for GABA, butyrate, glutamate, monoamines, monosaturated fatty acids, and inflammasome components. Gut microbiome dysbiosis using ASV prevalence data may offer the diagnostic potential of using human metaorganism biomarkers to identify individuals with a depression phenotype.
Affiliation(s)
- Bruce R Stevens
  - Department of Physiology and Functional Genomics, University of Florida College of Medicine, Gainesville, FL, USA
  - Department of Psychiatry, University of Florida College of Medicine, Gainesville, FL, USA
  - Division of Gastroenterology, Department of Medicine, University of Florida College of Medicine, Gainesville, FL, USA
- Luiz Roesch
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
  - Centro Interdisciplinar de Pesquisas em Biotecnologia-CIP-Biotec, Universidade Federal do Pampa, São Gabriel, Bagé, Brazil
- Priscila Thiago
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Jordan T Russell
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
- Carl J Pepine
  - Division of Cardiovascular Medicine, Department of Medicine, University of Florida College of Medicine, Gainesville, FL, USA
- Richard C Holbert
  - Department of Psychiatry, University of Florida College of Medicine, Gainesville, FL, USA
- Mohan K Raizada
  - Department of Physiology and Functional Genomics, University of Florida College of Medicine, Gainesville, FL, USA
- Eric W Triplett
  - Department of Microbiology and Cell Science, University of Florida, Gainesville, FL, USA
25
Zhou C, Zhao H, Wang T. Transformation and differential abundance analysis of microbiome data incorporating phylogeny. Bioinformatics 2021; 37:4652-4660. [PMID: 34302462 DOI: 10.1093/bioinformatics/btab543] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 01/06/2021] [Revised: 05/31/2021] [Accepted: 07/22/2021] [Indexed: 12/13/2022] [Open Access]
Abstract
MOTIVATION: Microbiome data have proven extremely useful for understanding microbial communities and their impacts in health and disease. Although microbiome analysis methods and standards are evolving rapidly, obtaining meaningful and interpretable results from microbiome studies still requires careful statistical treatment. In particular, many existing and emerging methods for differential abundance analysis fail to account for the fact that microbiome data are high-dimensional and sparse, compositional, negatively and positively correlated, and phylogenetically structured. To better describe microbiome data and improve the power of differential abundance testing, there is still a great need for the continued development of appropriate statistical methodology. RESULTS: In this paper, we propose a model-based approach for microbiome data transformation, and a phylogenetically informed procedure for differential abundance (DA) testing based on the transformed data. First, we extend the Dirichlet-tree multinomial (DTM) to zero-inflated DTM (ZIDTM) for multivariate modeling of microbial counts, addressing data sparsity, and correlation and phylogeny among bacterial taxa. Then, within this framework and using a Bayesian formulation, we introduce posterior mean transformation to convert raw counts into nonzero relative abundances that sum to one, accounting for the compositional nature of microbiome data. Second, using the transformed data, we propose adaptive analysis of composition of microbiomes (adaANCOM) for DA testing by constructing log-ratios adaptively on the tree for each taxon, greatly reducing the computational complexity of ANCOM in high dimensions. Finally, we present extensive simulation studies, an analysis of HMP data across 18 body sites and 2 visits, and an application to a gut microbiome and malnutrition study, to investigate the performance of posterior mean transformation and adaANCOM. Comparisons with ANCOM and other DA testing procedures show that adaANCOM controls the false discovery rate well, allows for easy interpretation of the results, and is computationally efficient for high-dimensional problems. AVAILABILITY: The developed R package is available at https://github.com/ZRChao/adaANCOM. For replicability purposes, scripts for our simulations and data analysis are available at https://github.com/ZRChao/Papers_supplementary. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Chao Zhou
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Hongyu Zhao
  - Department of Biostatistics, Yale University, New Haven, Connecticut, U.S.A.
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Tao Wang
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
  - MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China
26
Shuler K, Verbanic S, Chen IA, Lee J. A Bayesian nonparametric analysis for zero‐inflated multivariate count data with application to microbiome study. J R Stat Soc Ser C Appl Stat 2021. [DOI: 10.1111/rssc.12493] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/28/2022]
Affiliation(s)
- Kurtis Shuler
  - Sandia National Laboratories in Albuquerque, Albuquerque, NM, USA
- Samuel Verbanic
  - Department of Chemical and Biomolecular Engineering, University of California Los Angeles, Los Angeles, CA, USA
- Irene A. Chen
  - Department of Chemical and Biomolecular Engineering, University of California Los Angeles, Los Angeles, CA, USA
- Juhee Lee
  - Department of Statistics, University of California Santa Cruz, Santa Cruz, CA, USA
27
Fiksel J, Datta A, Amouzou A, Zeger S. Generalized Bayes Quantification Learning under Dataset Shift. J Am Stat Assoc 2021. [DOI: 10.1080/01621459.2021.1909599] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 10/21/2022]
Affiliation(s)
- Jacob Fiksel
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Abhirup Datta
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
- Agbessi Amouzou
  - Department of International Health, Johns Hopkins University, Baltimore, MD
- Scott Zeger
  - Department of Biostatistics, Johns Hopkins University, Baltimore, MD
28
Data Analysis Strategies for Microbiome Studies in Human Populations-a Systematic Review of Current Practice. mSystems 2021; 6:e01154-20. [PMID: 33622856 PMCID: PMC8573962 DOI: 10.1128/msystems.01154-20] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 12/13/2022] [Open Access]
Abstract
Reproducibility is a major issue in microbiome studies, which is partly caused by missing consensus about data analysis strategies. The complex nature of microbiome data, which are high-dimensional, zero-inflated, and compositional, makes them challenging to analyze, as they often violate assumptions of classic statistical methods. With advances in human microbiome research, research questions and study designs increase in complexity so that more sophisticated data analysis concepts are applied. To improve current practice of the analysis of microbiome studies, it is important to understand what kind of research questions are asked and which tools are used to answer these questions. We conducted a systematic literature review considering all publications focusing on the analysis of human microbiome data from June 2018 to June 2019. Of 1,444 studies screened, 419 fulfilled the inclusion criteria. Information about research questions, study designs, and analysis strategies was extracted. The results confirmed the expected shift to more advanced research questions, as one-third of the studies analyzed clustered data. Although heterogeneity in the methods used was found at every stage of the analysis process, it was largest for differential abundance testing. Especially if the underlying data structure was clustered, we identified a lack of use of methods that appropriately addressed the underlying data structure while taking into account additional dependencies in the data. Our results confirm considerable heterogeneity in analysis strategies among microbiome studies; increasingly complex research questions require better guidance for analysis strategies. IMPORTANCE: The human microbiome has emerged as an important factor in the development of health and disease. Growing interest in this topic has led to an increasing number of studies investigating the human microbiome using high-throughput sequencing methods. However, the development of suitable analytical methods for analyzing microbiome data has not kept pace with the rapid progression in the field. It is crucial to understand current practice to identify the scope for development. Our results highlight the need for an extensive evaluation of the strengths and shortcomings of existing methods in order to guide the choice of proper analysis strategies. We have identified where new methods could be designed to address more advanced research questions while taking into account the complex structure of the data.
29
Li Z, Tian L, O’Malley AJ, Karagas MR, Hoen AG, Christensen BC, Madan JC, Wu Q, Gharaibeh RZ, Jobin C, Li H. IFAA: Robust Association Identification and Inference for Absolute Abundance in Microbiome Analyses. J Am Stat Assoc 2021; 116:1595-1608. [PMID: 35241863 PMCID: PMC8890673 DOI: 10.1080/01621459.2020.1860770] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 10/10/2019] [Revised: 09/30/2020] [Accepted: 12/03/2020] [Indexed: 12/15/2022]
Abstract
The target of inference in microbiome analyses is usually relative abundance (RA) because RA in a sample (e.g., stool) can be considered as an approximation of RA in an entire ecosystem (e.g., gut). However, inference on RA suffers from the fact that RA are calculated by dividing absolute abundances (AAs) over the common denominator (CD), the summation of all AA (i.e., library size). Because of that, perturbation in one taxon will result in a change in the CD and thus cause false changes in RA of all other taxa, and those false changes could lead to false positive/negative findings. We propose a novel analysis approach (IFAA) to make robust inference on AA of an ecosystem that can circumvent the issues induced by the CD problem and compositional structure of RA. IFAA can also address the issues of overdispersion and handle zero-inflated data structures. IFAA identifies microbial taxa associated with the covariates in Phase 1 and estimates the association parameters by employing an independent reference taxon in Phase 2. Two real data applications are presented and extensive simulations show that IFAA outperforms other established existing approaches by a big margin in the presence of unbalanced library size. Supplementary materials for this article are available online.
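The common-denominator effect described in this abstract can be seen with a few lines of arithmetic. The sketch below is only an illustration of that effect, not the IFAA method; the taxon names and counts are hypothetical.

```python
def relative_abundance(counts):
    """Divide each absolute abundance by the library size (the common denominator)."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


# Hypothetical absolute abundances; only taxon_A differs between the two conditions.
before = {"taxon_A": 100, "taxon_B": 100, "taxon_C": 200}
after = {"taxon_A": 400, "taxon_B": 100, "taxon_C": 200}

ra_before = relative_abundance(before)
ra_after = relative_abundance(after)

for taxon in before:
    print(
        taxon,
        "AA:", before[taxon], "->", after[taxon],
        "RA:", round(ra_before[taxon], 3), "->", round(ra_after[taxon], 3),
    )
```

The absolute abundances of taxon_B and taxon_C are unchanged, yet their relative abundances fall because the denominator grew from 400 to 700. An analysis of RA alone could report these two taxa as changed.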
Affiliation(s)
- Zhigang Li
  - Department of Biostatistics, University of Florida, Gainesville, FL
- Lu Tian
  - Department of Biomedical Data Science, Stanford University, Palo Alto, CA
- A. James O’Malley
  - The Dartmouth Institute, Geisel School of Medicine at Dartmouth, Hanover, NH
- Margaret R. Karagas
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Anne G. Hoen
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Juliette C. Madan
  - Department of Epidemiology, Geisel School of Medicine at Dartmouth, Hanover, NH
- Quran Wu
  - Department of Biostatistics, University of Florida, Gainesville, FL
- Christian Jobin
  - Department of Medicine, University of Florida, Gainesville, FL
- Hongzhe Li
  - Department of Biostatistics, Epidemiology & Informatics, University of Pennsylvania, Philadelphia, PA
30
Deek RA, Li H. A Zero-Inflated Latent Dirichlet Allocation Model for Microbiome Studies. Front Genet 2021; 11:602594. [PMID: 33552122 PMCID: PMC7862749 DOI: 10.3389/fgene.2020.602594] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 09/03/2020] [Accepted: 12/29/2020] [Indexed: 11/13/2022] [Open Access]
Abstract
The human microbiome consists of a community of microbes in varying abundances and is shown to be associated with many diseases. An important first step in many microbiome studies is to identify possible distinct microbial communities in a given data set and to identify the important bacterial taxa that characterize these communities. The data from typical microbiome studies are high dimensional count data with excessive zeros due to both absence of species (structural zeros) and low sequencing depth or dropout. Although methods have been developed for identifying the microbial communities based on mixture models of counts, these methods do not account for excessive zeros observed in the data and do not differentiate structural from sampling zeros. In this paper, we introduce a zero-inflated Latent Dirichlet Allocation model (zinLDA) for sparse count data observed in microbiome studies. zinLDA builds on the flexible Latent Dirichlet Allocation model and allows for zero inflation in observed counts. We develop an efficient Markov chain Monte Carlo (MCMC) sampling procedure to fit the model. Results from our simulations show zinLDA provides better fits to the data and is able to separate structural zeros from sampling zeros. We apply zinLDA to the data set from the American Gut Project and identify microbial communities characterized by different bacterial genera.
Affiliation(s)
- Rebecca A Deek
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Hongzhe Li
  - Department of Biostatistics, Epidemiology, and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
31
Hu YJ, Lane A, Satten GA. A rarefaction-based extension of the LDM for testing presence-absence associations in the microbiome. Bioinformatics 2021; 37:1652-1657. [PMID: 33479757 PMCID: PMC8289387 DOI: 10.1093/bioinformatics/btab012] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 09/28/2020] [Revised: 12/16/2020] [Accepted: 01/05/2021] [Indexed: 12/13/2022] [Open Access]
Abstract
MOTIVATION: Many methods for testing association between the microbiome and covariates of interest (e.g., clinical outcomes, environmental factors) assume that these associations are driven by changes in the relative abundance of taxa. However, these associations may also result from changes in which taxa are present and which are absent. Analyses of such presence-absence associations face a unique challenge: confounding by library size (total sample read count), which occurs when library size is associated with covariates in the analysis. It is known that rarefaction (subsampling to a common library size) controls this bias, but at the potential cost of information loss as well as the introduction of a stochastic component into the analysis. Currently, there is a need for robust and efficient methods for testing presence-absence associations in the presence of such confounding, both at the community level and at the individual-taxon level, that avoid the drawbacks of rarefaction. RESULTS: We have previously developed the linear decomposition model (LDM) that unifies the community-level and taxon-level tests into one framework. Here we present an extension of the LDM for testing presence-absence associations. The extended LDM is a non-stochastic approach that repeatedly applies the LDM to all rarefied taxa count tables, averages the residual sum-of-squares (RSS) terms over the rarefaction replicates, and then forms an F-statistic based on these average RSS terms. We show that this approach compares favorably to averaging the F-statistic from R rarefaction replicates, which can only be calculated stochastically. The flexible nature of the LDM allows discrete or continuous traits or interactions to be tested while allowing confounding covariates to be adjusted for. Our simulations indicate that our proposed method is robust to any systematic differences in library size and has better power than alternative approaches. We illustrate our method using an analysis of data on inflammatory bowel disease (IBD) in which cases have systematically smaller library sizes than controls. AVAILABILITY: The R package LDM is available on GitHub at https://github.com/yijuanhu/LDM in formats appropriate for Macintosh or Windows. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
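The idea of averaging RSS terms over rarefaction replicates before forming an F-statistic can be illustrated with a small Monte Carlo sketch. This is not the LDM and not the authors' non-stochastic computation: it assumes a single taxon, a single binary trait, ordinary least squares, and a finite number of random rarefactions. The function names (`rarefy`, `rss_pair`, `averaged_rss_f`) and the simulated data are hypothetical.

```python
import numpy as np

def rarefy(counts, depth, rng):
    """Subsample each sample (row) without replacement to `depth` reads."""
    return np.vstack([rng.multivariate_hypergeometric(row, depth) for row in counts])

def rss_pair(y, x):
    """Return (RSS of intercept-only model, RSS of the model y ~ 1 + x)."""
    X = np.column_stack([np.ones(len(x)), x])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    rss_full = float(np.sum((y - X @ beta) ** 2))
    rss_null = float(np.sum((y - y.mean()) ** 2))
    return rss_null, rss_full

def averaged_rss_f(counts, x, taxon, n_rep, rng):
    """Average the RSS terms over rarefaction replicates, then form one F-statistic."""
    depth = int(counts.sum(axis=1).min())  # rarefy to the smallest library size
    totals = np.zeros(2)
    for _ in range(n_rep):
        present = (rarefy(counts, depth, rng)[:, taxon] > 0).astype(float)
        totals += rss_pair(present, x)
    rss_null, rss_full = totals / n_rep
    return (rss_null - rss_full) / (rss_full / (len(x) - 2))

# Simulated data (assumption): 10 controls and 10 cases, where cases have
# roughly half the library size but the same underlying composition.
rng = np.random.default_rng(0)
x = np.repeat([0.0, 1.0], 10)
lam = np.where(x[:, None] == 1, 0.5, 1.0) * np.array([2.0, 50.0, 100.0, 200.0])
counts = rng.poisson(lam)
f_stat = averaged_rss_f(counts, x, taxon=0, n_rep=50, rng=rng)
print(f_stat)
```

One practical reason to average the RSS terms rather than per-replicate F-statistics is that a single replicate can yield a zero residual sum of squares (for example, when the taxon is present in every rarefied sample), which leaves that replicate's F-statistic undefined.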
Affiliation(s)
- Yi-Juan Hu
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
- Andrea Lane
  - Department of Biostatistics and Bioinformatics, Emory University, Atlanta, GA, USA
- Glen A Satten
  - Department of Gynecology and Obstetrics, Emory University School of Medicine, Atlanta, GA, USA
|
32
|
Chen B, Xu W. Generalized estimating equation modeling on correlated microbiome sequencing data with longitudinal measures. PLoS Comput Biol 2020; 16:e1008108. [PMID: 32898133 PMCID: PMC7500673 DOI: 10.1371/journal.pcbi.1008108] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 03/27/2020] [Revised: 09/18/2020] [Accepted: 06/30/2020] [Indexed: 11/19/2022] Open
Abstract
Existing models for assessing microbiome sequencing data, such as operational taxonomic units (OTUs), can only test predictors' effects on OTUs. There is limited work on how to estimate the correlations between multiple OTUs and incorporate such relationships into models to evaluate longitudinal OTU measures. We propose a novel approach to estimate OTU correlations based on their taxonomic structure, and apply this correlation structure in Generalized Estimating Equations (GEE) models to estimate both predictors' effects and OTU correlations. We develop a two-part Microbiome Taxonomic Longitudinal Correlation (MTLC) model for multivariate zero-inflated OTU outcomes based on the GEE framework. In addition, longitudinal and other types of repeated OTU measures are integrated in the MTLC model. Extensive simulations have been conducted to evaluate the performance of the MTLC method. Compared with existing methods, the MTLC method shows robust and consistent estimation, and improved statistical power for testing predictors' effects. Lastly, we demonstrate our proposed method by applying it to a real human microbiome study to evaluate obesity in twins.
Affiliation(s)
- Bo Chen
  - Princess Margaret Hospital, Toronto, Ontario, Canada
- Wei Xu
  - Princess Margaret Hospital, Toronto, Ontario, Canada
  - Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
|
33
|
Liu T, Zhao H, Wang T. An empirical Bayes approach to normalization and differential abundance testing for microbiome data. BMC Bioinformatics 2020; 21:225. [PMID: 32493208 PMCID: PMC7268703 DOI: 10.1186/s12859-020-03552-z] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 11/18/2019] [Accepted: 05/18/2020] [Indexed: 12/14/2022] Open
Abstract
Background: Advances in DNA sequencing have offered researchers an unprecedented opportunity to better study the variety of species living in and on the human body. However, the analysis of microbiome data is complicated by several challenges. First, the sequencing depth may vary by orders of magnitude across samples. Second, species are rare and the data often contain many zeros. Third, the specimen is a fraction of the microbial ecosystem, and so the data are compositional, carrying only relative information. Other characteristics of microbiome data include pronounced over-dispersion in taxon abundances, and the existence of a phylogenetic tree that relates all bacterial species. To address some of these challenges, microbiome analysis workflows often normalize the read counts prior to downstream analysis. However, there are limitations in the current literature on the normalization of microbiome data. Results: Under the multinomial distribution for the read counts and a prior for the unknown proportions, we propose an empirical Bayes approach to microbiome data normalization. Using a tree-based extension of the Dirichlet prior, we further extend our method by incorporating the phylogenetic tree into the normalization process. We study the impact of normalization on differential abundance analysis. In the presence of tree structure, we propose a phylogeny-aware detection procedure. Conclusions: Extensive simulations and gut microbiome data applications are conducted to demonstrate the superior performance of our empirical Bayes method over other normalization methods, and over commonly-used methods for differential abundance testing. Original R scripts are available at GitHub (https://github.com/liudoubletian/eBay).
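As a rough illustration of normalizing under a multinomial likelihood with a prior on the proportions, the sketch below uses the standard Dirichlet-multinomial conjugacy: with counts x and prior Dirichlet(alpha), the posterior mean proportion for taxon j is (x_j + alpha_j) / (N + sum(alpha)). This is not the authors' estimator and ignores the tree-based extension; the prior here is a hypothetical plug-in (pooled proportions scaled by an assumed prior strength of 3 pseudo-reads), and the function name is illustrative only.

```python
import numpy as np

def posterior_mean_proportions(counts, alpha):
    """Posterior mean of multinomial proportions under a Dirichlet(alpha) prior.

    counts: (n_samples, n_taxa) read counts; alpha: (n_taxa,) prior parameters.
    """
    counts = np.asarray(counts, dtype=float)
    alpha = np.asarray(alpha, dtype=float)
    return (counts + alpha) / (counts.sum(axis=1, keepdims=True) + alpha.sum())

# Toy count table (assumption): 3 samples with very different sequencing depths.
counts = np.array([[0, 10, 90],
                   [5, 0, 995],
                   [2, 3, 45]])

# Hypothetical plug-in prior: pooled proportions times an assumed prior strength.
pooled = counts.sum(axis=0) / counts.sum()
alpha = 3.0 * pooled

normalized = posterior_mean_proportions(counts, alpha)
print(normalized)
```

Each normalized row sums to one and every zero count is replaced by a small positive proportion, with shallower samples shrunk more strongly toward the pooled composition.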
Affiliation(s)
- Tiantian Liu
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Hongyu Zhao
  - Department of Biostatistics, Yale University, 300 George Street, New Haven, 06511, USA
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
- Tao Wang
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
  - MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, 800 Dongchuan Road, Shanghai, 200240, China
|
34
|
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach. Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in the analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and the future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
|
35
|
Cullen CM, Aneja KK, Beyhan S, Cho CE, Woloszynek S, Convertino M, McCoy SJ, Zhang Y, Anderson MZ, Alvarez-Ponce D, Smirnova E, Karstens L, Dorrestein PC, Li H, Sen Gupta A, Cheung K, Powers JG, Zhao Z, Rosen GL. Emerging Priorities for Microbiome Research. Front Microbiol 2020; 11:136. [PMID: 32140140 PMCID: PMC7042322 DOI: 10.3389/fmicb.2020.00136] [Citation(s) in RCA: 77] [Impact Index Per Article: 19.3] [Received: 08/13/2019] [Accepted: 01/21/2020] [Indexed: 12/12/2022] Open
Abstract
Microbiome research has increased dramatically in recent years, driven by advances in technology and significant reductions in the cost of analysis. Such research has unlocked a wealth of data, which has yielded tremendous insight into the nature of the microbial communities, including their interactions and effects, both within a host and in an external environment as part of an ecological community. Understanding the role of microbiota, including their dynamic interactions with their hosts and other microbes, can enable the engineering of new diagnostic techniques and interventional strategies that can be used in a diverse spectrum of fields, spanning from ecology and agriculture to medicine and from forensics to exobiology. From June 19-23 in 2017, the NIH and NSF jointly held an Innovation Lab on Quantitative Approaches to Biomedical Data Science Challenges in our Understanding of the Microbiome. This review is inspired by some of the topics that arose as priority areas from this unique, interactive workshop. The goal of this review is to summarize the Innovation Lab's findings by introducing the reader to emerging challenges, exciting potential, and current directions in microbiome research. The review is broken into five key topic areas: (1) interactions between microbes and the human body, (2) evolution and ecology of microbes, including the role played by the environment and microbe-microbe interactions, (3) analytical and mathematical methods currently used in microbiome research, (4) leveraging knowledge of microbial composition and interactions to develop engineering solutions, and (5) interventional approaches and engineered microbiota that may be enabled by selectively altering microbial composition. As such, this review seeks to arm the reader with a broad understanding of the priorities and challenges in microbiome research today and provide inspiration for future investigation and multi-disciplinary collaboration.
Affiliation(s)
- Chad M. Cullen
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States
- Sinem Beyhan
  - Department of Infectious Diseases, J. Craig Venter Institute, La Jolla, CA, United States
- Clara E. Cho
  - Department of Nutrition, Dietetics and Food Sciences, Utah State University, Logan, UT, United States
- Stephen Woloszynek
  - Ecological and Evolutionary Signal-processing and Informatics Laboratory (EESI), Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States
  - College of Medicine, Drexel University, Philadelphia, PA, United States
- Matteo Convertino
  - Nexus Group, Faculty of Information Science and Technology, Gi-CoRE Station for Big Data & Cybersecurity, Hokkaido University, Sapporo, Japan
- Sophie J. McCoy
  - Department of Biological Science, Florida State University, Tallahassee, FL, United States
- Yanyan Zhang
  - Department of Civil Engineering, New Mexico State University, Las Cruces, NM, United States
- Matthew Z. Anderson
  - Department of Microbiology, The Ohio State University, Columbus, OH, United States
  - Department of Microbial Infection and Immunity, The Ohio State University, Columbus, OH, United States
- Ekaterina Smirnova
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA, United States
- Lisa Karstens
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, United States
  - Department of Obstetrics and Gynecology, Oregon Health & Science University, Portland, OR, United States
- Pieter C. Dorrestein
  - Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, San Diego, CA, United States
- Hongzhe Li
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, United States
- Ananya Sen Gupta
  - Department of Electrical and Computer Engineering, The University of Iowa, Iowa City, IA, United States
- Kevin Cheung
  - Department of Dermatology, The University of Iowa, Iowa City, IA, United States
- Zhengqiao Zhao
  - Ecological and Evolutionary Signal-processing and Informatics Laboratory (EESI), Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States
- Gail L. Rosen
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, PA, United States
  - Ecological and Evolutionary Signal-processing and Informatics Laboratory (EESI), Electrical and Computer Engineering, Drexel University, Philadelphia, PA, United States
|
36
|
Jiang D, Armour CR, Hu C, Mei M, Tian C, Sharpton TJ, Jiang Y. Microbiome Multi-Omics Network Analysis: Statistical Considerations, Limitations, and Opportunities. Front Genet 2019; 10:995. [PMID: 31781153 PMCID: PMC6857202 DOI: 10.3389/fgene.2019.00995] [Citation(s) in RCA: 83] [Impact Index Per Article: 16.6] [Received: 02/13/2019] [Accepted: 09/18/2019] [Indexed: 12/21/2022] Open
Abstract
The advent of large-scale microbiome studies affords newfound analytical opportunities to understand how these communities of microbes operate and relate to their environment. However, the analytical methodology needed to model microbiome data and integrate them with other data constructs remains nascent. This emergent analytical toolset frequently ports over techniques developed in other multi-omics investigations, especially the growing array of statistical and computational techniques for integrating and representing data through networks. While network analysis has emerged as a powerful approach to modeling microbiome data, oftentimes by integrating these data with other types of omics data to discern their functional linkages, it is not always evident if the statistical details of the approach being applied are consistent with the assumptions of microbiome data or how they impact data interpretation. In this review, we overview some of the most important network methods for integrative analysis, with an emphasis on methods that have been applied or have great potential to be applied to the analysis of multi-omics integration of microbiome data. We compare advantages and disadvantages of various statistical tools, assess their applicability to microbiome data, and discuss their biological interpretability. We also highlight on-going statistical challenges and opportunities for integrative network analysis of microbiome data.
Affiliation(s)
- Duo Jiang
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Courtney R Armour
  - Department of Microbiology, Oregon State University, Corvallis, OR, United States
- Chenxiao Hu
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Meng Mei
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Chuan Tian
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
- Thomas J Sharpton
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
  - Department of Microbiology, Oregon State University, Corvallis, OR, United States
- Yuan Jiang
  - Department of Statistics, Oregon State University, Corvallis, OR, United States
|
37
|
Song Y, Zhao H, Wang T. An adaptive independence test for microbiome community data. Biometrics 2019; 76:414-426. [DOI: 10.1111/biom.13154] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 11/11/2018] [Accepted: 09/16/2019] [Indexed: 11/29/2022]
Affiliation(s)
- Yaru Song
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Hongyu Zhao
  - Department of Biostatistics, Yale University, New Haven, Connecticut
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Tao Wang
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
  - MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China
|
38
|
Tang ZZ, Chen G. Robust and Powerful Differential Composition Tests for Clustered Microbiome Data. Statistics in Biosciences 2019. [DOI: 10.1007/s12561-019-09251-5] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.6] [Indexed: 02/07/2023]
|
39
|
Tang ZZ, Chen G, Hong Q, Huang S, Smith HM, Shah RD, Scholz M, Ferguson JF. Multi-Omic Analysis of the Microbiome and Metabolome in Healthy Subjects Reveals Microbiome-Dependent Relationships Between Diet and Metabolites. Front Genet 2019; 10:454. [PMID: 31164901 PMCID: PMC6534069 DOI: 10.3389/fgene.2019.00454] [Citation(s) in RCA: 67] [Impact Index Per Article: 13.4] [Received: 01/29/2019] [Accepted: 04/30/2019] [Indexed: 12/22/2022] Open
Abstract
The human microbiome has been associated with health status, and risk of disease development. While the etiology of microbiome-mediated disease remains to be fully elucidated, one mechanism may be through microbial metabolism. Metabolites produced by commensal organisms, including in response to host diet, may affect host metabolic processes, with potentially protective or pathogenic consequences. We conducted multi-omic phenotyping of healthy subjects (N = 136), in order to investigate the interaction between diet, the microbiome, and the metabolome in a cross-sectional sample. We analyzed the nutrient composition of self-reported diet (3-day food records and food frequency questionnaires). We profiled the gut and oral microbiome (16S rRNA) from stool and saliva, and applied metabolomic profiling to plasma and stool samples in a subset of individuals (N = 75). We analyzed these multi-omic data to investigate the relationship between diet, the microbiome, and the gut and circulating metabolome. On a global level, we observed significant relationships, particularly between long-term diet, the gut microbiome and the metabolome. Intake of plant-derived nutrients as well as consumption of artificial sweeteners were associated with significant differences in circulating metabolites, particularly bile acids, which were dependent on gut enterotype, indicating that microbiome composition mediates the effect of diet on host physiology. Our analysis identifies dietary compounds and phytochemicals that may modulate bacterial abundance within the gut and interact with microbiome composition to alter host metabolism.
Affiliation(s)
- Zheng-Zheng Tang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison, WI, United States
  - Wisconsin Institute for Discovery, Madison, WI, United States
- Guanhua Chen
  - Department of Biostatistics and Medical Informatics, University of Wisconsin–Madison, Madison, WI, United States
- Qilin Hong
  - Department of Statistics, University of Wisconsin–Madison, Madison, WI, United States
- Shi Huang
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
- Holly M. Smith
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
- Rachana D. Shah
  - Division of Pediatric Endocrinology, Children’s Hospital of Philadelphia, Philadelphia, PA, United States
- Matthew Scholz
  - Vanderbilt Technologies for Advanced Genomics (VANTAGE), Vanderbilt University Medical Center, Nashville, TN, United States
- Jane F. Ferguson
  - Division of Cardiovascular Medicine, Vanderbilt University Medical Center, Nashville, TN, United States
  - Vanderbilt Translational and Clinical Cardiovascular Research Center (VTRACC), Vanderbilt University Medical Center, Nashville, TN, United States
|