1
Li G, Li Y, Chen K. It's all relative: Regression analysis with compositional predictors. Biometrics 2023; 79:1318-1329. [PMID: 35616500 PMCID: PMC9767704 DOI: 10.1111/biom.13703] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2021] [Accepted: 05/18/2022] [Indexed: 01/05/2023]
Abstract
Compositional data reside in a simplex and measure fractions or proportions of parts to a whole. Most existing regression methods for such data rely on log-ratio transformations that are inadequate or inappropriate in modeling high-dimensional data with excessive zeros and hierarchical structures. Moreover, such models usually lack a straightforward interpretation due to the interrelation between parts of a composition. We develop a novel relative-shift regression framework that directly uses proportions as predictors. The new framework provides a paradigm shift for regression analysis with compositional predictors and offers a superior interpretation of how shifting concentration between parts affects the response. New equi-sparsity and tree-guided regularization methods and an efficient smoothing proximal gradient algorithm are developed to facilitate feature aggregation and dimension reduction in regression. A unified finite-sample prediction error bound is derived for the proposed regularized estimators. We demonstrate the efficacy of the proposed methods in extensive simulation studies and a real gut microbiome study. Guided by the taxonomy of the microbiome data, the framework identifies important taxa at different taxonomic levels associated with the neurodevelopment of preterm infants.
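The zero problem mentioned in this abstract can be illustrated with a minimal sketch. It is not the authors' relative-shift method; it only shows, under the assumption that the centered log-ratio (clr) is the transformation in question, why a composition with a zero part cannot be log-ratio transformed directly. The function names `closure` and `clr` and the sample counts are illustrative.

```python
import math

def closure(counts):
    """Scale non-negative counts so they sum to 1 (a point on the simplex)."""
    total = sum(counts)
    if total <= 0:
        raise ValueError("counts must have a positive total")
    return [c / total for c in counts]

def clr(composition):
    """Centered log-ratio transform; defined only for strictly positive parts."""
    if any(p <= 0 for p in composition):
        raise ValueError("clr is undefined when any part is zero")
    logs = [math.log(p) for p in composition]
    mean_log = sum(logs) / len(logs)
    return [v - mean_log for v in logs]

dense = closure([10, 30, 60])    # hypothetical sample without zeros
sparse = closure([0, 25, 75])    # hypothetical sample with one zero part

print(clr(dense))                # clr values sum to zero
try:
    clr(sparse)
except ValueError as err:
    print("sparse sample:", err)
```

In practice a pseudocount or zero-replacement step is usually applied before a log-ratio transform; working with the proportions directly, as the abstract describes, avoids that step.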
Affiliation(s)
- Gen Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Yan Li
  - Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor, Michigan, USA
- Kun Chen
  - Department of Statistics, University of Connecticut, Connecticut, USA
2
Principal Amalgamation Analysis for Microbiome Data. Genes (Basel) 2022; 13:genes13071139. [PMID: 35885922 PMCID: PMC9318429 DOI: 10.3390/genes13071139] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/03/2022] [Revised: 06/14/2022] [Accepted: 06/21/2022] [Indexed: 12/02/2022] Open
Abstract
In recent years, microbiome studies have become increasingly prevalent and large-scale. Through high-throughput sequencing technologies and well-established analytical pipelines, relative abundance data of operational taxonomic units and their associated taxonomic structures are routinely produced. Since such data can be extremely sparse and high dimensional, there is often a genuine need for dimension reduction to facilitate data visualization and downstream statistical analysis. We propose Principal Amalgamation Analysis (PAA), a novel amalgamation-based and taxonomy-guided dimension reduction paradigm for microbiome data. Our approach aims to aggregate the compositions into a smaller number of principal compositions, guided by the available taxonomic structure, by minimizing a properly measured loss of information. The choice of the loss function is flexible and can be based on familiar diversity indices for preserving either within-sample or between-sample diversity in the data. To enable scalable computation, we develop a hierarchical PAA algorithm to trace the entire trajectory of successive simple amalgamations. Visualization tools including dendrograms, scree plots, and ordination plots are developed. The effectiveness of PAA is demonstrated using gut microbiome data from a preterm infant study and an HIV infection study.
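A single amalgamation step and a diversity-based information loss can be sketched as follows. This is an illustration only, not the PAA algorithm: the grouping `genus_groups`, the sample values, and the choice of the Shannon index as the loss are assumptions made for the example.

```python
import math

def amalgamate(composition, groups):
    """Sum the parts listed in each group; groups must partition the indices."""
    seen = sorted(i for g in groups for i in g)
    if seen != list(range(len(composition))):
        raise ValueError("groups must cover every part exactly once")
    return [sum(composition[i] for i in g) for g in groups]

def shannon(composition):
    """Shannon diversity index, with 0 * log(0) treated as 0."""
    return -sum(p * math.log(p) for p in composition if p > 0)

# Hypothetical five-taxon sample and a hypothetical taxonomy-based grouping.
sample = [0.05, 0.15, 0.0, 0.30, 0.50]
genus_groups = [[0, 1], [2, 3], [4]]

merged = amalgamate(sample, genus_groups)
loss = shannon(sample) - shannon(merged)   # within-sample diversity lost by merging
print(merged, loss)
```

Because merging parts can never increase Shannon entropy, the loss here is non-negative, and the merged vector is still a composition summing to one. Zero parts pose no difficulty, since amalgamation only adds proportions.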
3
Goren E, Wang C, He Z, Sheflin AM, Chiniquy D, Prenni JE, Tringe S, Schachtman DP, Liu P. Feature selection and causal analysis for microbiome studies in the presence of confounding using standardization. BMC Bioinformatics 2021; 22:362. [PMID: 34229628 PMCID: PMC8261956 DOI: 10.1186/s12859-021-04232-2] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/31/2021] [Accepted: 06/03/2021] [Indexed: 12/25/2022] Open
Abstract
BACKGROUND Microbiome studies have uncovered associations between microbes and human, animal, and plant health outcomes. This has led to an interest in developing microbial interventions for treatment of disease and optimization of crop yields, which requires identification of microbiome features that impact the outcome in the population of interest. That task is challenging because of the high dimensionality of microbiome data and the confounding that results from the complex and dynamic interactions among host, environment, and microbiome. In the presence of such confounding, variable selection and estimation procedures may have unsatisfactory performance in identifying microbial features with an effect on the outcome.
RESULTS In this manuscript, we aim to estimate population-level effects of individual microbiome features while controlling for confounding by a categorical variable. Due to the high dimensionality and confounding-induced correlation between features, we propose feature screening, selection, and estimation conditional on each stratum of the confounder followed by a standardization approach to estimation of population-level effects of individual features. Comprehensive simulation studies demonstrate the advantages of our approach in recovering relevant features. Utilizing a potential-outcomes framework, we outline assumptions required to ascribe causal, rather than associational, interpretations to the identified microbiome effects. We conducted an agricultural study of the rhizosphere microbiome of sorghum in which nitrogen fertilizer application is a confounding variable. In this study, the proposed approach identified microbial taxa that are consistent with biological understanding of potential plant-microbe interactions.
CONCLUSIONS Standardization enables more accurate identification of individual microbiome features with an effect on the outcome of interest compared to other variable selection and estimation procedures when there is confounding by a categorical variable.
Affiliation(s)
- Emily Goren
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
- Chong Wang
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
  - Department of Veterinary Diagnostic and Production Animal Medicine, Iowa State University, 2203 Lloyd Veterinary Medical Center, Ames, IA, 50011, USA
- Zhulin He
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
- Amy M Sheflin
  - Department of Horticulture and Landscape Architecture, Colorado State University, 301 University Ave, Fort Collins, CO, 80523, USA
- Dawn Chiniquy
  - Department of Energy, Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, 94598, USA
- Jessica E Prenni
  - Department of Horticulture and Landscape Architecture, Colorado State University, 301 University Ave, Fort Collins, CO, 80523, USA
- Susannah Tringe
  - Department of Energy, Joint Genome Institute, 2800 Mitchell Dr, Walnut Creek, CA, 94598, USA
- Daniel P Schachtman
  - Department of Agronomy and Horticulture, University of Nebraska, 1825 N 38th St, Lincoln, NE, 68583, USA
- Peng Liu
  - Department of Statistics, Iowa State University, 2438 Osborn Dr, Ames, IA, 50011, USA
4
Thomas EG, Trippa L, Parmigiani G, Dominici F. Estimating the Effects of Fine Particulate Matter on 432 Cardiovascular Diseases Using Multi-Outcome Regression With Tree-Structured Shrinkage. J Am Stat Assoc 2020. [DOI: 10.1080/01621459.2020.1722134] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/24/2022]
Affiliation(s)
- Emma G. Thomas
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
- Lorenzo Trippa
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA
- Giovanni Parmigiani
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Department of Data Sciences, Dana-Farber Cancer Institute, Boston, MA
- Francesca Dominici
  - Department of Biostatistics, Harvard T.H. Chan School of Public Health, Boston, MA
  - Harvard Data Science Initiative, Cambridge, MA
5
Koslovsky MD, Hoffman KL, Daniel CR, Vannucci M. A Bayesian model of microbiome data for simultaneous identification of covariate associations and prediction of phenotypic outcomes. Ann Appl Stat 2020. [DOI: 10.1214/20-aoas1354] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/19/2022]
6
Koslovsky MD, Vannucci M. MicroBVS: Dirichlet-tree multinomial regression models with Bayesian variable selection - an R package. BMC Bioinformatics 2020; 21:301. [PMID: 32660471 PMCID: PMC7359232 DOI: 10.1186/s12859-020-03640-0] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 01/02/2019] [Accepted: 07/02/2020] [Indexed: 11/29/2022] Open
Abstract
Background Understanding the relation between the human microbiome and modulating factors, such as diet, may help researchers design intervention strategies that promote and maintain healthy microbial communities. Numerous analytical tools are available to help identify these relations, oftentimes via automated variable selection methods. However, available tools frequently ignore evolutionary relations among microbial taxa, potential relations between modulating factors, as well as model selection uncertainty.
Results We present MicroBVS, an R package for Dirichlet-tree multinomial models with Bayesian variable selection, for the identification of covariates associated with microbial taxa abundance data. The underlying Bayesian model accommodates phylogenetic structure in the abundance data and various parameterizations of covariates’ prior probabilities of inclusion.
Conclusion While developed to study the human microbiome, our software can be employed in various research applications, where the aim is to generate insights into the relations between a set of covariates and compositional data with or without a known tree-like structure.
7
Xia Y. Correlation and association analyses in microbiome study integrating multiomics in health and disease. Progress in Molecular Biology and Translational Science 2020; 171:309-491. [PMID: 32475527 DOI: 10.1016/bs.pmbts.2020.04.003] [Citation(s) in RCA: 37] [Impact Index Per Article: 9.3] [Indexed: 02/07/2023]
Abstract
Correlation and association analyses are among the most widely used statistical methods in research fields, including microbiome and integrative multiomics studies. Correlation and association have two implications: dependence and co-occurrence. Microbiome data are structured as a phylogenetic tree and have several unique characteristics, including high dimensionality, compositionality, sparsity with excess zeros, and heterogeneity. These unique characteristics cause several statistical issues when analyzing microbiome data and integrating multiomics data, such as large p and small n, dependency, overdispersion, and zero-inflation. In microbiome research, on the one hand, classic correlation and association methods are still applied in real studies and used for the development of new methods; on the other hand, new methods have been developed to target statistical issues arising from unique characteristics of microbiome data. Here, we first provide a comprehensive view of classic and newly developed univariate correlation and association-based methods. We discuss the appropriateness and limitations of using classic methods and demonstrate how the newly developed methods mitigate the issues of microbiome data. Second, we emphasize that concepts of correlation and association analyses have been shifted by introducing network analysis, microbe-metabolite interactions, functional analysis, etc. Third, we introduce multivariate correlation and association-based methods, which are organized by the categories of exploratory, interpretive, and discriminatory analyses and classification methods. Fourth, we focus on the hypothesis testing of univariate and multivariate regression-based association methods, including alpha and beta diversities-based, count-based, and relative abundance (or compositional)-based association analyses. We demonstrate the characteristics and limitations of each approach.
Fifth, we introduce two specific microbiome-based methods: phylogenetic tree-based association analysis and testing for survival outcomes. Sixth, we provide an overall view of longitudinal methods in analysis of microbiome and omics data, which cover standard, static, regression-based time series methods, principal trend analysis, and newly developed univariate overdispersed and zero-inflated as well as multivariate distance/kernel-based longitudinal models. Finally, we comment on current association analysis and future direction of association analysis in microbiome and multiomics studies.
Affiliation(s)
- Yinglin Xia
  - Department of Medicine, University of Illinois at Chicago, Chicago, IL, United States
8
Bichat A, Plassais J, Ambroise C, Mariadassou M. Incorporating Phylogenetic Information in Microbiome Differential Abundance Studies Has No Effect on Detection Power and FDR Control. Front Microbiol 2020; 11:649. [PMID: 32351481 PMCID: PMC7174607 DOI: 10.3389/fmicb.2020.00649] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 08/02/2019] [Accepted: 03/20/2020] [Indexed: 12/18/2022] Open
Abstract
We consider the problem of incorporating evolutionary information (e.g., taxonomic or phylogenetic trees) in the context of metagenomics differential analysis. Recent results published in the literature propose different ways to leverage the tree structure to increase the detection rate of differentially abundant taxa. Here, we propose instead to use a different hierarchical structure, in the form of a correlation-based tree, as it may capture the structure of the data better than the phylogeny. We first show that the correlation tree and the phylogeny are significantly different before turning to the impact of tree choice on detection rates. Using synthetic data, we show that the tree does have an impact: smoothing p-values according to the phylogeny leads to detection rates equal or inferior to those obtained by smoothing according to the correlation tree. However, both trees are outperformed by the classical, non-hierarchical, Benjamini–Hochberg (BH) procedure in terms of detection rates. Other procedures may use the hierarchical structure with profit but do not control the False Discovery Rate (FDR) a priori and remain inferior to a classical Benjamini–Hochberg procedure with the same nominal FDR. On real datasets, no hierarchical procedure had a significantly higher detection rate than BH. Intuition advocates that the use of hierarchical structures should increase the detection rate of differentially abundant taxa in microbiome studies. However, our results suggest that current hierarchical procedures are still inferior to standard methods and more effective procedures remain to be invented.
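For readers unfamiliar with the baseline named in this abstract, the classical Benjamini–Hochberg step-up procedure can be sketched in a few lines. This is a generic textbook version, not code from the paper; the function name `benjamini_hochberg` and the p-values are illustrative.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking hypotheses rejected at FDR level alpha."""
    m = len(p_values)
    # Indices of the p-values in ascending order.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k with p_(k) <= k * alpha / m.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank * alpha / m:
            cutoff_rank = rank
    # Reject every hypothesis whose rank is at most that cutoff.
    rejected = [False] * m
    for idx in order[:cutoff_rank]:
        rejected[idx] = True
    return rejected

# Hypothetical p-values for eight taxa.
p_values = [0.041, 0.001, 0.205, 0.008, 0.039, 0.074, 0.042, 0.060]
print(benjamini_hochberg(p_values))
```

With these eight p-values and alpha = 0.05, only the two smallest (0.001 and 0.008) fall below their rank-scaled thresholds, so only those two hypotheses are rejected. The procedure uses no tree or grouping information, which is what makes it the non-hierarchical reference point.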
Affiliation(s)
- Antoine Bichat
  - LaMME, Université Paris-Saclay, CNRS, Université d'Évry val d'Essonne, Évry, France
  - Enterome, Paris, France
- Christophe Ambroise
  - LaMME, Université Paris-Saclay, CNRS, Université d'Évry val d'Essonne, Évry, France
9
Wang T, Yang C, Zhao H. Prediction analysis for microbiome sequencing data. Biometrics 2019; 75:875-884. [PMID: 30994187 DOI: 10.1111/biom.13061] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 10/25/2017] [Revised: 03/08/2019] [Accepted: 03/13/2019] [Indexed: 01/22/2023]
Abstract
One goal of human microbiome studies is to relate host traits with human microbiome compositions. The analysis of microbial community sequencing data presents great statistical challenges, especially when the samples have different library sizes and the data are overdispersed with many zeros. To address these challenges, we introduce a new statistical framework, called predictive analysis in metagenomics via inverse regression (PAMIR), to analyze microbiome sequencing data. Within this framework, an inverse regression model is developed for overdispersed microbiota counts given the trait, and then a prediction rule is constructed by taking advantage of the dimension-reduction structure in the model. An efficient Monte Carlo expectation-maximization algorithm is proposed for maximum likelihood estimation. The method is further generalized to accommodate other types of covariates. We demonstrate the advantages of PAMIR through simulations and two real data examples.
Affiliation(s)
- Tao Wang
  - Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
  - MoE Key Lab of Artificial Intelligence, Shanghai Jiao Tong University, Shanghai, China
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
- Can Yang
  - Department of Mathematics, Hong Kong University of Science and Technology, Kowloon, Hong Kong
- Hongyu Zhao
  - SJTU-Yale Joint Center for Biostatistics and Data Science, Shanghai Jiao Tong University, Shanghai, China
  - Department of Biostatistics, Yale University, New Haven, Connecticut
10
Ajana S, Acar N, Bretillon L, Hejblum BP, Jacqmin-Gadda H, Delcourt C, Acar N, Ajana S, Berdeaux O, Bouton S, Bretillon L, Bron A, Buaud B, Cabaret S, Cougnard-Grégoire A, Creuzot-Garcher C, Delcourt C, Delyfer MN, Féart-Couret C, Febvret V, Grégoire S, He Z, Korobelnik JF, Martine L, Merle B, Vaysse C. Benefits of dimension reduction in penalized regression methods for high-dimensional grouped data: a case study in low sample size. Bioinformatics 2019; 35:3628-3634. [DOI: 10.1093/bioinformatics/btz135] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 03/14/2018] [Revised: 02/08/2019] [Accepted: 02/23/2019] [Indexed: 01/10/2023] Open
Abstract
Motivation
In some prediction analyses, predictors have a natural grouping structure and selecting predictors accounting for this additional information could be more effective for predicting the outcome accurately. Moreover, in a high dimension low sample size framework, obtaining a good predictive model becomes very challenging. The objective of this work was to investigate the benefits of dimension reduction in penalized regression methods, in terms of prediction performance and variable selection consistency, in high dimension low sample size data. Using two real datasets, we compared the performances of lasso, elastic net, group lasso, sparse group lasso, sparse partial least squares (PLS), group PLS and sparse group PLS.
Results
Considering dimension reduction in penalized regression methods improved the prediction accuracy. The sparse group PLS reached the lowest prediction error while consistently selecting a few predictors from a single group.
Availability and implementation
R codes for the prediction methods are freely available at https://github.com/SoufianeAjana/Blisar.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Soufiane Ajana
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
- Niyazi Acar
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Lionel Bretillon
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Boris P Hejblum
  - ISPED, Inserm, Bordeaux Population Health Research Center 1219, Inria SISTM, University of Bordeaux, F-33000 Bordeaux, France
  - Vaccine Research Institute (VRI), Hôpital Henri Mondor, Créteil, France
- Hélène Jacqmin-Gadda
  - Inserm, Bordeaux Population Health Research Center, Team Biostatistics, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
- Cécile Delcourt
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
- Olivier Berdeaux
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Alain Bron
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
  - Department of Ophthalmology, University Hospital, Dijon, France
- Benjamin Buaud
  - ITERG—Equipe Nutrition Métabolisme & Santé, Bordeaux, France
- Stéphanie Cabaret
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Audrey Cougnard-Grégoire
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
- Catherine Creuzot-Garcher
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
  - Department of Ophthalmology, University Hospital, Dijon, France
- Marie-Noelle Delyfer
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
  - Service d’Ophtalmologie, CHU de Bordeaux, F-33000 Bordeaux, France
- Catherine Féart-Couret
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
- Valérie Febvret
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Stéphane Grégoire
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Zhiguo He
  - Laboratory for Biology, Imaging, and Engineering of Corneal Grafts, EA2521, Faculty of Medicine, University Jean Monnet, Saint-Etienne, France
- Jean-François Korobelnik
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
  - Service d’Ophtalmologie, CHU de Bordeaux, F-33000 Bordeaux, France
- Lucy Martine
  - Centre des Sciences du Goût et de l'Alimentation, AgroSup Dijon, CNRS, INRA, Université Bourgogne Franche-Comté, Dijon, France
- Bénédicte Merle
  - Inserm, Bordeaux Population Health Research Center, Team LEHA, UMR 1219, University of Bordeaux, F-33000 Bordeaux, France
- Carole Vaysse
  - ITERG—Equipe Nutrition Métabolisme & Santé, Bordeaux, France
11
Xiao J, Chen L, Yu Y, Zhang X, Chen J. A Phylogeny-Regularized Sparse Regression Model for Predictive Modeling of Microbial Community Data. Front Microbiol 2018; 9:3112. [PMID: 30619188 PMCID: PMC6305753 DOI: 10.3389/fmicb.2018.03112] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Received: 08/31/2018] [Accepted: 12/03/2018] [Indexed: 12/16/2022] Open
Abstract
Fueled by technological advancement, there has been a surge of human microbiome studies surveying the microbial communities associated with the human body and their links with health and disease. As a complement to the human genome, the human microbiome holds great potential for precision medicine. Efficient predictive models based on microbiome data could be potentially used in various clinical applications such as disease diagnosis, patient stratification and drug response prediction. One important characteristic of the microbial community data is the phylogenetic tree that relates all the microbial taxa based on their evolutionary history. The phylogenetic tree is an informative prior for more efficient prediction since the microbial community changes are usually not randomly distributed on the tree but tend to occur in clades at varying phylogenetic depths (clustered signal). Although community-wide changes are possible for some conditions, it is also likely that the community changes are only associated with a small subset of "marker" taxa (sparse signal). Unfortunately, predictive models of microbial community data taking into account both the sparsity and the tree structure remain under-developed. In this paper, we propose a predictive framework to exploit sparse and clustered microbiome signals using a phylogeny-regularized sparse regression model. Our approach is motivated by evolutionary theory, where a natural correlation structure among microbial taxa exists according to the phylogenetic relationship. A novel phylogeny-based smoothness penalty is proposed to smooth the coefficients of the microbial taxa with respect to the phylogenetic tree. Using simulated and real datasets, we show that our method achieves better prediction performance than competing sparse regression methods for sparse and clustered microbiome signals.
Affiliation(s)
- Jian Xiao
  - Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Li Chen
  - Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL, United States
- Yue Yu
  - Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
- Xianyang Zhang
  - Department of Statistics, Texas A&M University, College Station, TX, United States
- Jun Chen
  - Division of Biomedical Statistics and Informatics, Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
12
Hui FKC, Müller S, Welsh AH. Sparse Pairwise Likelihood Estimation for Multivariate Longitudinal Mixed Models. J Am Stat Assoc 2018. [DOI: 10.1080/01621459.2017.1371026] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Indexed: 01/14/2023]
Affiliation(s)
- Francis K. C. Hui
  - Mathematical Sciences Institute, The Australian National University, Canberra, Australia
- Samuel Müller
  - School of Mathematics and Statistics, University of Sydney, Sydney, Australia
- A. H. Welsh
  - Mathematical Sciences Institute, The Australian National University, Canberra, Australia
13
Xiao J, Chen L, Johnson S, Yu Y, Zhang X, Chen J. Predictive Modeling of Microbiome Data Using a Phylogeny-Regularized Generalized Linear Mixed Model. Front Microbiol 2018; 9:1391. [PMID: 29997602 PMCID: PMC6030386 DOI: 10.3389/fmicb.2018.01391] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Received: 02/09/2018] [Accepted: 06/06/2018] [Indexed: 12/21/2022] Open
Abstract
Recent human microbiome studies have revealed an essential role of the human microbiome in health and disease, opening up the possibility of building microbiome-based predictive models for individualized medicine. One unique characteristic of microbiome data is the existence of a phylogenetic tree that relates all the microbial species. It has frequently been observed that a cluster or clusters of bacteria at varying phylogenetic depths are associated with some clinical or biological outcome due to shared biological function (clustered signal). Moreover, in many cases, we observe a community-level change, where a large number of functionally interdependent species are associated with the outcome (dense signal). We thus develop "glmmTree," a prediction method based on a generalized linear mixed model framework, for capturing clustered and dense microbiome signals. glmmTree uses the similarity between microbiomes, which is defined based on the microbiome composition and the phylogenetic tree, to predict the outcome. The effects of other predictive variables (e.g., age, sex) can be incorporated readily in the regression framework. Additional tuning parameters enable a data-adaptive approach to capture signals at different phylogenetic depths and abundance levels. Simulation studies and real data applications demonstrated that "glmmTree" outperformed existing methods in the dense and clustered signal scenarios.
Affiliation(s)
- Jian Xiao
  - Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Hubei, China
- Li Chen
  - Department of Health Outcomes Research and Policy, Harrison School of Pharmacy, Auburn University, Auburn, AL, United States
- Stephen Johnson
  - Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
- Yue Yu
  - Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
- Xianyang Zhang
  - Department of Statistics, Texas A&M University, College Station, TX, United States
- Jun Chen
  - Division of Biomedical Statistics and Informatics and Center for Individualized Medicine, Mayo Clinic, Rochester, MN, United States
14
Sutton M, Thiébaut R, Liquet B. Sparse partial least squares with group and subgroup structure. Stat Med 2018; 37:3338-3356. [PMID: 29888397 DOI: 10.1002/sim.7821] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 01/02/2018] [Revised: 03/08/2018] [Accepted: 04/19/2018] [Indexed: 11/07/2022]
Abstract
Integrative analysis of high dimensional omics datasets has been studied by many authors in recent years. By incorporating prior known relationships among the variables, these analyses have been successful in elucidating the relationships between different sets of omics data. In this article, our goal is to identify important relationships between genomic expression and cytokine data from a human immunodeficiency virus vaccine trial. We proposed a flexible partial least squares technique, which incorporates group and subgroup structure in the modelling process. Our new method accounts for both grouping of genetic markers (eg, gene sets) and temporal effects. The method generalises existing sparse modelling techniques in the partial least squares methodology and establishes theoretical connections to variable selection methods for supervised and unsupervised problems. Simulation studies are performed to investigate the performance of our methods over alternative sparse approaches. Our R package sgspls is available at https://github.com/matt-sutton/sgspls.
Affiliation(s)
- Matthew Sutton
- ARC Centre of Excellence for Mathematical and Statistical Frontiers, Queensland University of Technology, Brisbane, Australia
| | - Rodolphe Thiébaut
- Inria, SISTM, Talence and Inserm, U1219 Bordeaux University, Bordeaux, France
- Vaccine Research Institute, Creteil, France
| | - Benoît Liquet
- ARC Centre of Excellence for Mathematical and Statistical Frontiers, Queensland University of Technology, Brisbane, Australia
- Université de Pau et des Pays de l'Adour, Laboratoire de Mathematiques et de leurs Applications, UMR CNRS 5142, Pau, France
| |
|
15
|
Zhai J, Kim J, Knox KS, Twigg HL, Zhou H, Zhou JJ. Variance Component Selection With Applications to Microbiome Taxonomic Data. Front Microbiol 2018; 9:509. [PMID: 29643839 PMCID: PMC5883493 DOI: 10.3389/fmicb.2018.00509] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 11/15/2017] [Accepted: 03/06/2018] [Indexed: 12/21/2022] [Open Access]
Abstract
High-throughput sequencing technology has enabled population-based studies of the role of the human microbiome in disease etiology and exposure response. Microbiome data are summarized as counts or composition of the bacterial taxa at different taxonomic levels. An important problem is to identify the bacterial taxa that are associated with a response. One method is to test the association of a specific taxon with phenotypes in a linear mixed effect model, which incorporates phylogenetic information among bacterial communities. Another type of approach considers all taxa in a joint model and achieves selection via a penalization method, which ignores phylogenetic information. In this paper, we consider regression analysis by treating bacterial taxa at different levels as multiple random effects. For each taxon, a kernel matrix is calculated based on distance measures in the phylogenetic tree and acts as one variance component in the joint model. Then taxonomic selection is achieved by the lasso (least absolute shrinkage and selection operator) penalty on variance components. Our method integrates biological information into the variable selection problem and greatly improves selection accuracies. Simulation studies demonstrate the superiority of our methods versus existing methods, for example, group-lasso. Finally, we apply our method to a longitudinal microbiome study of Human Immunodeficiency Virus (HIV) infected patients. We implement our method using the high performance computing language Julia. Software and detailed documentation are freely available at https://github.com/JingZhai63/VCselection.
Affiliation(s)
- Jing Zhai
- Department of Epidemiology and Biostatistics, University of Arizona, Tucson, AZ, United States
| | - Juhyun Kim
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, United States
| | - Kenneth S Knox
- Division of Pulmonary, Allergy, Critical Care, and Sleep Medicine, Department of Medicine, University of Arizona, Tucson, AZ, United States
| | - Homer L Twigg
- Division of Pulmonary, Critical Care, Sleep, and Occupational Medicine, Indiana University Medical Center, Indianapolis, IN, United States
| | - Hua Zhou
- Department of Biostatistics, University of California, Los Angeles, Los Angeles, CA, United States
| | - Jin J Zhou
- Department of Epidemiology and Biostatistics, University of Arizona, Tucson, AZ, United States
| |
|
16
|
Affiliation(s)
- Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, New Haven, CT
| |
|
17
|
Affiliation(s)
- Francis K. C. Hui
- Mathematical Sciences Institute, The Australian National University, Canberra, ACT, Australia
| | - Samuel Müller
- Mathematical Sciences Institute, The Australian National University, Canberra, Australia
| | - A. H. Welsh
- School of Mathematics and Statistics, University of Sydney, Sydney, Australia
| |
|
18
|
Wang T, Zhao H. Structured subcomposition selection in regression and its application to microbiome data analysis. Ann Appl Stat 2017. [DOI: 10.1214/16-aoas1017] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.1] [Indexed: 11/19/2022]
|
19
|
Wang T, Zhao H. A Dirichlet-tree multinomial regression model for associating dietary nutrients with gut microorganisms. Biometrics 2017; 73:792-801. [PMID: 28112797 DOI: 10.1111/biom.12654] [Citation(s) in RCA: 27] [Impact Index Per Article: 3.9] [Received: 01/01/2016] [Revised: 12/01/2016] [Accepted: 12/01/2016] [Indexed: 12/22/2022]
Abstract
Understanding the factors that alter the composition of the human microbiota may help personalized healthcare strategies and therapeutic drug targets. In many sequencing studies, microbial communities are characterized by a list of taxa, their counts, and their evolutionary relationships represented by a phylogenetic tree. In this article, we consider an extension of the Dirichlet multinomial distribution, called the Dirichlet-tree multinomial distribution, for multivariate, over-dispersed, and tree-structured count data. To address the relationships between these counts and a set of covariates, we propose the Dirichlet-tree multinomial regression model for which we develop a penalized likelihood method for estimating parameters and selecting covariates. For efficient optimization, we adopt the accelerated proximal gradient approach. Simulation studies are presented to demonstrate the good performance of the proposed procedure. An analysis of a data set relating dietary nutrients with bacterial counts is used to show that the incorporation of the tree structure into the model helps increase the prediction power.
Affiliation(s)
- Tao Wang
- Department of Bioinformatics and Biostatistics, Shanghai Jiao Tong University, Shanghai, China; SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China
| | - Hongyu Zhao
- Department of Biostatistics, Yale University, New Haven, Connecticut, U.S.A.; SJTU-Yale Joint Center for Biostatistics, Shanghai Jiao Tong University, Shanghai, China
| |
|
20
|
Garcia TP, Müller S. Cox regression with exclusion frequency-based weights to identify neuroimaging markers relevant to Huntington’s disease onset. Ann Appl Stat 2016; 10:2130-2156. [DOI: 10.1214/16-aoas967] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 11/19/2022]
|
21
|
Modeling the Cholesky factors of covariance matrices of multivariate longitudinal data. J Multivar Anal 2016. [DOI: 10.1016/j.jmva.2015.11.014] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.1] [Indexed: 12/17/2022]
|
22
|
Liquet B, Lafaye de Micheaux P, Hejblum BP, Thiébaut R. Group and sparse group partial least square approaches applied in genomics context. Bioinformatics 2015; 32:35-42. [DOI: 10.1093/bioinformatics/btv535] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 03/14/2015] [Accepted: 09/03/2015] [Indexed: 01/07/2023] [Open Access]
|
23
|
Shankar J, Szpakowski S, Solis NV, Mounaud S, Liu H, Losada L, Nierman WC, Filler SG. A systematic evaluation of high-dimensional, ensemble-based regression for exploring large model spaces in microbiome analyses. BMC Bioinformatics 2015; 16:31. [PMID: 25638274 PMCID: PMC4339743 DOI: 10.1186/s12859-015-0467-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.6] [Received: 07/08/2014] [Accepted: 01/15/2015] [Indexed: 01/09/2023] [Open Access]
Abstract
BACKGROUND: Microbiome studies incorporate next-generation sequencing to obtain profiles of microbial communities. Data generated from these experiments are high-dimensional with a rich correlation structure but modest sample sizes. A statistical model that utilizes these microbiome profiles to explain a clinical or biological endpoint needs to tackle high-dimensionality resulting from the very large space of variable configurations. Ensemble models are a class of approaches that can address high-dimensionality by aggregating information across large model spaces. Although such models are popular in fields as diverse as economics and genetics, their performance on microbiome data has been largely unexplored. RESULTS: We developed a simulation framework that accurately captures the constraints of experimental microbiome data. Using this setup, we systematically evaluated a selection of both frequentist and Bayesian regression modeling ensembles. These are represented by variants of stability selection in conjunction with elastic net and spike-and-slab Bayesian model averaging (BMA), respectively. BMA ensembles that explore a larger space of models relative to stability selection variants performed better and had lower variability across simulations. However, stability selection ensembles were able to match the performance of BMA in scenarios of low sparsity where several variables had large regression coefficients. CONCLUSIONS: Given a microbiome dataset of interest, we present a methodology to generate simulated data that closely mimics its characteristics in a manner that enables meaningful evaluation of analytical strategies. Our evaluation demonstrates that the largest ensembles yield the strongest performance on microbiome data with modest sample sizes and high-dimensional measurements. We also demonstrate the ability of these ensembles to identify microbiome signatures that are associated with opportunistic Candida albicans colonization during antibiotic exposure. As the focus of microbiome research evolves from pilot to translational studies, we anticipate that our strategy will aid investigators in making evaluation-based decisions for selecting appropriate analytical methods.
Affiliation(s)
- Jyoti Shankar
- J. Craig Venter Institute, 9704, Medical Center Drive, Rockville, Maryland, 20850, US.
| | - Sebastian Szpakowski
- J. Craig Venter Institute, 9704, Medical Center Drive, Rockville, Maryland, 20850, US.
| | - Norma V Solis
- Los Angeles Biomedical Research Institute at Harbor, UCLA Medical Center, 1124 West Carson Street, Torrance, California, 90509, US.
| | - Stephanie Mounaud
- J. Craig Venter Institute, 9704, Medical Center Drive, Rockville, Maryland, 20850, US.
| | - Hong Liu
- Los Angeles Biomedical Research Institute at Harbor, UCLA Medical Center, 1124 West Carson Street, Torrance, California, 90509, US.
| | - Liliana Losada
- J. Craig Venter Institute, 9704, Medical Center Drive, Rockville, Maryland, 20850, US.
| | - William C Nierman
- J. Craig Venter Institute, 9704, Medical Center Drive, Rockville, Maryland, 20850, US.
| | - Scott G Filler
- Los Angeles Biomedical Research Institute at Harbor, UCLA Medical Center, 1124 West Carson Street, Torrance, California, 90509, US.
- David Geffen School of Medicine, University of California at Los Angeles, California, 90095, US.
| |
|
24
|
Zhang Q, Abel H, Wells A, Lenzini P, Gomez F, Province MA, Templeton AA, Weinstock GM, Salzman NH, Borecki IB. Selection of models for the analysis of risk-factor trees: leveraging biological knowledge to mine large sets of risk factors with application to microbiome data. Bioinformatics 2015; 31:1607-13. [PMID: 25568281 DOI: 10.1093/bioinformatics/btu855] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Received: 08/26/2014] [Accepted: 12/23/2014] [Indexed: 12/29/2022]
Abstract
MOTIVATION: Establishment of a statistical association between microbiome features and clinical outcomes is of growing interest because of the potential for yielding insights into biological mechanisms and pathogenesis. Extracting microbiome features that are relevant for a disease is challenging, and existing variable selection methods are limited due to the large number of risk factor variables from microbiome sequence data and their complex biological structure. RESULTS: We propose a tree-based scanning method, Selection of Models for the Analysis of Risk factor Trees (referred to as SMART-scan), for identifying taxonomic groups that are associated with a disease or trait. SMART-scan is a model selection technique that uses a predefined taxonomy to organize the large pool of possible predictors into optimized groups, and hierarchically searches and determines variable groups for association testing. We investigate the statistical properties of SMART-scan through simulations, in comparison to a regular single-variable analysis and three commonly used variable selection methods: stepwise regression, least absolute shrinkage and selection operator (LASSO), and classification and regression tree (CART). When there are taxonomic group effects in the data, SMART-scan can significantly increase power by using bacterial taxonomic information to split large numbers of variables into groups. Through an application to microbiome data from a vervet monkey diet experiment, we demonstrate that SMART-scan can identify important phenotype-associated taxonomic features missed by single-variable analysis, stepwise regression, LASSO and CART.
Affiliation(s)
- Qunyuan Zhang
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Haley Abel
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Alan Wells
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Petra Lenzini
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Felicia Gomez
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Michael A Province
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Alan A Templeton
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - George M Weinstock
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Nita H Salzman
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| | - Ingrid B Borecki
- Division of Statistical Genomics, Washington University School of Medicine, St. Louis, MO, USA, Department of Biology, Washington University, St. Louis, MO, USA, The Jackson Laboratory for Genomic Medicine, Farmington, CT, USA and Department of Pediatrics, Medical College of Wisconsin, Milwaukee, WI, USA
| |
|