1
Morton JT, Marotz C, Washburne A, Silverman J, Zaramela LS, Edlund A, Zengler K, Knight R. Establishing microbial composition measurement standards with reference frames. Nat Commun 2019; 10:2719. [PMID: 31222023 PMCID: PMC6586903 DOI: 10.1038/s41467-019-10656-5] [Citation(s) in RCA: 328] [Impact Index Per Article: 65.6] [Received: 03/25/2019] [Accepted: 05/14/2019] [Indexed: 12/30/2022] Open
Abstract
Differential abundance analysis is controversial throughout microbiome research. Gold standard approaches require laborious measurements of total microbial load, or absolute number of microorganisms, to accurately determine taxonomic shifts. Therefore, most studies rely on relative abundance data. Here, we demonstrate common pitfalls in comparing relative abundance across samples and identify two solutions that reveal microbial changes without the need to estimate total microbial load. We define the notion of "reference frames", which provide deep intuition about the compositional nature of microbiome data. In an oral time series experiment, reference frames alleviate false positives and produce consistent results on both raw and cell-count normalized data. Furthermore, reference frames identify consistent, differentially abundant microbes previously undetected in two independent published datasets from subjects with atopic dermatitis. These methods allow reassessment of published relative abundance data to reveal reproducible microbial changes from standard sequencing output without the need for new assays.
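The pitfall described in this abstract can be illustrated with a small sketch. It assumes, for illustration only, that a reference frame amounts to choosing one taxon as the denominator and comparing log-ratios against it; the taxon names and cell counts below are invented, and this is not the authors' code.

```python
import math


def relative_abundance(counts):
    """Convert absolute counts into proportions that sum to 1."""
    total = sum(counts.values())
    return {taxon: n / total for taxon, n in counts.items()}


def log_ratio(abundances, taxon, reference):
    """Log-ratio of one taxon against a chosen reference taxon."""
    return math.log(abundances[taxon] / abundances[reference])


# Hypothetical absolute cell counts: only taxon_c grows between time points.
before = {"taxon_a": 100, "taxon_b": 100, "taxon_c": 100}
after = {"taxon_a": 100, "taxon_b": 100, "taxon_c": 400}

rel_before = relative_abundance(before)
rel_after = relative_abundance(after)

# The proportion of taxon_a falls although its absolute count is unchanged.
shift_a = rel_after["taxon_a"] - rel_before["taxon_a"]

# Change in the taxon_c / taxon_b log-ratio, from absolute counts and from
# proportions. The unknown total load cancels in the ratio.
lr_abs = log_ratio(after, "taxon_c", "taxon_b") - log_ratio(before, "taxon_c", "taxon_b")
lr_rel = log_ratio(rel_after, "taxon_c", "taxon_b") - log_ratio(rel_before, "taxon_c", "taxon_b")

print(f"apparent change in taxon_a proportion: {shift_a:.3f}")
print(f"log-ratio change (absolute counts):    {lr_abs:.3f}")
print(f"log-ratio change (proportions):        {lr_rel:.3f}")
```

In this toy case the proportion of taxon_a drops even though its absolute count does not change, whereas the log-ratio change of taxon_c against taxon_b is the same whether it is computed from proportions or from absolute counts.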
Affiliation(s)
- James T Morton
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA, 92093, USA
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, 92093, USA
- Clarisse Marotz
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA, 92093, USA
- Alex Washburne
  - Department of Microbiology and Immunology, Montana State University, Bozeman, MT, 59717, USA
- Justin Silverman
  - Program in Computational Biology and Bioinformatics, Duke University, Durham, 27708, USA
  - Medical Scientist Training Program, Duke University, Durham, 27708, USA
  - Center for Genomic and Computational Biology, Duke University, Durham, 27708, USA
- Livia S Zaramela
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA, 92093, USA
- Anna Edlund
  - J. Craig Venter Institute, Genomic Medicine Group, La Jolla, CA, 92037, USA
- Karsten Zengler
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA, 92093, USA
  - Department of Bioengineering, University of California, San Diego, La Jolla, CA, 92093, USA
  - Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
- Rob Knight
  - Department of Pediatrics, University of California, San Diego, La Jolla, CA, 92093, USA
  - Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, 92093, USA
  - Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, 92093, USA
2
Piles M, Fernandez-Lozano C, Velasco-Galilea M, González-Rodríguez O, Sánchez JP, Torrallardona D, Ballester M, Quintanilla R. Machine learning applied to transcriptomic data to identify genes associated with feed efficiency in pigs. Genet Sel Evol 2019; 51:10. [PMID: 30866799 PMCID: PMC6417084 DOI: 10.1186/s12711-019-0453-y] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Received: 06/15/2018] [Accepted: 03/04/2019] [Indexed: 12/19/2022] Open
Abstract
Background To date, the molecular mechanisms that underlie residual feed intake (RFI) in pigs are unknown. Results from different genome-wide association studies and gene expression analyses are not always consistent. The aim of this research was to use machine learning to identify genes associated with feed efficiency (FE) using transcriptomic (RNA-Seq) data from pigs that are phenotypically extreme for RFI. Methods RFI was computed by considering within-sex regression on mean metabolic body weight, average daily gain, and average backfat gain. RNA-Seq analyses were performed on liver and duodenum tissue from 32 high and 33 low RFI pigs collected at 153 d of age. Machine-learning algorithms were used to predict RFI class based on gene expression levels in liver and duodenum after adjusting for batch effects. Genes were ranked according to their contribution to the classification using the permutation accuracy importance score in an unbiased random forest (RF) algorithm based on conditional inference. Support vector machine, RF, elastic net (ENET) and nearest shrunken centroid algorithms were tested using different subsets of the top rank genes. Nested resampling for hyperparameter tuning was implemented with tenfold cross-validation in the outer and inner loops. Results The best classification was obtained with ENET using the expression of 200 genes in liver [area under the receiver operating characteristic curve (AUROC): 0.85; accuracy: 0.78] and 100 genes in duodenum (AUROC: 0.76; accuracy: 0.69). Canonical pathways and candidate genes that were previously reported as associated with FE in several species were identified. The most remarkable pathways and genes identified were NRF2-mediated oxidative stress response and aldosterone signalling in epithelial cells, the DNAJC6, DNAJC1, MAPK8, PRKD3 genes in duodenum, and melatonin degradation II, PPARα/RXRα activation, and GPCR-mediated nutrient sensing in enteroendocrine cells and SMOX, IL4I1, PRKAR2B, CLOCK and CCK genes in liver. Conclusions ML algorithms and RNA-Seq expression data were found to provide good performance for classifying pigs into high or low RFI groups. Classification was better with gene expression data from liver than from duodenum. Genes associated with FE in liver and duodenum tissue that can be used as predictive biomarkers for this trait were identified. Electronic supplementary material The online version of this article (10.1186/s12711-019-0453-y) contains supplementary material, which is available to authorized users.
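The nested resampling mentioned in the Methods of this abstract can be sketched as index bookkeeping. This is a minimal illustration, not the authors' pipeline: the function names are hypothetical, folds are built by simple interleaving without shuffling or stratification, and the sample count of 65 is taken from the 32 high and 33 low RFI pigs reported above.

```python
def k_fold_indices(items, k):
    """Yield (train, validation) lists from k interleaved folds of items."""
    folds = [items[i::k] for i in range(k)]
    for i in range(k):
        validation = folds[i]
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, validation


def nested_splits(n_samples, outer_k=10, inner_k=10):
    """Yield (outer_train, outer_test, inner_splits) for nested resampling.

    Hyperparameters would be tuned on inner_splits, which only ever see
    outer_train; outer_test is held back to estimate performance.
    """
    samples = list(range(n_samples))
    for outer_train, outer_test in k_fold_indices(samples, outer_k):
        inner_splits = list(k_fold_indices(outer_train, inner_k))
        yield outer_train, outer_test, inner_splits


# 65 samples, tenfold cross-validation in both the outer and inner loops.
splits = list(nested_splits(65, outer_k=10, inner_k=10))

for outer_train, outer_test, inner_splits in splits[:2]:
    print(len(outer_train), len(outer_test), len(inner_splits))
```

The point of the design is that no sample in an outer test fold is ever used while tuning, so the outer-loop estimate is not biased by the hyperparameter search.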
Affiliation(s)
- Miriam Piles
  - Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Torre Marimon s/n, 08140, Caldes de Montbui, Barcelona, Spain
- Carlos Fernandez-Lozano
  - Computer Science Department, University of A Coruña, Campus Elviña s/n, 15071, A Coruña, Spain
- María Velasco-Galilea
  - Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Torre Marimon s/n, 08140, Caldes de Montbui, Barcelona, Spain
- Olga González-Rodríguez
  - Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Torre Marimon s/n, 08140, Caldes de Montbui, Barcelona, Spain
- Juan Pablo Sánchez
  - Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Torre Marimon s/n, 08140, Caldes de Montbui, Barcelona, Spain
- David Torrallardona
  - Animal Nutrition Program, Institute of Agriculture and Food Research and Technology (IRTA), Mas de Bover, 43120, Constantí, Spain
- Maria Ballester
  - Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Torre Marimon s/n, 08140, Caldes de Montbui, Barcelona, Spain
- Raquel Quintanilla
  - Animal Breeding and Genetics Program, Institute of Agriculture and Food Research and Technology (IRTA), Torre Marimon s/n, 08140, Caldes de Montbui, Barcelona, Spain
3
Callahan BJ, Sankaran K, Fukuyama JA, McMurdie PJ, Holmes SP. Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses. F1000Res 2016; 5:1492. [PMID: 27508062 PMCID: PMC4955027 DOI: 10.12688/f1000research.8986.2] [Citation(s) in RCA: 445] [Impact Index Per Article: 55.6] [Accepted: 10/17/2016] [Indexed: 11/20/2022] Open
Abstract
High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or OTU composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, whether parametric or nonparametric. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests and nonparametric testing using community networks and the ggnetwork package.
Affiliation(s)
- Ben J Callahan
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
- Kris Sankaran
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
- Julia A Fukuyama
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
- Susan P Holmes
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
4
Callahan BJ, Sankaran K, Fukuyama JA, McMurdie PJ, Holmes SP. Bioconductor Workflow for Microbiome Data Analysis: from raw reads to community analyses. F1000Res 2016; 5:1492. [PMID: 27508062 DOI: 10.12688/f1000research.8986.1] [Citation(s) in RCA: 284] [Impact Index Per Article: 35.5] [Accepted: 06/14/2016] [Indexed: 11/20/2022] Open
Abstract
High-throughput sequencing of PCR-amplified taxonomic markers (like the 16S rRNA gene) has enabled a new level of analysis of complex bacterial communities known as microbiomes. Many tools exist to quantify and compare abundance levels or OTU composition of communities in different conditions. The sequencing reads have to be denoised and assigned to the closest taxa from a reference database. Common approaches use a notion of 97% similarity and normalize the data by subsampling to equalize library sizes. In this paper, we show that statistical models allow more accurate abundance estimates. By providing a complete workflow in R, we enable the user to do sophisticated downstream statistical analyses, whether parametric or nonparametric. We provide examples of using the R packages dada2, phyloseq, DESeq2, ggplot2 and vegan to filter, visualize and test microbiome data. We also provide examples of supervised analyses using random forests and nonparametric testing using community networks and the ggnetwork package.
Affiliation(s)
- Ben J Callahan
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
- Kris Sankaran
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
- Julia A Fukuyama
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
- Susan P Holmes
  - Statistics Department, Stanford University, Stanford, CA, 94305, USA
5
Liu J, Kandasamy S, Zhang J, Kirby CW, Karakach T, Hafting J, Critchley AT, Evans F, Prithiviraj B. Prebiotic effects of diet supplemented with the cultivated red seaweed Chondrus crispus or with fructo-oligo-saccharide on host immunity, colonic microbiota and gut microbial metabolites. Altern Ther Health Med 2015; 15:279. [PMID: 26271359 PMCID: PMC4535385 DOI: 10.1186/s12906-015-0802-5] [Citation(s) in RCA: 50] [Impact Index Per Article: 5.6] [Received: 09/24/2014] [Accepted: 08/04/2015] [Indexed: 01/26/2023]
Abstract
Background Gastrointestinal microbial communities are diverse and are composed of both beneficial and pathogenic groups. Prebiotics, such as digestion-resistant fibers, influence the composition of gut microbiota, and can contribute to the improvement of host health. The red seaweed Chondrus crispus is rich in dietary fiber and oligosaccharides; however, its prebiotic potential has not been studied to date. Methods Prebiotic effects were investigated with weaning rats fed a cultivated C. crispus-supplemented diet. Comparison standards included a fructo-oligo-saccharide (FOS) diet and a basal diet. The colonic microbiome was profiled with a 16S rRNA sequencing-based PhyloChip array. Concentrations of short chain fatty acids (SCFAs) in the faecal samples were determined by gas chromatography with a flame ionization detector (GC-FID) analysis. Immunoglobulin levels in the blood plasma were analyzed with an enzyme-linked immunosorbent assay (ELISA). Histo-morphological parameters of the proximal colon tissue were characterized by hematoxylin and eosin (H&E) staining. Results PhyloChip array analysis indicated differing microbiome composition among the diet-supplemented and the control groups, with the C. crispus group (2.5% supplementation) showing larger separation from the control than other treatment groups. In the 2.5% C. crispus group, the population of beneficial bacteria such as Bifidobacterium breve increased (4.9-fold, p = 0.001), and the abundance of pathogenic species such as Clostridium septicum and Streptococcus pneumoniae decreased. Higher concentrations of short chain fatty acids (i.e., gut microbial metabolites), including acetic, propionic and butyric acids, were found in faecal samples of the C. crispus-fed rats. Furthermore, both C. crispus and FOS supplemented rats showed significant improvements in proximal colon histo-morphology. Higher faecal moisture was noted in the 2.5% C. crispus group, and elevated plasma immunoglobulin (IgA and IgG) levels were observed in the 0.5% C. crispus group, as compared to the basal feed group. Conclusions The results suggest multiple prebiotic effects, such as influencing the composition of gut microbial communities, improvement of gut health and immune modulation in rats supplemented with cultivated C. crispus. Electronic supplementary material The online version of this article (doi:10.1186/s12906-015-0802-5) contains supplementary material, which is available to authorized users.
6
Reese SE, Archer KJ, Therneau TM, Atkinson EJ, Vachon CM, de Andrade M, Kocher JPA, Eckel-Passow JE. A new statistic for identifying batch effects in high-throughput genomic data that uses guided principal component analysis. Bioinformatics 2013; 29:2877-83. [PMID: 23958724 PMCID: PMC3810845 DOI: 10.1093/bioinformatics/btt480] [Citation(s) in RCA: 86] [Impact Index Per Article: 7.8] [Received: 11/16/2012] [Revised: 07/03/2013] [Accepted: 08/14/2013] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION Batch effects are due to probe-specific systematic variation between groups of samples (batches) resulting from experimental features that are not of biological interest. Principal component analysis (PCA) is commonly used as a visual tool to determine whether batch effects exist after applying a global normalization method. However, PCA yields linear combinations of the variables that contribute maximum variance and thus will not necessarily detect batch effects if they are not the largest source of variability in the data. RESULTS We present an extension of PCA to quantify the existence of batch effects, called guided PCA (gPCA). We describe a test statistic that uses gPCA to test whether a batch effect exists. We apply our proposed test statistic derived using gPCA to simulated data and to two copy number variation case studies: the first study consisted of 614 samples from a breast cancer family study using Illumina Human 660 bead-chip arrays, whereas the second case study consisted of 703 samples from a family blood pressure study that used Affymetrix SNP Array 6.0. We demonstrate that our statistic has good statistical properties and is able to identify significant batch effects in two copy number variation case studies. CONCLUSION We developed a new statistic that uses gPCA to identify whether batch effects exist in high-throughput genomic data. Although our examples pertain to copy number data, gPCA is general and can be used on other data types as well. AVAILABILITY AND IMPLEMENTATION The gPCA R package (Available via CRAN) provides functionality and data to perform the methods in this article. CONTACT reesese@vcu.edu
Affiliation(s)
- Sarah E Reese
  - Department of Biostatistics, Biostatistics Shared Resource Core, VCU Massey Cancer Center, Virginia Commonwealth University, Richmond, VA 23284, USA, Division of Biomedical Statistics and Informatics and Division of Epidemiology, Department of Health Sciences Research, Mayo Clinic, Rochester, MN 55905, USA
7
McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 2013; 8:e61217. [PMID: 23630581 PMCID: PMC3632530 DOI: 10.1371/journal.pone.0061217] [Citation(s) in RCA: 9353] [Impact Index Per Article: 850.3] [Received: 10/17/2012] [Accepted: 03/06/2013] [Indexed: 12/20/2022] Open
Abstract
Background The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data. Results Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research. Conclusions The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.
8
McMurdie PJ, Holmes S. phyloseq: an R package for reproducible interactive analysis and graphics of microbiome census data. PLoS One 2013. [PMID: 23630581 DOI: 10.1371/journal.pone.0061217] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 04/20/2023] Open
Abstract
BACKGROUND The analysis of microbial communities through DNA sequencing brings many challenges: the integration of different types of data with methods from ecology, genetics, phylogenetics, multivariate statistics, visualization and testing. With the increased breadth of experimental designs now being pursued, project-specific statistical analyses are often needed, and these analyses are often difficult (or impossible) for peer researchers to independently reproduce. The vast majority of the requisite tools for performing these analyses reproducibly are already implemented in R and its extensions (packages), but with limited support for high throughput microbiome census data. RESULTS Here we describe a software project, phyloseq, dedicated to the object-oriented representation and analysis of microbiome census data in R. It supports importing data from a variety of common formats, as well as many analysis techniques. These include calibration, filtering, subsetting, agglomeration, multi-table comparisons, diversity analysis, parallelized Fast UniFrac, ordination methods, and production of publication-quality graphics; all in a manner that is easy to document, share, and modify. We show how to apply functions from other R packages to phyloseq-represented data, illustrating the availability of a large number of open source analysis techniques. We discuss the use of phyloseq with tools for reproducible research, a practice common in other fields but still rare in the analysis of highly parallel microbiome census data. We have made available all of the materials necessary to completely reproduce the analysis and figures included in this article, an example of best practices for reproducible research. CONCLUSIONS The phyloseq project for R is a new open-source software package, freely available on the web from both GitHub and Bioconductor.
Affiliation(s)
- Paul J McMurdie
  - Department of Statistics, Stanford University, Stanford, California, United States of America
9
McMurdie PJ, Holmes S. Phyloseq: a Bioconductor package for handling and analysis of high-throughput phylogenetic sequence data. Pac Symp Biocomput 2012:235-46. [PMID: 22174279 PMCID: PMC3357092] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
We present a detailed description of a new Bioconductor package, phyloseq, for integrated data and analysis of taxonomically-clustered phylogenetic sequencing data in conjunction with related data types. The phyloseq package integrates abundance data, phylogenetic information and covariates so that exploratory transformations, plots, and confirmatory testing and diagnostic plots can be carried out seamlessly. The package is built following the S4 object-oriented framework of the R language so that once the data have been input the user can easily transform, plot and analyze the data. We present some examples that highlight the methods and the ease with which we can leverage existing packages.
Affiliation(s)
- Paul J McMurdie
  - Statistics Department, Stanford University, Stanford, CA 94305, USA
- Susan Holmes
  - Statistics Department, Stanford University, Stanford, CA 94305, USA
10
Paliy O, Agans R. Application of phylogenetic microarrays to interrogation of human microbiota. FEMS Microbiol Ecol 2011; 79:2-11. [PMID: 22092522 DOI: 10.1111/j.1574-6941.2011.01222.x] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 06/24/2011] [Revised: 09/09/2011] [Accepted: 09/28/2011] [Indexed: 12/22/2022] Open
Abstract
Human-associated microbiota is recognized to play vital roles in maintaining host health, and it is implicated in many disease states. While the initial surge in the profiling of these microbial communities was achieved with Sanger and next-generation sequencing, many oligonucleotide microarrays have also been developed recently for this purpose. Containing probes complementary to small ribosomal subunit RNA gene sequences of community members, such phylogenetic arrays provide direct quantitative comparisons of microbiota composition among samples and between sample groups. Some of the developed microarrays including PhyloChip, Microbiota Array, and HITChip can simultaneously measure the presence and abundance of hundreds and thousands of phylotypes in a single sample. This review describes the currently available phylogenetic microarrays that can be used to analyze human microbiota, delineates the approaches for the optimization of microarray use, and provides examples of recent findings based on microarray interrogation of human-associated microbial communities.
Affiliation(s)
- Oleg Paliy
  - Department of Biochemistry and Molecular Biology, Wright State University, Dayton, OH 45435, USA