1
Zschaubitz E, Schröder H, Glackin CC, Vogel L, Labrenz M, Sperlea T. A benchmark analysis of feature selection and machine learning methods for environmental metabarcoding datasets. Comput Struct Biotechnol J 2025; 27:1636-1647. [PMID: 40322584 PMCID: PMC12049816 DOI: 10.1016/j.csbj.2025.04.017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2024] [Revised: 04/10/2025] [Accepted: 04/11/2025] [Indexed: 05/08/2025] Open
Abstract
Next-Generation Sequencing methods like DNA metabarcoding enable the generation of large community composition datasets and have grown instrumental in many branches of ecology in recent years. However, the sparsity, compositionality, and high dimensionality of metabarcoding datasets pose challenges in data analysis. In theory, feature selection methods improve the analyzability of eDNA metabarcoding datasets by identifying a subset of informative taxa that are relevant for a certain task and discarding those that are redundant or irrelevant. However, general guidelines on selecting a feature selection method for application to a given setting are lacking. Here, we report a comparison of feature selection methods in a supervised machine learning setup across 13 environmental metabarcoding datasets with differing characteristics. We evaluate workflows that consist of data preprocessing, feature selection and a machine learning model by their ability to capture the ecological relationship between the microbial community composition and environmental parameters. Our results demonstrate that, while the optimal feature selection approach depends on dataset characteristics, feature selection is more likely to impair model performance than to improve it for tree ensemble models like Random Forests. Furthermore, our results show that calculating relative counts impairs model performance, which suggests that novel methods to combat the compositionality of metabarcoding data are required.
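The compositionality issue mentioned in the abstract can be illustrated with a minimal Python sketch. It is not taken from the paper: the sample counts, the function names `relative_abundance` and `clr`, and the pseudocount of 0.5 are assumptions chosen for illustration. The sketch shows how converting counts to relative abundances makes the value of one taxon depend on all the others, and how a centered log-ratio (CLR) transform, one commonly used alternative, is computed.

```python
import math


def relative_abundance(counts):
    """Divide each count by the sample total (the closure operation)."""
    total = sum(counts)
    if total == 0:
        raise ValueError("sample has no reads")
    return [c / total for c in counts]


def clr(counts, pseudocount=0.5):
    """Centered log-ratio transform.

    The pseudocount is an assumption used here to avoid log(0);
    other zero-handling strategies exist.
    """
    logs = [math.log(c + pseudocount) for c in counts]
    mean_log = sum(logs) / len(logs)
    return [x - mean_log for x in logs]


# Two hypothetical samples: taxa A and B have identical counts,
# only taxon C differs.
sample_1 = [100, 50, 50]
sample_2 = [100, 50, 850]

rel_1 = relative_abundance(sample_1)
rel_2 = relative_abundance(sample_2)

# Taxon A's relative abundance changes although its count did not,
# while the ratio A/B is the same in both samples.
print("relative abundance of A:", rel_1[0], "vs", rel_2[0])
print("ratio A/B:", rel_1[0] / rel_1[1], "vs", rel_2[0] / rel_2[1])
print("CLR of sample 1:", clr(sample_1))
```

Ratios between taxa survive the closure, which is why log-ratio representations are often discussed as a way to handle compositional data.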
Affiliation(s)
- Erik Zschaubitz
- Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany
- Conor Christopher Glackin
- Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany
- Lukas Vogel
- Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany
- Matthias Labrenz
- Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany
- Theodor Sperlea
- Department of Biological Oceanography, Leibniz Institute for Baltic Sea Research, Seestraße 15, Rostock, 18119, Germany
2
Chen C, Murphy TE, Speiser JL, Bandeen-Roche K, Allore H, Travison TG, Griswold M, Shardell M. Gerontologic Biostatistics and Data Science: Aging Research in the Era of Big Data. J Gerontol A Biol Sci Med Sci 2024; 80:glae269. [PMID: 39500720 PMCID: PMC11683485 DOI: 10.1093/gerona/glae269] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/21/2024] [Indexed: 01/04/2025] Open
Abstract
Introduced in 2010, the subdiscipline of gerontologic biostatistics was conceptualized to address the specific challenges of analyzing data from clinical research studies involving older adults. Since then, the evolving technological landscape has led to a proliferation of advancements in biostatistics and other data sciences that have significantly influenced the practice of gerontologic research, including studies beyond the clinic. Data science is the field at the intersection of statistics and computer science, and although the term "data science" was not widely used in 2010, the field has quickly made palpable effects on gerontologic research. In this Review in Depth, we describe multiple advancements of biostatistics and data science that have been particularly impactful. Moreover, we propose the subdiscipline of "gerontologic biostatistics and data science," which subsumes gerontologic biostatistics into a more encompassing practice. Prominent gerontologic biostatistics and data science advancements that we discuss herein include cutting-edge methods in experimental design and causal inference, adaptations of machine learning, the rigorous quantification of deep phenotypic measurement, and analysis of high-dimensional -omics data. We additionally describe the need for integration of information from multiple studies and propose strategies to foster reproducibility, replicability, and open science. Lastly, we provide information on software resources for gerontologic biostatistics and data science practitioners to apply these approaches to their own work and propose areas where further advancement is needed. The methodological topics reviewed here aim to enhance data-rich research on aging and foster the next generation of gerontologic researchers.
Affiliation(s)
- Chixiang Chen
- Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, USA
- Department of Neurosurgery, University of Maryland School of Medicine, Baltimore, Maryland, USA
- Terrence E Murphy
- Department of Public Health Sciences, Penn State College of Medicine, Hershey, Pennsylvania, USA
- Jaime Lynn Speiser
- Department of Biostatistics and Data Science, Wake Forest University School of Medicine, Winston-Salem, North Carolina, USA
- Karen Bandeen-Roche
- Departments of Biostatistics, Medicine and Nursing, Johns Hopkins University, Baltimore, Maryland, USA
- Heather Allore
- Department of Internal Medicine, Yale School of Medicine and Department of Biostatistics, Yale School of Public Health, New Haven, Connecticut, USA
- Thomas G Travison
- Marcus Institute for Aging Research, Hebrew Senior Life, Boston, Massachusetts, USA
- Division of Gerontology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, Massachusetts, USA
- Michael Griswold
- Departments of Medicine and Data Science, University of Mississippi Medical Center, Jackson, Mississippi, USA
- Michelle Shardell
- Department of Epidemiology and Public Health, University of Maryland School of Medicine, Baltimore, Maryland, USA
- Institute for Genome Sciences, University of Maryland School of Medicine, Baltimore, Maryland, USA
3
Ioannou M, Borkent J, Andreu-Sánchez S, Wu J, Fu J, Sommer IEC, Haarman BCM. Reproducible gut microbial signatures in bipolar and schizophrenia spectrum disorders: A metagenome-wide study. Brain Behav Immun 2024; 121:165-175. [PMID: 39032544 DOI: 10.1016/j.bbi.2024.07.009] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/12/2024] [Revised: 05/30/2024] [Accepted: 07/15/2024] [Indexed: 07/23/2024] Open
Abstract
BACKGROUND: Numerous studies report gut microbiome variations in bipolar disorder (BD) and schizophrenia spectrum disorders (SSD) compared to healthy individuals, though there is limited consensus on which specific bacteria are associated with these disorders.
METHODS: In this study, we performed a comprehensive metagenomic shotgun sequencing analysis in 103 Dutch patients with BD/SSD and 128 healthy controls matched for age, sex, body mass index and income, while accounting for diet quality, transit time and technical confounders. To assess the replicability of the findings, we used two validation cohorts (total n = 203), including participants from a distinct population with a different metagenomic isolation protocol.
RESULTS: The gut microbiome of the patients had a significantly different β-diversity, but not α-diversity or neuroactive potential, compared to healthy controls. Initially, twenty-six bacterial taxa were identified as differentially abundant in patients. Among these, the previously reported genera Lachnoclostridium and Eggerthella were replicated in the validation cohorts. Employing the CoDaCoRe learning algorithm, we identified two bacterial balances specific to BD/SSD, which demonstrated an area under the receiver operating characteristic curve (AUC) of 0.77 in the test dataset. These balances were replicated in the validation cohorts and showed a positive association with the severity of psychiatric symptoms and antipsychotic use. Last, we showed a positive association between the relative abundance of Klebsiella and Klebsiella pneumoniae with antipsychotic use and between Anaeromassilibacillus and lithium use.
CONCLUSIONS: Our findings suggest that microbial balances could be a reproducible method for identifying BD/SSD-specific microbial signatures, with potential diagnostic and prognostic applications. Notably, Lachnoclostridium and Eggerthella emerge as frequently occurring bacteria in BD/SSD. Last, our study reaffirms the previously established link between Klebsiella and antipsychotic medication use and identifies a novel association between Anaeromassilibacillus and lithium use.
Affiliation(s)
- Magdalini Ioannou
- University of Groningen and University Medical Center Groningen, Department of Psychiatry, Groningen, the Netherlands; University of Groningen and University Medical Center Groningen, Department of Biomedical Sciences, Groningen, the Netherlands
- Jenny Borkent
- University of Groningen and University Medical Center Groningen, Department of Biomedical Sciences, Groningen, the Netherlands
- Sergio Andreu-Sánchez
- University of Groningen and University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands; University of Groningen and University Medical Center Groningen, Department of Pediatrics, Groningen, the Netherlands
- Jiafei Wu
- University of Groningen and University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands
- Jingyuan Fu
- University of Groningen and University Medical Center Groningen, Department of Genetics, Groningen, the Netherlands; University of Groningen and University Medical Center Groningen, Department of Pediatrics, Groningen, the Netherlands
- Iris E C Sommer
- University of Groningen and University Medical Center Groningen, Department of Biomedical Sciences, Groningen, the Netherlands
- Bartholomeus C M Haarman
- University of Groningen and University Medical Center Groningen, Department of Psychiatry, Groningen, the Netherlands
4
Swarte JC, Zhang S, Nieuwenhuis LM, Gacesa R, Knobbe TJ, De Meijer VE, Damman K, Verschuuren EAM, Gan TC, Fu J, Zhernakova A, Harmsen HJM, Blokzijl H, Bakker SJL, Björk JR, Weersma RK. Multiple indicators of gut dysbiosis predict all-cause and cause-specific mortality in solid organ transplant recipients. Gut 2024; 73:1650-1661. [PMID: 38955400 DOI: 10.1136/gutjnl-2023-331441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2023] [Accepted: 05/12/2024] [Indexed: 07/04/2024]
Abstract
OBJECTIVE: Gut microbiome composition is associated with multiple diseases, but relatively little is known about its relationship with long-term outcome measures. While gut dysbiosis has been linked to mortality risk in the general population, the relationship with overall survival in specific diseases has not been extensively studied. In the current study, we present results from an in-depth analysis of the relationship between gut dysbiosis and all-cause and cause-specific mortality in the setting of solid organ transplant recipients (SOTR).
DESIGN: We analysed 1337 metagenomes derived from faecal samples of 766 kidney, 334 liver, 170 lung and 67 heart transplant recipients who are part of the TransplantLines Biobank and Cohort, a prospective cohort study including extensive phenotype data with 6.5 years of follow-up. To analyse gut dysbiosis, we included an additional 8208 metagenomes from the general population of the same geographical area (northern Netherlands). Multivariable Cox regression and a machine learning algorithm were used to analyse the association between multiple indicators of gut dysbiosis, including individual species abundances, and all-cause and cause-specific mortality.
RESULTS: We identified two patterns representing overall microbiome community variation that were associated with both all-cause and cause-specific mortality. The gut microbiome distance between each transplant recipient and the average of the general population was associated with all-cause mortality and death from infection, malignancy and cardiovascular disease. A multivariable Cox regression on individual species abundances identified 23 bacterial species that were associated with all-cause mortality, and by applying a machine learning algorithm, we identified a balance (a type of log-ratio) consisting of 19 out of the 23 species that were associated with all-cause mortality.
CONCLUSION: Gut dysbiosis is consistently associated with mortality in SOTR. Our results support the observations that gut dysbiosis is associated with long-term survival. Since our data do not allow us to infer causality, more preclinical research is needed to understand mechanisms before we can determine whether gut microbiome-directed therapies may be designed to improve long-term outcomes.
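The abstract describes a balance as a type of log-ratio. A minimal Python sketch, assuming the common definition of a balance as the scaled log-ratio of the geometric means of two groups of taxa, could look as follows. The taxon names, the abundances, and the function name `balance` are hypothetical, and this is not the algorithm or the species set used in the study.

```python
import math


def balance(abundances, numerator, denominator):
    """Scaled log-ratio of the geometric means of two groups of taxa.

    `abundances` maps taxon name -> strictly positive abundance.
    Zeros must be handled beforehand (for example with a pseudocount),
    because log(0) is undefined.
    """
    num_logs = [math.log(abundances[t]) for t in numerator]
    den_logs = [math.log(abundances[t]) for t in denominator]
    r, s = len(num_logs), len(den_logs)
    # Normalising constant of an isometric log-ratio balance.
    scale = math.sqrt(r * s / (r + s))
    # Mean of logs equals the log of the geometric mean.
    return scale * (sum(num_logs) / r - sum(den_logs) / s)


# Hypothetical sample with four taxa (relative abundances).
sample = {"taxon_a": 0.2, "taxon_b": 0.1, "taxon_c": 0.4, "taxon_d": 0.3}
numerator = ["taxon_a", "taxon_b"]
denominator = ["taxon_c", "taxon_d"]

b = balance(sample, numerator, denominator)

# The balance does not change when every abundance is multiplied by
# the same factor, so it is insensitive to sequencing depth.
scaled = {taxon: value * 1000 for taxon, value in sample.items()}
b_scaled = balance(scaled, numerator, denominator)

print("balance:", b)
print("balance after rescaling:", b_scaled)
```

A single number of this kind can then be used as a covariate in a downstream model, such as a survival regression.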
Affiliation(s)
- J Casper Swarte
- Gastroenterology and Hepatology, University Medical Centre, Groningen, Netherlands
- Shuyan Zhang
- Gastroenterology and Hepatology, University Medical Centre, Groningen, Netherlands
- Ranko Gacesa
- Gastroenterology and Hepatology, University Medical Centre, Groningen, Netherlands
- Department of Genetics, University of Groningen, University Medical Center, Groningen, Netherlands
- Tim J Knobbe
- University Medical Centre, Groningen, Netherlands
- Kevin Damman
- University Medical Centre, Groningen, Netherlands
- Tji C Gan
- University Medical Centre, Groningen, Netherlands
- Jingyuan Fu
- Department of Genetics, University Medical Center, Groningen, Netherlands
- Department of Pediatrics, University Medical Center, Groningen, Netherlands
- Hermie J M Harmsen
- Medical Microbiology, University of Groningen, University Medical Center, Groningen, Netherlands
- Johannes R Björk
- Gastroenterology and Hepatology, University Medical Centre, Groningen, Netherlands
- Rinse K Weersma
- Gastroenterology and Hepatology, University Medical Centre, Groningen, Netherlands
5
Gorman ED, Lladser ME. Interpretable metric learning in comparative metagenomics: The adaptive Haar-like distance. PLoS Comput Biol 2024; 20:e1011543. [PMID: 38768195 PMCID: PMC11142682 DOI: 10.1371/journal.pcbi.1011543] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Revised: 05/31/2024] [Accepted: 04/25/2024] [Indexed: 05/22/2024] Open
Abstract
Random forests have emerged as a promising tool in comparative metagenomics because they can predict environmental characteristics based on microbial composition in datasets where β-diversity metrics fall short of revealing meaningful relationships between samples. Nevertheless, despite this efficacy, they lack biological insight in tandem with their predictions, potentially hindering scientific advancement. To overcome this limitation, we leverage a geometric characterization of random forests to introduce a data-driven phylogenetic β-diversity metric, the adaptive Haar-like distance. This new metric assigns a weight to each internal node (i.e., split or bifurcation) of a reference phylogeny, indicating the relative importance of that node in discerning environmental samples based on their microbial composition. Alongside this, a weighted nearest-neighbors classifier, constructed using the adaptive metric, can be used as a proxy for the random forest while maintaining accuracy on par with that of the original forest and another state-of-the-art classifier, CoDaCoRe. As shown in datasets from diverse microbial environments, however, the new metric and classifier significantly enhance the biological interpretability and visualization of high-dimensional metagenomic samples.
Affiliation(s)
- Evan D. Gorman
- Department of Applied Mathematics, University of Colorado, Boulder, Colorado, United States of America
- Manuel E. Lladser
- Department of Applied Mathematics, University of Colorado, Boulder, Colorado, United States of America
6
Björk JR, Bolte LA, Maltez Thomas A, Lee KA, Rossi N, Wind TT, Smit LM, Armanini F, Asnicar F, Blanco-Miguez A, Board R, Calbet-Llopart N, Derosa L, Dhomen N, Brooks K, Harland M, Harries M, Lorigan P, Manghi P, Marais R, Newton-Bishop J, Nezi L, Pinto F, Potrony M, Puig S, Serra-Bellver P, Shaw HM, Tamburini S, Valpione S, Waldron L, Zitvogel L, Zolfo M, de Vries EGE, Nathan P, Fehrmann RSN, Spector TD, Bataille V, Segata N, Hospers GAP, Weersma RK. Longitudinal gut microbiome changes in immune checkpoint blockade-treated advanced melanoma. Nat Med 2024; 30:785-796. [PMID: 38365950 PMCID: PMC10957474 DOI: 10.1038/s41591-024-02803-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 30.0] [Received: 06/20/2023] [Accepted: 01/03/2024] [Indexed: 02/18/2024]
Abstract
Multiple clinical trials targeting the gut microbiome are being conducted to optimize treatment outcomes for immune checkpoint blockade (ICB). To improve the success of these interventions, understanding gut microbiome changes during ICB is urgently needed. Here through longitudinal microbiome profiling of 175 patients treated with ICB for advanced melanoma, we show that several microbial species-level genome bins (SGBs) and pathways exhibit distinct patterns from baseline in patients achieving progression-free survival (PFS) of 12 months or longer (PFS ≥12) versus patients with PFS shorter than 12 months (PFS <12). Out of 99 SGBs that could discriminate between these two groups, 20 were differentially abundant only at baseline, while 42 were differentially abundant only after treatment initiation. We identify five and four SGBs that had consistently higher abundances in patients with PFS ≥12 and <12 months, respectively. Constructing a log ratio of these SGBs, we find an association with overall survival. Finally, we find different microbial dynamics in different clinical contexts including the type of ICB regimen, development of immune-related adverse events and concomitant medication use. Insights into the longitudinal dynamics of the gut microbiome in association with host factors and treatment regimens will be critical for guiding rational microbiome-targeted therapies aimed at enhancing ICB efficacy.
Affiliation(s)
- Johannes R Björk
- Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Laura A Bolte
- Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Andrew Maltez Thomas
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Karla A Lee
- Department of Twin Research and Genetic Epidemiology, King's College London, London, UK
- Niccolo Rossi
- Department of Twin Research and Genetic Epidemiology, King's College London, London, UK
- Thijs T Wind
- Department of Medical Oncology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Lotte M Smit
- Department of Medical Oncology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Federica Armanini
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Francesco Asnicar
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Aitor Blanco-Miguez
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Ruth Board
- Department of Oncology, Lancashire Teaching Hospitals NHS Trust, Preston, UK
- Neus Calbet-Llopart
- Department of Dermatology, Melanoma Group, Hospital Clínic Barcelona, IDIBAPS, Universitat de Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Enfermedades Raras, Instituto de Salud Carlos III, Barcelona, Spain
- Lisa Derosa
- Gustave Roussy Cancer Center, U1015 INSERM and Oncobiome Network, University Paris Saclay, Villejuif-Grand-Paris, France
- Nathalie Dhomen
- Division of Immunology, Immunity to Infection and Respiratory Medicine, University of Manchester, Manchester, UK
- Kelly Brooks
- Division of Immunology, Immunity to Infection and Respiratory Medicine, University of Manchester, Manchester, UK
- Mark Harland
- Division of Haematology and Immunology, Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
- Mark Harries
- Department of Medical Oncology, Guys Cancer Centre, Guy's and St Thomas' NHS Trust, London, UK
- Biochemical and Molecular Genetics Department, Hospital Clínic de Barcelona and IDIBAPS, University of Barcelona, Barcelona, Spain
- Paul Lorigan
- The Christie NHS Foundation Trust, Manchester, UK
- Division of Cancer Sciences, University of Manchester, Manchester, UK
- Paolo Manghi
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Richard Marais
- Molecular Oncology Group, Cancer Research UK Manchester Institute, University of Manchester, Manchester, UK
- Julia Newton-Bishop
- Division of Haematology and Immunology, Institute of Medical Research at St. James's, University of Leeds, Leeds, UK
- Luigi Nezi
- European Institute of Oncology (Istituto Europeo di Oncologia), Milan, Italy
- Federica Pinto
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Miriam Potrony
- Centro de Investigación Biomédica en Red en Enfermedades Raras, Instituto de Salud Carlos III, Barcelona, Spain
- Biochemical and Molecular Genetics Department, Hospital Clínic de Barcelona and IDIBAPS, University of Barcelona, Barcelona, Spain
- Susana Puig
- Department of Dermatology, Melanoma Group, Hospital Clínic Barcelona, IDIBAPS, Universitat de Barcelona, Barcelona, Spain
- Centro de Investigación Biomédica en Red en Enfermedades Raras, Instituto de Salud Carlos III, Barcelona, Spain
- Heather M Shaw
- Department of Medical Oncology, Mount Vernon Cancer Centre, East and North Herts NHS Trust, Northwood, UK
- Sabrina Tamburini
- European Institute of Oncology (Istituto Europeo di Oncologia), Milan, Italy
- Sara Valpione
- Division of Immunology, Immunity to Infection and Respiratory Medicine, University of Manchester, Manchester, UK
- The Christie NHS Foundation Trust, Manchester, UK
- Levi Waldron
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Graduate School of Public Health and Health Policy, City University of New York, New York, NY, USA
- Laurence Zitvogel
- Gustave Roussy Cancer Center, U1015 INSERM and Oncobiome Network, University Paris Saclay, Villejuif-Grand-Paris, France
- Moreno Zolfo
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- Elisabeth G E de Vries
- Department of Medical Oncology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Paul Nathan
- Biochemical and Molecular Genetics Department, Hospital Clínic de Barcelona and IDIBAPS, University of Barcelona, Barcelona, Spain
- Department of Medical Oncology, Mount Vernon Cancer Centre, East and North Herts NHS Trust, Northwood, UK
- Rudolf S N Fehrmann
- Department of Medical Oncology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Tim D Spector
- Department of Twin Research and Genetic Epidemiology, King's College London, London, UK
- Véronique Bataille
- Department of Twin Research and Genetic Epidemiology, King's College London, London, UK
- Department of Dermatology, Mount Vernon Cancer Centre, Northwood, UK
- Department of Dermatology, Hemel Hempstead Hospital, West Hertfordshire NHS Trust, Hemel Hempstead, UK
- Nicola Segata
- Department of Cellular, Computational and Integrative Biology, University of Trento, Trento, Italy
- European Institute of Oncology (Istituto Europeo di Oncologia), Milan, Italy
- Geke A P Hospers
- Department of Medical Oncology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
- Rinse K Weersma
- Department of Gastroenterology and Hepatology, University of Groningen and University Medical Center Groningen, Groningen, the Netherlands
7
Marcos-Zambrano LJ, López-Molina VM, Bakir-Gungor B, Frohme M, Karaduzovic-Hadziabdic K, Klammsteiner T, Ibrahimi E, Lahti L, Loncar-Turukalo T, Dhamo X, Simeon A, Nechyporenko A, Pio G, Przymus P, Sampri A, Trajkovik V, Lacruz-Pleguezuelos B, Aasmets O, Araujo R, Anagnostopoulos I, Aydemir Ö, Berland M, Calle ML, Ceci M, Duman H, Gündoğdu A, Havulinna AS, Kaka Bra KHN, Kalluci E, Karav S, Lode D, Lopes MB, May P, Nap B, Nedyalkova M, Paciência I, Pasic L, Pujolassos M, Shigdel R, Susín A, Thiele I, Truică CO, Wilmes P, Yilmaz E, Yousef M, Claesson MJ, Truu J, Carrillo de Santa Pau E. A toolbox of machine learning software to support microbiome analysis. Front Microbiol 2023; 14:1250806. [PMID: 38075858 PMCID: PMC10704913 DOI: 10.3389/fmicb.2023.1250806] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 06/30/2023] [Accepted: 09/11/2023] [Indexed: 05/14/2025] Open
Abstract
The human microbiome has become an area of intense research due to its potential impact on human health. However, the analysis and interpretation of this data have proven to be challenging due to its complexity and high dimensionality. Machine learning (ML) algorithms can process vast amounts of data to uncover informative patterns and relationships within the data, even with limited prior knowledge. Therefore, there has been a rapid growth in the development of software specifically designed for the analysis and interpretation of microbiome data using ML techniques. These software tools incorporate a wide range of ML algorithms for clustering, classification, regression, or feature selection, to identify microbial patterns and relationships within the data and generate predictive models. This rapid development, with a constant need for new developments and integration of new features, requires efforts to compile, catalog and classify these tools to create infrastructures and services with easy, transparent, and trustable standards. Here we review the state-of-the-art for ML tools applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on ML based software and framework resources currently available for the analysis of microbiome data in humans. The aim is to support microbiologists and biomedical scientists to go deeper into specialized resources that integrate ML techniques and facilitate future benchmarking to create standards for the analysis of microbiome data. The software resources are organized based on the type of analysis they were developed for and the ML techniques they implement. A description of each software with examples of usage is provided, including comments about pitfalls and gaps in the usage of software based on ML methods in relation to microbiome data that need to be considered by developers and users. This review represents an extensive compilation to date, offering valuable insights and guidance for researchers interested in leveraging ML approaches for microbiome analysis.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Víctor Manuel López-Molina
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gül University, Kayseri, Türkiye
| | - Marcus Frohme
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
| | | | - Thomas Klammsteiner
- Department of Microbiology and Department of Ecology, University of Innsbruck, Innsbruck, Austria
| | - Eliana Ibrahimi
- Department of Biology, University of Tirana, Tirana, Albania
| | - Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
| | | | - Xhilda Dhamo
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania
| | - Andrea Simeon
- BioSense Institute, University of Novi Sad, Novi Sad, Serbia
| | - Alina Nechyporenko
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany
- Department of Systems Engineering, Kharkiv National University of Radioelectronics, Kharkiv, Ukraine
| | - Gianvito Pio
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy
| | - Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
| | - Alexia Sampri
- Victor Phillip Dahdaleh Heart and Lung Research Institute, University of Cambridge, Cambridge, United Kingdom
| | - Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
| | - Blanca Lacruz-Pleguezuelos
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| | - Ricardo Araujo
- Nephrology and Infectious Diseases R & D Group, i3S—Instituto de Investigação e Inovação em Saúde; INEB—Instituto de Engenharia Biomédica, Universidade do Porto, Porto, Portugal

- Ioannis Anagnostopoulos
- Department of Informatics, University of Piraeus, Piraeus, Greece
- Computer Science and Biomedical Informatics Department, University of Thessaly, Lamia, Greece

- Önder Aydemir
- Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Türkiye

- Magali Berland
- INRAE, MetaGenoPolis, Université Paris-Saclay, Jouy-en-Josas, France

- M. Luz Calle
- Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain
- IRIS-CC, Fundació Institut de Recerca i Innovació en Ciències de la Vida i la Salut a la Catalunya Central, Vic, Barcelona, Spain

- Michelangelo Ceci
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
- Big Data Lab, National Interuniversity Consortium for Informatics, Rome, Italy

- Hatice Duman
- Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye

- Aycan Gündoğdu
- Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Türkiye
- Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Türkiye

- Aki S. Havulinna
- Finnish Institute for Health and Welfare - THL, Helsinki, Finland
- Institute for Molecular Medicine Finland, FIMM-HiLIFE, Helsinki, Finland

- Eglantina Kalluci
- Department of Applied Mathematics, Faculty of Natural Sciences, University of Tirana, Tirana, Albania

- Sercan Karav
- Department of Molecular Biology and Genetics, Çanakkale Onsekiz Mart University, Çanakkale, Türkiye

- Daniel Lode
- Division Molecular Biotechnology and Functional Genomics, Technical University of Applied Sciences Wildau, Wildau, Germany

- Marta B. Lopes
- Department of Mathematics, Center for Mathematics and Applications (NOVA Math), NOVA School of Science and Technology, Caparica, Portugal
- UNIDEMI, Department of Mechanical and Industrial Engineering, NOVA School of Science and Technology, Caparica, Portugal

- Patrick May
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg

- Bram Nap
- School of Medicine, University of Galway, Galway, Ireland

- Miroslava Nedyalkova
- Department of Inorganic Chemistry, Faculty of Chemistry and Pharmacy, University of Sofia, Sofia, Bulgaria

- Inês Paciência
- Center for Environmental and Respiratory Health Research (CERH), Research Unit of Population Health, University of Oulu, Oulu, Finland
- Biocenter Oulu, University of Oulu, Oulu, Finland

- Lejla Pasic
- Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina

- Meritxell Pujolassos
- Faculty of Sciences, Technology and Engineering, University of Vic – Central University of Catalonia, Vic, Barcelona, Spain

- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway

- Antonio Susín
- Mathematical Department, UPC-Barcelona Tech, Barcelona, Spain

- Ines Thiele
- School of Medicine, University of Galway, Galway, Ireland
- APC Microbiome Ireland, University College Cork, Cork, Ireland

- Ciprian-Octavian Truică
- Computer Science and Engineering Department, Faculty of Automatic Control and Computers, National University of Science and Technology Politehnica, Bucharest, Romania

- Paul Wilmes
- Systems Ecology Group, Luxembourg Centre for Systems Biomedicine, Esch-sur-Alzette, Luxembourg
- Department of Life Sciences and Medicine, Faculty of Science, Technology and Medicine, University of Luxembourg, Belvaux, Luxembourg

- Ercument Yilmaz
- Department of Computer Technologies, Karadeniz Technical University, Trabzon, Türkiye

- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel

- Marcus Joakim Claesson
- APC Microbiome Ireland, University College Cork, Cork, Ireland
- School of Microbiology, University College Cork, Cork, Ireland

- Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia

8
Vich Vila A, Hu S, Andreu-Sánchez S, Collij V, Jansen BH, Augustijn HE, Bolte LA, Ruigrok RAAA, Abu-Ali G, Giallourakis C, Schneider J, Parkinson J, Al-Garawi A, Zhernakova A, Gacesa R, Fu J, Weersma RK. Faecal metabolome and its determinants in inflammatory bowel disease. Gut 2023; 72:1472-1485. [PMID: 36958817 PMCID: PMC10359577 DOI: 10.1136/gutjnl-2022-328048] [Citation(s) in RCA: 62] [Impact Index Per Article: 31.0] [Received: 06/14/2022] [Accepted: 03/05/2023] [Indexed: 03/25/2023]
Abstract
OBJECTIVE Inflammatory bowel disease (IBD) is a multifactorial immune-mediated inflammatory disease of the intestine, comprising Crohn's disease and ulcerative colitis. By characterising metabolites in faeces, combined with faecal metagenomics, host genetics and clinical characteristics, we aimed to unravel metabolic alterations in IBD.
DESIGN We measured 1684 different faecal metabolites and 8 short-chain and branched-chain fatty acids in stool samples of 424 patients with IBD and 255 non-IBD controls. Regression analyses were used to compare concentrations of metabolites between cases and controls and determine the relationship between metabolites and each participant's lifestyle, clinical characteristics and gut microbiota composition. Moreover, genome-wide association analysis was conducted on faecal metabolite levels.
RESULTS We identified over 300 molecules that were differentially abundant in the faeces of patients with IBD. The ratio between a sphingolipid and L-urobilin could discriminate between IBD and non-IBD samples (AUC=0.85). We found changes in the bile acid pool in patients with dysbiotic microbial communities and a strong association between faecal metabolome and gut microbiota. For example, the abundance of Ruminococcus gnavus was positively associated with tryptamine levels. In addition, we found 158 associations between metabolites and dietary patterns, and polymorphisms near NAT2 strongly associated with coffee metabolism.
CONCLUSION In this large-scale analysis, we identified alterations in the metabolome of patients with IBD that are independent of commonly overlooked confounders such as diet and surgical history. Considering the influence of the microbiome on faecal metabolites, our results pave the way for future interventions targeting intestinal inflammation.
Affiliation(s)
- Arnau Vich Vila
- Department of Genetics, University Medical Centre, Groningen, The Netherlands
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands

- Shixian Hu
- Department of Genetics, University Medical Centre, Groningen, The Netherlands
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands

- Sergio Andreu-Sánchez
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands
- Department of Gastroenterology and Hepatology, University Medical Centre, Groningen, The Netherlands

- Valerie Collij
- Department of Genetics, University Medical Centre, Groningen, The Netherlands
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands

- Bernadien H Jansen
- Department of Genetics, University Medical Centre, Groningen, The Netherlands

- Hannah E Augustijn
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands

- Laura A Bolte
- Department of Genetics, University Medical Centre, Groningen, The Netherlands

- Renate A A A Ruigrok
- Department of Genetics, University Medical Centre, Groningen, The Netherlands
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands

- Galeb Abu-Ali
- Gastroenterology Drug Discovery Unit, Takeda Pharmaceutical, Cambridge, Massachusetts, USA

- Cosmas Giallourakis
- Gastroenterology Drug Discovery Unit, Takeda Pharmaceutical, Cambridge, Massachusetts, USA

- Jessica Schneider
- Gastroenterology Drug Discovery Unit, Takeda Pharmaceutical, Cambridge, Massachusetts, USA

- John Parkinson
- Gastroenterology Drug Discovery Unit, Takeda Pharmaceutical, Cambridge, Massachusetts, USA

- Amal Al-Garawi
- Gastroenterology Drug Discovery Unit, Takeda Pharmaceutical, Cambridge, Massachusetts, USA

- Ranko Gacesa
- Department of Genetics, University Medical Centre, Groningen, The Netherlands
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands

- Jingyuan Fu
- Department of Pediatrics, University Medical Centre, Groningen, The Netherlands
- Department of Gastroenterology and Hepatology, University Medical Centre, Groningen, The Netherlands

- Rinse K Weersma
- Department of Genetics, University Medical Centre, Groningen, The Netherlands

9
Busato S, Gordon M, Chaudhari M, Jensen I, Akyol T, Andersen S, Williams C. Compositionality, sparsity, spurious heterogeneity, and other data-driven challenges for machine learning algorithms within plant microbiome studies. Curr Opin Plant Biol 2023; 71:102326. [PMID: 36538837 PMCID: PMC9925409 DOI: 10.1016/j.pbi.2022.102326] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 07/31/2022] [Revised: 11/08/2022] [Accepted: 11/21/2022] [Indexed: 06/17/2023]
Abstract
The plant-associated microbiome is a key component of plant systems, contributing to their health, growth, and productivity. The application of machine learning (ML) in this field promises to help untangle the relationships involved. However, measurements of microbial communities by high-throughput sequencing pose challenges for ML. Noise from low sample sizes, soil heterogeneity, and technical factors can impact the performance of ML. Additionally, the compositional and sparse nature of these datasets can impact the predictive accuracy of ML. We review recent literature from plant studies to illustrate that these properties often go unmentioned. We expand our analysis to other fields to quantify the degree to which mitigation approaches improve the performance of ML and describe the mathematical basis for this. With the advent of accessible analytical packages for microbiome data including learning models, researchers must be familiar with the nature of their datasets.
Affiliation(s)
- Sebastiano Busato
- Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA

- Max Gordon
- Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA

- Meenal Chaudhari
- Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA

- Ib Jensen
- Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark

- Turgut Akyol
- Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark

- Stig Andersen
- Department of Molecular Biology and Genetics, Aarhus University, Aarhus, Denmark

- Cranos Williams
- Department of Electrical and Computer Engineering, North Carolina State University, Raleigh, USA; NC Plant Sciences Initiative, North Carolina State University, Raleigh, USA; Department of Plant and Microbial Biology, North Carolina State University, Raleigh, USA

10
Shtossel O, Isakov H, Turjeman S, Koren O, Louzoun Y. Ordering taxa in image convolution networks improves microbiome-based machine learning accuracy. Gut Microbes 2023; 15:2224474. [PMID: 37345233 PMCID: PMC10288916 DOI: 10.1080/19490976.2023.2224474] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 12/15/2022] [Accepted: 06/08/2023] [Indexed: 06/23/2023] Open
Abstract
The human gut microbiome is associated with a large number of disease etiologies. As such, it is a natural candidate for machine-learning-based biomarker development for multiple diseases and conditions. The microbiome is often analyzed using 16S rRNA gene sequencing or shotgun metagenomics. However, several properties of microbial sequence-based studies hinder machine learning (ML), including non-uniform representation, a small number of samples compared with the dimension of each sample, and sparsity of the data, with the majority of taxa present in a small subset of samples. We show here using a graph representation that the cladogram structure is as informative as the taxa frequency. We then suggest a novel method to combine information from different taxa and improve data representation for ML using microbial taxonomy. iMic (image microbiome) translates the microbiome to images through an iterative ordering scheme, and applies convolutional neural networks to the resulting image. We show that iMic has a higher precision in static microbiome gene sequence-based ML than state-of-the-art methods. iMic also facilitates the interpretation of the classifiers through an explainable artificial intelligence (AI) algorithm to iMic to detect taxa relevant to each condition. iMic is then extended to dynamic microbiome samples by translating them to movies.
Affiliation(s)
- Oshrit Shtossel
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel

- Haim Isakov
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel

- Sondra Turjeman
- The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel

- Omry Koren
- The Azrieli Faculty of Medicine, Bar-Ilan University, Safed, Israel

- Yoram Louzoun
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel

11
Krohn C, Khudur L, Dias DA, van den Akker B, Rees CA, Crosbie ND, Surapaneni A, O'Carroll DM, Stuetz RM, Batstone DJ, Ball AS. The role of microbial ecology in improving the performance of anaerobic digestion of sewage sludge. Front Microbiol 2022; 13:1079136. [PMID: 36590430 PMCID: PMC9801413 DOI: 10.3389/fmicb.2022.1079136] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 10/25/2022] [Accepted: 11/28/2022] [Indexed: 12/15/2022] Open
Abstract
The use of next-generation diagnostic tools to optimise the anaerobic digestion of municipal sewage sludge has the potential to increase renewable natural gas recovery, improve the reuse of biosolid fertilisers and help operators expand circular economies globally. This review aims to provide perspectives on the role of microbial ecology in improving digester performance in wastewater treatment plants, highlighting that a systems biology approach is fundamental for monitoring mesophilic anaerobic sewage sludge in continuously stirred reactor tanks. We further highlight the potential applications arising from investigations into sludge ecology. The principal limitation for improvements in methane recoveries or in process stability of anaerobic digestion, especially after pre-treatment or during co-digestion, are ecological knowledge gaps related to the front-end metabolism (hydrolysis and fermentation). Operational problems such as stable biological foaming are a key problem, for which ecological markers are a suitable approach. However, no biomarkers exist yet to assist in monitoring and management of clade-specific foaming potentials along with other risks, such as pollutants and pathogens. Fundamental ecological principles apply to anaerobic digestion, which presents opportunities to predict and manipulate reactor functions. The path ahead for mapping ecological markers on process endpoints and risk factors of anaerobic digestion will involve numerical ecology, an expanding field that employs metrics derived from alpha, beta, phylogenetic, taxonomic, and functional diversity, as well as from phenotypes or life strategies derived from genetic potentials. In contrast to addressing operational issues (as noted above), which are effectively addressed by whole population or individual biomarkers, broad improvement and optimisation of function will require enhancement of hydrolysis and acidogenic processes. 
This will require a discovery-based approach, which will involve integrative research involving the proteome and metabolome. This will utilise, but overcome current limitations of DNA-centric approaches, and likely have broad application outside the specific field of anaerobic digestion.
Affiliation(s)
- Christian Krohn
- ARC Training Centre for the Transformation of Australia's Biosolids Resource, RMIT University, Bundoora, VIC, Australia
- *Correspondence: Christian Krohn

- Leadin Khudur
- ARC Training Centre for the Transformation of Australia's Biosolids Resource, RMIT University, Bundoora, VIC, Australia

- Daniel Anthony Dias
- School of Health and Biomedical Sciences, Discipline of Laboratory Medicine, STEM College, RMIT University, Bundoora, VIC, Australia

- Aravind Surapaneni
- ARC Training Centre for the Transformation of Australia's Biosolids Resource, RMIT University, Bundoora, VIC, Australia

- Denis M. O'Carroll
- Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Sydney, NSW, Australia

- Richard M. Stuetz
- Water Research Centre, School of Civil and Environmental Engineering, University of New South Wales, Sydney, NSW, Australia

- Damien J. Batstone
- ARC Training Centre for the Transformation of Australia's Biosolids Resource, RMIT University, Bundoora, VIC, Australia
- Australian Centre for Water and Environmental Biotechnology, Gehrmann Building, The University of Queensland, Brisbane, QLD, Australia

- Andrew S. Ball
- ARC Training Centre for the Transformation of Australia's Biosolids Resource, RMIT University, Bundoora, VIC, Australia

12
Gao Y, O’Hely M, Quinn TP, Ponsonby AL, Harrison LC, Frøkiær H, Tang MLK, Brix S, Kristiansen K, Burgner D, Saffery R, Ranganathan S, Collier F, Vuillermin P. Maternal gut microbiota during pregnancy and the composition of immune cells in infancy. Front Immunol 2022; 13:986340. [PMID: 36211431 PMCID: PMC9535361 DOI: 10.3389/fimmu.2022.986340] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 07/05/2022] [Accepted: 08/30/2022] [Indexed: 11/13/2022] Open
Abstract
Background Preclinical studies have shown that maternal gut microbiota during pregnancy play a key role in prenatal immune development but the relevance of these findings to humans is unknown. The aim of this prebirth cohort study was to investigate the association between the maternal gut microbiota in pregnancy and the composition of the infant’s cord and peripheral blood immune cells over the first year of life.
Methods The Barwon Infant Study cohort (n=1074 infants) was recruited using an unselected sampling frame. Maternal fecal samples were collected at 36 weeks of pregnancy and flow cytometry was conducted on cord/peripheral blood collected at birth, 6 and 12 months of age. Among a randomly selected sub-cohort with available samples (n=293), maternal gut microbiota was characterized by sequencing the 16S rRNA V4 region. Operational taxonomic units (OTUs) were clustered based on their abundance. Associations between maternal fecal microbiota clusters and infant granulocyte, monocyte and lymphocyte subsets were explored using compositional data analysis. Partial least squares (PLS) and regression models were used to investigate the relationships/associations between environmental, maternal and infant factors, and OTU clusters.
Results We identified six clusters of co-occurring OTUs. The first two components in the PLS regression explained 39% and 33% of the covariance between the maternal prenatal OTU clusters and immune cell populations in offspring at birth. A cluster in which Dialister, Escherichia, and Ruminococcus were predominant was associated with a lower proportion of granulocytes (p=0.002), and higher proportions of both central naïve CD4+ T cells (CD4+/CD45RA+/CD31−) (p<0.001) and naïve regulatory T cells (Treg) (CD4+/CD45RA+/FoxP3low) (p=0.02) in cord blood. The association with central naïve CD4+ T cells persisted to 12 months of age.
Conclusion This birth cohort study provides evidence consistent with past preclinical models that the maternal gut microbiota during pregnancy plays a role in shaping the composition of innate and adaptive elements of the infant’s immune system following birth.
Affiliation(s)
- Yuan Gao
- School of Medicine, Deakin University, Geelong, VIC, Australia
- Child Health Research Unit, Barwon Health, Geelong, VIC, Australia
- Faculty of Science, Copenhagen University, København, Denmark

- Martin O’Hely
- School of Medicine, Deakin University, Geelong, VIC, Australia
- Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, VIC, Australia

- Anne-Louise Ponsonby
- Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, VIC, Australia
- The Early Brain Science Department, Florey Institute of Neuroscience and Mental Health, Melbourne, VIC, Australia

- Leonard C. Harrison
- Population Health and Immunity Division, Walter and Eliza Hall Institute of Medical Research, Melbourne, VIC, Australia
- Department of Medical Biology, University of Melbourne, Melbourne, VIC, Australia

- Hanne Frøkiær
- Faculty of Science, Copenhagen University, København, Denmark

- Mimi L. K. Tang
- Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, VIC, Australia
- Department of Pediatrics, University of Melbourne, Melbourne, VIC, Australia

- Susanne Brix
- Department of Biotechnology and Biomedicine, Technical University of Denmark, Kongens Lyngby, Denmark

- Karsten Kristiansen
- Laboratory of Genomics and Molecular Biomedicine, Department of Biology, University of Copenhagen, Copenhagen, Denmark

- Dave Burgner
- Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, VIC, Australia

- Richard Saffery
- Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, VIC, Australia
- Department of Pediatrics, University of Melbourne, Melbourne, VIC, Australia

- Sarath Ranganathan
- Murdoch Children’s Research Institute, Royal Children’s Hospital, Melbourne, VIC, Australia
- Department of Pediatrics, University of Melbourne, Melbourne, VIC, Australia

- Fiona Collier
- School of Medicine, Deakin University, Geelong, VIC, Australia

- Peter Vuillermin
- School of Medicine, Deakin University, Geelong, VIC, Australia
- Child Health Research Unit, Barwon Health, Geelong, VIC, Australia
- *Correspondence: Peter Vuillermin

13
Boyraz A, Pawlowsky-Glahn V, Egozcue JJ, Acar AC. Principal microbial groups: compositional alternative to phylogenetic grouping of microbiome data. Brief Bioinform 2022; 23:6675749. [PMID: 36007229 DOI: 10.1093/bib/bbac328] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/01/2022] [Revised: 07/19/2022] [Accepted: 07/20/2022] [Indexed: 11/13/2022] Open
Abstract
Statistical and machine learning techniques based on relative abundances have been used to predict health conditions and to identify microbial biomarkers. However, high dimensionality, sparsity and the compositional nature of microbiome data represent statistical challenges. On the other hand, the taxon grouping allows summarizing microbiome abundance with a coarser resolution in a lower dimension, but it presents new challenges when correlating taxa with a disease. In this work, we present a novel approach that groups Operational Taxonomical Units (OTUs) based only on relative abundances as an alternative to taxon grouping. The proposed procedure acknowledges the compositional data making use of principal balances. The identified groups are called Principal Microbial Groups (PMGs). The procedure reduces the need for user-defined aggregation of OTUs and offers the possibility of working with coarse group of OTUs, which are not present in a phylogenetic tree. PMGs can be used for two different goals: (1) as a dimensionality reduction method for compositional data, (2) as an aggregation procedure that provides an alternative to taxon grouping for construction of microbial balances afterward used for disease prediction. We illustrate the procedure with a cirrhosis study data. PMGs provide a coherent data analysis for the search of biomarkers in human microbiota. The source code and demo data for PMGs are available at: https://github.com/asliboyraz/PMGs.
Affiliation(s)
- Aslı Boyraz
- Department of Computer Programming, Recep Tayyip Erdoğan University, Ardeşen Vocational School, Rize, 53400, Turkey

- Vera Pawlowsky-Glahn
- Department of Computer Sciences, Applied Mathematics and Statistics, University of Girona, Campus Montilivi, 17003 Girona, Spain

- Juan José Egozcue
- Department of Civil and Environmental Engineering, Universitat Politécnica de Catalunya, Barcelona, 08034, Spain

- Aybar Can Acar
- Department of Medical Informatics, Middle East Technical University, Ankara, Turkey

14
Coenders G, Greenacre M. Three approaches to supervised learning for compositional data with pairwise logratios. J Appl Stat 2022; 50:3272-3293. [PMID: 37969895 PMCID: PMC10637191 DOI: 10.1080/02664763.2022.2108007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2022] [Accepted: 07/25/2022] [Indexed: 10/15/2022]
Abstract
Logratios between pairs of compositional parts (pairwise logratios) are the easiest to interpret in compositional data analysis, and include the well-known additive logratios as particular cases. When the number of parts is large (sometimes even larger than the number of cases), some form of logratio selection is needed. In this article, we present three alternative stepwise supervised learning methods to select the pairwise logratios that best explain a dependent variable in a generalized linear model, each geared for a specific problem. The first method features unrestricted search, where any pairwise logratio can be selected. This method has a complex interpretation if some pairs of parts in the logratios overlap, but it leads to the most accurate predictions. The second method restricts parts to occur only once, which makes the corresponding logratios intuitively interpretable. The third method uses additive logratios, so that K-1 selected logratios involve a K-part subcomposition. Our approach allows logratios or non-compositional covariates to be forced into the models based on theoretical knowledge, and various stopping criteria are available based on information measures or statistical significance with the Bonferroni correction. We present an application on a dataset from a study predicting Crohn's disease.
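As a rough illustration of the terms used in this abstract, the sketch below computes all pairwise logratios for a small composition and the additive logratios (ALR) against one reference part. It is not the authors' stepwise selection procedure. The function names, part labels, and toy values are assumptions made for this example, and the data are assumed strictly positive (zeros would need separate handling before taking logs).

```python
import itertools
import numpy as np

def pairwise_logratios(X, names):
    """Map 'a/b' to log(X[:, a] / X[:, b]) for every unordered pair of parts."""
    logX = np.log(X)
    return {
        f"{names[i]}/{names[j]}": logX[:, i] - logX[:, j]
        for i, j in itertools.combinations(range(X.shape[1]), 2)
    }

def additive_logratios(X, names, ref):
    """ALR: logratio of every other part against a single reference part."""
    logX = np.log(X)
    r = names.index(ref)
    return {
        f"{names[i]}/{ref}": logX[:, i] - logX[:, r]
        for i in range(X.shape[1]) if i != r
    }

# Hypothetical 3-part compositions (rows sum to 1); illustrative values only.
X = np.array([[0.5, 0.3, 0.2],
              [0.2, 0.2, 0.6]])
names = ["A", "B", "C"]

plr = pairwise_logratios(X, names)            # K(K-1)/2 = 3 logratios
alr = additive_logratios(X, names, ref="C")   # K-1 = 2 logratios
```

With K parts there are K(K-1)/2 pairwise logratios but only K-1 additive logratios, and each ALR is one of the pairwise logratios with the reference part as denominator. Because only ratios enter, rescaling every row (for example, raw counts instead of proportions) leaves the values unchanged.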
Affiliation(s)
- Germà Coenders
- Department of Economics, Universitat de Girona, Girona, Spain

- Michael Greenacre
- Department of Economics and Business and Barcelona School of Management, Universitat Pompeu Fabra, Barcelona, Spain

15
Ostner J, Carcy S, Müller CL. tascCODA: Bayesian Tree-Aggregated Analysis of Compositional Amplicon and Single-Cell Data. Front Genet 2021; 12:766405. [PMID: 34950190 PMCID: PMC8689185 DOI: 10.3389/fgene.2021.766405] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/29/2021] [Accepted: 11/01/2021] [Indexed: 12/11/2022] Open
Abstract
Accurate generative statistical modeling of count data is of critical relevance for the analysis of biological datasets from high-throughput sequencing technologies. Important instances include the modeling of microbiome compositions from amplicon sequencing surveys and the analysis of cell type compositions derived from single-cell RNA sequencing. Microbial and cell type abundance data share remarkably similar statistical features, including their inherent compositionality and a natural hierarchical ordering of the individual components from taxonomic or cell lineage tree information, respectively. To this end, we introduce a Bayesian model for tree-aggregated amplicon and single-cell compositional data analysis (tascCODA) that seamlessly integrates hierarchical information and experimental covariate data into the generative modeling of compositional count data. By combining latent parameters based on the tree structure with spike-and-slab Lasso penalization, tascCODA can determine covariate effects across different levels of the population hierarchy in a data-driven parsimonious way. In the context of differential abundance testing, we validate tascCODA's excellent performance on a comprehensive set of synthetic benchmark scenarios. Our analyses on human single-cell RNA-seq data from ulcerative colitis patients and amplicon data from patients with irritable bowel syndrome, respectively, identified aggregated cell type and taxon compositional changes that were more predictive and parsimonious than those proposed by other schemes. We posit that tascCODA constitutes a valuable addition to the growing statistical toolbox for generative modeling and analysis of compositional changes in microbial or cell population data.
Affiliation(s)
- Johannes Ostner
- Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany

- Salomé Carcy
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Department of Biology, École Normale Supérieure, PSL University, Paris, France

- Christian L. Müller
- Department of Statistics, Ludwig-Maximilians-Universität München, Munich, Germany
- Institute of Computational Biology, Helmholtz Zentrum München, Munich, Germany
- Center for Computational Mathematics, Flatiron Institute, New York, NY, United States