1
|
Jiang Y, Liao D, Zhu Q, Lu YY. PhyloMix: enhancing microbiome-trait association prediction through phylogeny-mixing augmentation. Bioinformatics 2025; 41:btaf014. [PMID: 39799515 PMCID: PMC11849959 DOI: 10.1093/bioinformatics/btaf014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/27/2024] [Revised: 12/14/2024] [Accepted: 01/09/2025] [Indexed: 01/15/2025] Open
Abstract
MOTIVATION Understanding the associations between traits and microbial composition is a fundamental objective in microbiome research. Recently, researchers have turned to machine learning (ML) models to achieve this goal with promising results. However, the effectiveness of advanced ML models is often limited by the unique characteristics of microbiome data, which are typically high-dimensional, compositional, and imbalanced. These characteristics can hinder the models' ability to fully explore the relationships among taxa in predictive analyses. To address this challenge, data augmentation has become crucial. It involves generating synthetic samples with artificial labels based on existing data and incorporating these samples into the training set to improve ML model performance. RESULTS Here, we propose PhyloMix, a novel data augmentation method specifically designed for microbiome data to enhance predictive analyses. PhyloMix leverages the phylogenetic relationships among microbiome taxa as an informative prior to guide the generation of synthetic microbial samples. Leveraging phylogeny, PhyloMix creates new samples by removing a subtree from one sample and combining it with the corresponding subtree from another sample. Notably, PhyloMix is designed to address the compositional nature of microbiome data, effectively handling both raw counts and relative abundances. This approach introduces sufficient diversity into the augmented samples, leading to improved predictive performance. We empirically evaluated PhyloMix on six real microbiome datasets across five commonly used ML models. PhyloMix significantly outperforms distinct baseline methods including sample-mixing-based data augmentation techniques like vanilla mixup and compositional cutmix, as well as the phylogeny-based method TADA. We also demonstrated the wide applicability of PhyloMix in both supervised learning and contrastive representation learning. AVAILABILITY AND IMPLEMENTATION The Apache-licensed source code is available at (https://github.com/batmen-lab/phylomix).
Collapse
Affiliation(s)
- Yifan Jiang
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
| | - Disen Liao
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
| | - Qiyun Zhu
- School of Life Sciences, Arizona State University, Tempe, AZ, 85281, United States
| | - Yang Young Lu
- Cheriton School of Computer Science, University of Waterloo, Waterloo, ON, N2L 3G1, Canada
| |
Collapse
|
2
|
Xia Y. Statistical normalization methods in microbiome data with application to microbiome cancer research. Gut Microbes 2023; 15:2244139. [PMID: 37622724 PMCID: PMC10461514 DOI: 10.1080/19490976.2023.2244139] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 02/20/2023] [Revised: 07/12/2023] [Accepted: 07/31/2023] [Indexed: 08/26/2023] Open
Abstract
Mounting evidence has shown that gut microbiome is associated with various cancers, including gastrointestinal (GI) tract and non-GI tract cancers. But microbiome data have unique characteristics and pose major challenges when using standard statistical methods causing results to be invalid or misleading. Thus, to analyze microbiome data, it not only needs appropriate statistical methods, but also requires microbiome data to be normalized prior to statistical analysis. Here, we first describe the unique characteristics of microbiome data and the challenges in analyzing them (Section 2). Then, we provide an overall review on the available normalization methods of 16S rRNA and shotgun metagenomic data along with examples of their applications in microbiome cancer research (Section 3). In Section 4, we comprehensively investigate how the normalization methods of 16S rRNA and shotgun metagenomic data are evaluated. Finally, we summarize and conclude with remarks on statistical normalization methods (Section 5). Altogether, this review aims to provide a broad and comprehensive view and remarks on the promises and challenges of the statistical normalization methods in microbiome data with microbiome cancer research examples.
Collapse
Affiliation(s)
- Yinglin Xia
- Division of Gastroenterology and Hepatology, Department of Medicine, University of Illinois Chicago, Chicago, USA
| |
Collapse
|
3
|
Hernández Medina R, Kutuzova S, Nielsen KN, Johansen J, Hansen LH, Nielsen M, Rasmussen S. Machine learning and deep learning applications in microbiome research. ISME COMMUNICATIONS 2022; 2:98. [PMID: 37938690 PMCID: PMC9723725 DOI: 10.1038/s43705-022-00182-9] [Citation(s) in RCA: 105] [Impact Index Per Article: 35.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/24/2022] [Revised: 09/12/2022] [Accepted: 09/16/2022] [Indexed: 05/27/2023]
Abstract
The many microbial communities around us form interactive and dynamic ecosystems called microbiomes. Though concealed from the naked eye, microbiomes govern and influence macroscopic systems including human health, plant resilience, and biogeochemical cycling. Such feats have attracted interest from the scientific community, which has recently turned to machine learning and deep learning methods to interrogate the microbiome and elucidate the relationships between its composition and function. Here, we provide an overview of how the latest microbiome studies harness the inductive prowess of artificial intelligence methods. We start by highlighting that microbiome data - being compositional, sparse, and high-dimensional - necessitates special treatment. We then introduce traditional and novel methods and discuss their strengths and applications. Finally, we discuss the outlook of machine and deep learning pipelines, focusing on bottlenecks and considerations to address them.
Collapse
Affiliation(s)
- Ricardo Hernández Medina
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
| | - Svetlana Kutuzova
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
| | - Knud Nor Nielsen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
| | - Joachim Johansen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
| | - Lars Hestbjerg Hansen
- Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
| | - Mads Nielsen
- Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark.
| | - Simon Rasmussen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark.
| |
Collapse
|
4
|
Díez López C, Montiel González D, Vidaki A, Kayser M. Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning. Front Microbiol 2022; 13:886201. [PMID: 35928158 PMCID: PMC9343866 DOI: 10.3389/fmicb.2022.886201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2022] [Accepted: 06/21/2022] [Indexed: 11/24/2022] Open
Abstract
Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
Collapse
Affiliation(s)
| | | | | | - Manfred Kayser
- Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, Netherlands
| |
Collapse
|
5
|
Youngblut ND, de la Cuesta-Zuluaga J, Ley RE. Incorporating genome-based phylogeny and functional similarity into diversity assessments helps to resolve a global collection of human gut metagenomes. Environ Microbiol 2022; 24:3966-3984. [PMID: 35049120 DOI: 10.1111/1462-2920.15910] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2021] [Accepted: 01/15/2022] [Indexed: 11/29/2022]
Abstract
Tree-based diversity measures incorporate phylogenetic or functional relatedness into comparisons of microbial communities. This can improve the identification of explanatory factors compared to tree-agnostic diversity measures. However, applying tree-based diversity measures to metagenome data is more challenging than for single-locus sequencing (e.g., 16S rRNA gene). Utilizing the Genome Taxonomy Database (GTDB) for species-level metagenome profiling allows for functional diversity measures based on genomic content or traits inferred from it. Still, it is unclear how metagenome-based assessments of microbiome diversity benefit from incorporating phylogeny or function into measures of diversity. We assessed this by measuring phylogeny-based, function-based, and tree-agnostic diversity measures from a large, global collection of human gut metagenomes composed of 30 studies and 2943 samples. We found tree-based measures to explain phenotypic variation (e.g., westernization, disease status, and gender) better or equivalent to tree-agnostic measures. Ecophylogenetic and functional diversity measures provided unique insight into how microbiome diversity was partitioned by phenotype. Tree-based measures greatly improved machine learning model performance for predicting westernization, disease status, and gender, relative to models trained solely on tree-agnostic measures. Our findings illustrate the usefulness of tree- and function-based measures for metagenomic assessments of microbial diversity, which is a fundamental component of microbiome science. This article is protected by copyright. All rights reserved.
Collapse
Affiliation(s)
- Nicholas D Youngblut
- Department of Microbiome Science, Max Planck Institute for Developmental Biology, Max Planck Ring 5, 72076, Tübingen, Germany
| | - Jacobo de la Cuesta-Zuluaga
- Department of Microbiome Science, Max Planck Institute for Developmental Biology, Max Planck Ring 5, 72076, Tübingen, Germany
| | - Ruth E Ley
- Department of Microbiome Science, Max Planck Institute for Developmental Biology, Max Planck Ring 5, 72076, Tübingen, Germany
| |
Collapse
|
6
|
García-Jiménez B, Muñoz J, Cabello S, Medina J, Wilkinson MD. Predicting microbiomes through a deep latent space. Bioinformatics 2021; 37:1444-1451. [PMID: 33289510 PMCID: PMC8208755 DOI: 10.1093/bioinformatics/btaa971] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/17/2020] [Revised: 10/21/2020] [Accepted: 11/06/2020] [Indexed: 12/28/2022] Open
Abstract
Motivation Microbial communities influence their environment by modifying the availability of compounds, such as nutrients or chemical elicitors. Knowing the microbial composition of a site is therefore relevant to improve productivity or health. However, sequencing facilities are not always available, or may be prohibitively expensive in some cases. Thus, it would be desirable to computationally predict the microbial composition from more accessible, easily-measured features. Results Integrating deep learning techniques with microbiome data, we propose an artificial neural network architecture based on heterogeneous autoencoders to condense the long vector of microbial abundance values into a deep latent space representation. Then, we design a model to predict the deep latent space and, consequently, to predict the complete microbial composition using environmental features as input. The performance of our system is examined using the rhizosphere microbiome of Maize. We reconstruct the microbial composition (717 taxa) from the deep latent space (10 values) with high fidelity (>0.9 Pearson correlation). We then successfully predict microbial composition from environmental variables, such as plant age, temperature or precipitation (0.73 Pearson correlation, 0.42 Bray–Curtis). We extend this to predict microbiome composition under hypothetical scenarios, such as future climate change conditions. Finally, via transfer learning, we predict microbial composition in a distinct scenario with only 100 sequences, and distinct environmental features. We propose that our deep latent space may assist microbiome-engineering strategies when technical or financial resources are limited, through predicting current or future microbiome compositions. Availability and implementation Software, results and data are available at https://github.com/jorgemf/DeepLatentMicrobiome Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Beatriz García-Jiménez
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
| | - Jorge Muñoz
- Serendeepia Research, 28905 Getafe (Madrid), Spain
| | - Sara Cabello
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
| | - Joaquín Medina
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
| | - Mark D Wilkinson
- Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain.,Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid (UPM), Madrid, Spain
| |
Collapse
|