1
Jiang Y, Aton M, Zhu Q, Lu YY. Modeling microbiome-trait associations with taxonomy-adaptive neural networks. Microbiome 2025; 13:87. [PMID: 40158141 PMCID: PMC11954268 DOI: 10.1186/s40168-025-02080-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Accepted: 03/04/2025] [Indexed: 04/01/2025]
Abstract
The human microbiome, a complex ecosystem of microorganisms inhabiting the body, plays a critical role in human health. Investigating its association with host traits is essential for understanding its impact on various diseases. Although shotgun metagenomic sequencing technologies have produced vast amounts of microbiome data, analyzing such data is highly challenging due to its sparsity, noisiness, and high feature dimensionality. Here, we develop MIOSTONE, an accurate and interpretable neural network model for microbiome-disease association that simulates a real taxonomy by encoding the relationships among microbial features. The taxonomy-encoding architecture provides a natural bridge from variations in microbial taxa abundance to variations in traits, encompassing increasingly coarse scales from species to domains. MIOSTONE has the ability to determine whether taxa within the corresponding taxonomic group provide a better explanation in a data-driven manner. MIOSTONE serves as an effective predictive model, as it not only accurately predicts microbiome-trait associations across extensive simulated and real datasets but also offers interpretability for scientific discovery. Both attributes are crucial for facilitating in silico investigations into the biological mechanisms underlying such associations among microbial taxa. Video Abstract.
Affiliation(s)
- Yifan Jiang
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
- Matthew Aton
  - School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Qiyun Zhu
  - School of Life Sciences, Arizona State University, Tempe, AZ, USA
- Yang Young Lu
  - Cheriton School of Computer Science, University of Waterloo, Waterloo, Ontario, Canada
2
Song Y, Atza E, Sánchez-Gil JJ, Akkermans D, de Jonge R, de Rooij PGH, Kakembo D, Bakker PAHM, Pieterse CMJ, Budko NV, Berendsen RL. Seed tuber microbiome can predict growth potential of potato varieties. Nat Microbiol 2025; 10:28-40. [PMID: 39730984 DOI: 10.1038/s41564-024-01872-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2024] [Accepted: 10/31/2024] [Indexed: 12/29/2024]
Abstract
Potato vigour, the growth potential of seed potatoes, is a key agronomic trait that varies significantly across production fields due to factors such as genetic background and environmental conditions. Seed tuber microbiomes are thought to influence plant health and crop performance, yet the precise relationships between microbiome composition and potato vigour remain unclear. Here we conducted microbiome sequencing on seed tuber eyes and heel ends from 6 potato varieties grown in 240 fields. By using time-resolved drone imaging of three trial fields in the next season to track crop development, we were able to link microbiome composition with potato vigour. We used microbiome data at varying taxonomic resolutions to build random forest predictive models and found that amplicon sequence variants provided the highest predictive accuracy for potato vigour. The model revealed variety-specific relationships between the seed tuber microbiome and next season's crop vigour in independent trial fields. With a coefficient of determination value of 0.69 for the best-performing variety, the model accurately predicted vigour in seed tubers from fields not previously included in the analysis. Moreover, the model identified key microbial indicators of vigour from which a Streptomyces, an Acinetobacter and a Cellvibrio amplicon sequence variant stood out as the most important contributors to the model's accuracy. This study shows that seed potato vigour can be reliably predicted based on the microbiota associated with seed tuber eyes, potentially guiding future microbiome-informed breeding strategies.
Affiliation(s)
- Yang Song
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
- Elisa Atza
  - Numerical Analysis, Delft Institute of Applied Mathematics, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, the Netherlands
- Juan J Sánchez-Gil
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
- Doretta Akkermans
  - HZPC Research B.V., Department of Plant Pathology, Metslawier, the Netherlands
- Ronnie de Jonge
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
  - AI Technology for Life, Department of Information and Computing Sciences, Science4Life, Utrecht University, Utrecht, the Netherlands
- Peter G H de Rooij
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
- David Kakembo
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
- Peter A H M Bakker
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
- Corné M J Pieterse
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
- Neil V Budko
  - Numerical Analysis, Delft Institute of Applied Mathematics, Faculty of Electrical Engineering, Mathematics and Computer Science, Delft University of Technology, Delft, the Netherlands
- Roeland L Berendsen
  - Plant-Microbe Interactions, Institute of Environmental Biology, Department of Biology, Science4Life, Utrecht University, Utrecht, the Netherlands
3
Hosseiniyan Khatibi SM, Dimaano NG, Veliz E, Sundaresan V, Ali J. Exploring and exploiting the rice phytobiome to tackle climate change challenges. Plant Communications 2024; 5:101078. [PMID: 39233440 PMCID: PMC11671768 DOI: 10.1016/j.xplc.2024.101078] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/27/2024] [Revised: 08/07/2024] [Accepted: 09/02/2024] [Indexed: 09/06/2024]
Abstract
The future of agriculture is uncertain under the current climate change scenario. Climate change directly and indirectly affects the biotic and abiotic elements that control agroecosystems, jeopardizing the safety of the world's food supply. A new area that focuses on characterizing the phytobiome is emerging. The phytobiome comprises plants and their immediate surroundings, involving numerous interdependent microscopic and macroscopic organisms that affect the health and productivity of plants. Phytobiome studies primarily focus on the microbial communities associated with plants, which are referred to as the plant microbiome. The development of high-throughput sequencing technologies over the past 10 years has dramatically advanced our understanding of the structure, functionality, and dynamics of the phytobiome; however, comprehensive methods for using this knowledge are lacking, particularly for major crops such as rice. Considering the impact of rice production on world food security, gaining fresh perspectives on the interdependent and interrelated components of the rice phytobiome could enhance rice production and crop health, sustain rice ecosystem function, and combat the effects of climate change. Our review re-conceptualizes the complex dynamics of the microscopic and macroscopic components in the rice phytobiome as influenced by human interventions and changing environmental conditions driven by climate change. We also discuss interdisciplinary and systematic approaches to decipher and reprogram the sophisticated interactions in the rice phytobiome using novel strategies and cutting-edge technology. Merging the gigantic datasets and complex information on the rice phytobiome and their application in the context of regenerative agriculture could lead to sustainable rice farming practices that are resilient to the impacts of climate change.
Affiliation(s)
- Niña Gracel Dimaano
  - International Rice Research Institute, Los Baños, Laguna, Philippines
  - College of Agriculture and Food Science, University of the Philippines Los Baños, Los Baños, Laguna, Philippines
- Esteban Veliz
  - College of Biological Sciences, University of California, Davis, Davis, CA, USA
- Venkatesan Sundaresan
  - College of Biological Sciences, University of California, Davis, Davis, CA, USA
  - College of Agricultural and Environmental Sciences, University of California, Davis, Davis, CA, USA
- Jauhar Ali
  - International Rice Research Institute, Los Baños, Laguna, Philippines
4
Zhao H, Wang Y, Sun Y, Wang Y, Shi B, Liu J, Zhang S. Hematological indicator-based machine learning models for preoperative prediction of lymph node metastasis in cervical cancer. Front Oncol 2024; 14:1400109. [PMID: 39193382 PMCID: PMC11347340 DOI: 10.3389/fonc.2024.1400109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Accepted: 07/29/2024] [Indexed: 08/29/2024]
Abstract
Background Lymph node metastasis (LNM) is an important prognostic factor for cervical cancer (CC) and determines the treatment strategy. Hematological indicators have been reported as being useful biomarkers for the prognosis of a variety of cancers. This study aimed to evaluate the feasibility of machine learning models characterized by preoperative hematological indicators to predict the LNM status of CC patients before surgery. Methods The clinical data of 236 patients with pathologically confirmed CC were retrospectively analyzed at the Gynecology Oncology Department of the First Affiliated Hospital of Bengbu Medical University from November 2020 to August 2022. The least absolute shrinkage and selection operator (LASSO) was used to select 21 features from 35 hematological indicators and for the construction of 6 machine learning predictive models, including Adaptive Boosting (AdaBoost), Gaussian Naive Bayes (GNB), and Logistic Regression (LR), as well as Random Forest (RF), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGBoost). Evaluation metrics of predictive models included the area under the receiver operating characteristic curve (AUC), accuracy, specificity, sensitivity, and F1-score. Results RF has the best overall predictive performance for ten-fold cross-validation in the training set. The specific performance indicators of RF were AUC (0.910, 95% confidence interval [CI]: 0.820-1.000), accuracy (0.831, 95% CI: 0.702-0.960), specificity (0.835, 95% CI: 0.708-0.962), sensitivity (0.831, 95% CI: 0.702-0.960), and F1-score (0.829, 95% CI: 0.696-0.962). RF had the highest AUC in the testing set (AUC = 0.854). Conclusion RF based on preoperative hematological indicators that are easily available in clinical practice showed superior performance in the preoperative prediction of CC LNM. However, investigations on larger external cohorts of patients are required for further validation of our findings.
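The threshold-based metrics named in this abstract (accuracy, specificity, sensitivity, F1-score) can all be derived from the four cells of a binary confusion matrix. The following is a minimal sketch, not the authors' code: the function name `classification_metrics` and the label convention (1 = LNM-positive, 0 = LNM-negative) are assumptions made for illustration, and the F1 here is the plain binary F1, which may differ from the averaging the study used.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, sensitivity, specificity and F1 for binary labels.

    Assumed convention (not from the paper): 1 = positive, 0 = negative.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # true negatives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total if total else 0.0
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return {
        "accuracy": accuracy,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


if __name__ == "__main__":
    # Toy labels, purely illustrative.
    truth = [1, 1, 1, 0, 0, 0, 0, 1]
    predicted = [1, 1, 0, 0, 0, 1, 0, 1]
    print(classification_metrics(truth, predicted))
```

AUC is not covered by this sketch because it needs predicted scores rather than hard labels.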
Affiliation(s)
- Huan Zhao
  - School of Medical Imaging, Bengbu Medical University, Bengbu, Anhui, China
- Yuling Wang
  - Department of Gynecology and Oncology, First Affiliated Hospital, Bengbu Medical University, Bengbu, Anhui, China
- Yilin Sun
  - Department of Gynecology and Oncology, First Affiliated Hospital, Bengbu Medical University, Bengbu, Anhui, China
- Yongqiang Wang
  - School of Medical Imaging, Bengbu Medical University, Bengbu, Anhui, China
- Bo Shi
  - School of Medical Imaging, Bengbu Medical University, Bengbu, Anhui, China
- Jian Liu
  - Department of Gynecology and Oncology, First Affiliated Hospital, Bengbu Medical University, Bengbu, Anhui, China
- Sai Zhang
  - School of Medical Imaging, Bengbu Medical University, Bengbu, Anhui, China
5
Peralta-Marzal LN, Rojas-Velazquez D, Rigters D, Prince N, Garssen J, Kraneveld AD, Perez-Pardo P, Lopez-Rincon A. A robust microbiome signature for autism spectrum disorder across different studies using machine learning. Sci Rep 2024; 14:814. [PMID: 38191575 PMCID: PMC10774349 DOI: 10.1038/s41598-023-50601-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2023] [Accepted: 12/21/2023] [Indexed: 01/10/2024]
Abstract
Autism spectrum disorder (ASD) is a highly complex neurodevelopmental disorder characterized by deficits in sociability and repetitive behaviour, however there is a great heterogeneity within other comorbidities that accompany ASD. Recently, gut microbiome has been pointed out as a plausible contributing factor for ASD development as individuals diagnosed with ASD often suffer from intestinal problems and show a differentiated intestinal microbial composition. Nevertheless, gut microbiome studies in ASD rarely agree on the specific bacterial taxa involved in this disorder. Regarding the potential role of gut microbiome in ASD pathophysiology, our aim is to investigate whether there is a set of bacterial taxa relevant for ASD classification by using a sibling-controlled dataset. Additionally, we aim to validate these results across two independent cohorts as several confounding factors, such as lifestyle, influence both ASD and gut microbiome studies. A machine learning approach, recursive ensemble feature selection (REFS), was applied to 16S rRNA gene sequencing data from 117 subjects (60 ASD cases and 57 siblings) identifying 26 bacterial taxa that discriminate ASD cases from controls. The average area under the curve (AUC) of this specific set of bacteria in the sibling-controlled dataset was 81.6%. Moreover, we applied the selected bacterial taxa in a tenfold cross-validation scheme using two independent cohorts (a total of 223 samples-125 ASD cases and 98 controls). We obtained average AUCs of 74.8% and 74%, respectively. Analysis of the gut microbiome using REFS identified a set of bacterial taxa that can be used to predict the ASD status of children in three distinct cohorts with AUC over 80% for the best-performing classifiers. Our results indicate that the gut microbiome has a strong association with ASD and should not be disregarded as a potential target for therapeutic interventions. 
Furthermore, our work can contribute to use the proposed approach for identifying microbiome signatures across other 16S rRNA gene sequencing datasets.
Affiliation(s)
- Lucia N Peralta-Marzal
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
- David Rojas-Velazquez
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
  - Department of Data Science, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
- Douwe Rigters
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
- Naika Prince
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
- Johan Garssen
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
  - Global Centre of Excellence Immunology, Danone Nutricia Research, Utrecht, The Netherlands
- Aletta D Kraneveld
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
  - Department of Neuroscience, Faculty of Science, VU University, Amsterdam, The Netherlands
- Paula Perez-Pardo
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
- Alejandro Lopez-Rincon
  - Division of Pharmacology, Faculty of Science, Utrecht Institute for Pharmaceutical Sciences, University of Utrecht, Utrecht, The Netherlands
  - Department of Data Science, Julius Center for Health Sciences and Primary Care, University Medical Center Utrecht, Utrecht, The Netherlands
6
Alshawaqfeh M, Rababah S, Hayajneh A, Gharaibeh A, Serpedin E. MetaAnalyst: a user-friendly tool for metagenomic biomarker detection and phenotype classification. BMC Med Res Methodol 2022; 22:336. [PMID: 36577938 PMCID: PMC9795700 DOI: 10.1186/s12874-022-01812-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2022] [Accepted: 11/28/2022] [Indexed: 12/29/2022]
Abstract
BACKGROUND Many metagenomic studies have linked the imbalance in microbial abundance profiles to a wide range of diseases. These studies suggest utilizing the microbial abundance profiles as potential markers for metagenomic-associated conditions. Due to the importance of biomarkers in understanding disease progression and developing possible therapies, various computational tools have been proposed for metagenomic biomarker detection. However, most existing tools require prior scripting knowledge and lack user-friendly interfaces, so installing, configuring, and running them takes considerable time and effort. Besides, there is no available all-in-one solution for running and comparing various metagenomic biomarker detection tools simultaneously. In addition, most of these tools just present the suggested biomarkers without any statistical evaluation of their quality. RESULTS To overcome these limitations, this work presents MetaAnalyst, a software package with a simple graphical user interface (GUI) that (i) automates the installation and configuration of 28 state-of-the-art tools, (ii) supports flexible study design to enable studying the dataset under different scenarios smoothly, (iii) runs and evaluates several algorithms simultaneously, (iv) supports different input formats and provides the user with several preprocessing capabilities, (v) provides a variety of metrics to evaluate the quality of the suggested markers, and (vi) presents the outcomes in the form of publication-quality plots with various formatting capabilities as well as Excel sheets. CONCLUSIONS The utility of this tool has been verified through studying a metagenomic dataset under four scenarios. The executable file for MetaAnalyst along with its user manual are made available at https://github.com/mshawaqfeh/MetaAnalyst .
Affiliation(s)
- Mustafa Alshawaqfeh
  - School of Electrical Engineering and Information Technology, German Jordanian University, Amman, Jordan
- Salahelden Rababah
  - School of Electrical Engineering and Information Technology, German Jordanian University, Amman, Jordan
  - Department of Systems Science and Industrial Engineering, State University of New York at Binghamton, Binghamton, NY, USA
- Abdullah Hayajneh
  - Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, USA
- Ammar Gharaibeh
  - School of Electrical Engineering and Information Technology, German Jordanian University, Amman, Jordan
- Erchin Serpedin
  - Electrical and Computer Engineering Department, Texas A&M University, College Station, TX, USA
7
Correa-Garcia S, Constant P, Yergeau E. The forecasting power of the microbiome. Trends Microbiol 2022; 31:444-452. [PMID: 36549949 DOI: 10.1016/j.tim.2022.11.013] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 05/12/2022] [Revised: 11/25/2022] [Accepted: 11/30/2022] [Indexed: 12/24/2022]
Abstract
Microorganisms are informative biological integrators of past and present environmental abiotic and biotic conditions. At the same time, they are directly involved in ecosystem processes. Unfortunately, the complexity of microbial communities has so far resulted in most studies being descriptive. Here, we suggest that signals in the microbiome data can be used to forecast future ecosystem processes. The combination of omics with various statistical learning approaches, selected based on accuracy-interpretability and bias-variance trade-offs, will be key to attain this goal, as exemplified by recent studies. The time is ripe for microbial ecologists to fully exploit the forecasting power of microbiomes.
Affiliation(s)
- Sara Correa-Garcia
  - Institut national de la recherche scientifique, Centre Armand-Frappier Santé Biotechnologie, 531 boulevard des Prairies, Laval, Québec H7V 1B7, Canada
- Philippe Constant
  - Institut national de la recherche scientifique, Centre Armand-Frappier Santé Biotechnologie, 531 boulevard des Prairies, Laval, Québec H7V 1B7, Canada
- Etienne Yergeau
  - Institut national de la recherche scientifique, Centre Armand-Frappier Santé Biotechnologie, 531 boulevard des Prairies, Laval, Québec H7V 1B7, Canada
8
Loganathan T, Priya Doss C G. The influence of machine learning technologies in gut microbiome research and cancer studies - A review. Life Sci 2022; 311:121118. [DOI: 10.1016/j.lfs.2022.121118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 10/19/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
9
Li P, Luo H, Ji B, Nielsen J. Machine learning for data integration in human gut microbiome. Microb Cell Fact 2022; 21:241. [PMID: 36419034 PMCID: PMC9685977 DOI: 10.1186/s12934-022-01973-4] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 01/05/2022] [Accepted: 11/15/2022] [Indexed: 11/25/2022]
Abstract
Recent studies have demonstrated that gut microbiota plays critical roles in various human diseases. High-throughput technology has been widely applied to characterize the microbial ecosystems, which led to an explosion of different types of molecular profiling data, such as metagenomics, metatranscriptomics and metabolomics. For analysis of such data, machine learning algorithms have shown to be useful for identifying key molecular signatures, discovering potential patient stratifications, and particularly for generating models that can accurately predict phenotypes. In this review, we first discuss how dysbiosis of the intestinal microbiota is linked to human disease development and how potential modulation strategies of the gut microbial ecosystem can be used for disease treatment. In addition, we introduce categories and workflows of different machine learning approaches, and how they can be used to perform integrative analysis of multi-omics data. Finally, we review advances of machine learning in gut microbiome applications and discuss related challenges. Based on this we conclude that machine learning is very well suited for analysis of gut microbiome and that these approaches can be useful for development of gut microbe-targeted therapies, which ultimately can help in achieving personalized and precision medicine.
Affiliation(s)
- Peishun Li
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Hao Luo
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
- Boyang Ji
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - BioInnovation Institute, Ole Maaløes Vej 3, DK2200 Copenhagen, Denmark
- Jens Nielsen
  - Department of Biology and Biological Engineering, Chalmers University of Technology, Gothenburg, Sweden
  - BioInnovation Institute, Ole Maaløes Vej 3, DK2200 Copenhagen, Denmark
10
Hernández Medina R, Kutuzova S, Nielsen KN, Johansen J, Hansen LH, Nielsen M, Rasmussen S. Machine learning and deep learning applications in microbiome research. ISME Communications 2022; 2:98. [PMID: 37938690 PMCID: PMC9723725 DOI: 10.1038/s43705-022-00182-9] [Citation(s) in RCA: 105] [Impact Index Per Article: 35.0] [Received: 03/24/2022] [Revised: 09/12/2022] [Accepted: 09/16/2022] [Indexed: 05/27/2023]
Abstract
The many microbial communities around us form interactive and dynamic ecosystems called microbiomes. Though concealed from the naked eye, microbiomes govern and influence macroscopic systems including human health, plant resilience, and biogeochemical cycling. Such feats have attracted interest from the scientific community, which has recently turned to machine learning and deep learning methods to interrogate the microbiome and elucidate the relationships between its composition and function. Here, we provide an overview of how the latest microbiome studies harness the inductive prowess of artificial intelligence methods. We start by highlighting that microbiome data - being compositional, sparse, and high-dimensional - necessitates special treatment. We then introduce traditional and novel methods and discuss their strengths and applications. Finally, we discuss the outlook of machine and deep learning pipelines, focusing on bottlenecks and considerations to address them.
Affiliation(s)
- Ricardo Hernández Medina
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Svetlana Kutuzova
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
  - Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
- Knud Nor Nielsen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
  - Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
- Joachim Johansen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Lars Hestbjerg Hansen
  - Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
- Mads Nielsen
  - Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
- Simon Rasmussen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
11
Bakir-Gungor B, Hacılar H, Jabeer A, Nalbantoglu OU, Aran O, Yousef M. Inflammatory bowel disease biomarkers of human gut microbiota selected via different feature selection methods. PeerJ 2022; 10:e13205. [PMID: 35497193 PMCID: PMC9048649 DOI: 10.7717/peerj.13205] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 02/03/2021] [Accepted: 03/10/2022] [Indexed: 01/12/2023]
Abstract
The tremendous boost in next generation sequencing and in the "omics" technologies makes it possible to characterize the human gut microbiome-the collective genomes of the microbial community that reside in our gastrointestinal tract. Although some of these microorganisms are considered to be essential regulators of our immune system, the alteration of the complexity and eubiotic state of microbiota might promote autoimmune and inflammatory disorders such as diabetes, rheumatoid arthritis, Inflammatory bowel diseases (IBD), obesity, and carcinogenesis. IBD, comprising Crohn's disease and ulcerative colitis, is a gut-related, multifactorial disease with an unknown etiology. IBD presents defects in the detection and control of the gut microbiota, associated with unbalanced immune reactions, genetic mutations that confer susceptibility to the disease, and complex environmental conditions such as westernized lifestyle. Although some existing studies attempt to unveil the composition and functional capacity of the gut microbiome in relation to IBD diseases, a comprehensive picture of the gut microbiome in IBD patients is far from being complete. Due to the complexity of metagenomic studies, the applications of the state-of-the-art machine learning techniques became popular to address a wide range of questions in the field of metagenomic data analysis. In this regard, using IBD associated metagenomics dataset, this study utilizes both supervised and unsupervised machine learning algorithms, (i) to generate a classification model that aids IBD diagnosis, (ii) to discover IBD-associated biomarkers, (iii) to discover subgroups of IBD patients using k-means and hierarchical clustering approaches. 
To deal with the high dimensionality of features, we applied robust feature selection algorithms such as Conditional Mutual Information Maximization (CMIM), Fast Correlation Based Filter (FCBF), min redundancy max relevance (mRMR), Select K Best (SKB), Information Gain (IG) and Extreme Gradient Boosting (XGBoost). In our experiments with 100-fold Monte Carlo cross-validation (MCCV), XGBoost, IG, and SKB methods showed a considerable effect in terms of minimizing the microbiota used for the diagnosis of IBD and thus reducing the cost and time. We observed that compared to Decision Tree, Support Vector Machine, Logitboost, Adaboost, and stacking ensemble classifiers, our Random Forest classifier resulted in better performance measures for the classification of IBD. Our findings revealed potential microbiome-mediated mechanisms of IBD and these findings might be useful for the development of microbiome-based diagnostics.
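Monte Carlo cross-validation, as mentioned in this abstract, repeatedly draws a fresh random train/test partition instead of rotating through fixed folds. The sketch below illustrates only the splitting step under stated assumptions: the function name `monte_carlo_splits`, the 20% test fraction, and the fixed seed are hypothetical choices, since the abstract gives only the number of repeats (100).

```python
import random


def monte_carlo_splits(n_samples, n_repeats=100, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random splits.

    test_fraction and seed are illustrative assumptions, not values
    reported in the abstract.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]      # copy so every repeat starts from the same order
        rng.shuffle(shuffled)      # independent random permutation per repeat
        test = sorted(shuffled[:n_test])
        train = sorted(shuffled[n_test:])
        yield train, test


if __name__ == "__main__":
    for i, (train, test) in enumerate(monte_carlo_splits(10, n_repeats=3)):
        print(i, train, test)
```

Unlike k-fold cross-validation, a given sample may appear in the test set of several repeats or of none, so performance is averaged over the repeats rather than over a partition of the data.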
Affiliation(s)
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Hilal Hacılar
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Amhar Jabeer
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Oya Aran
  - TETAM, Bogazici University, Istanbul, Turkey
- Malik Yousef
  - Zefat Academic College, Zefat, Israel
  - Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
12
|
Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022; 18:e1010066. [PMID: 35446845 PMCID: PMC9064115 DOI: 10.1371/journal.pcbi.1010066] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 10/13/2021] [Revised: 05/03/2022] [Accepted: 03/29/2022] [Indexed: 12/14/2022]
Abstract
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers typically take operational taxonomic units or relative abundance profiles as input features. Such data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also raises the question of whether it is the presence rather than the abundance of particular taxa that is relevant for discrimination purposes, an aspect that has so far been overlooked in the literature. In this paper, we aim to fill this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa that is important when building classification models. These findings are robust to the choice of classifier and are confirmed by statistical tests for identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples from 30 public case-control studies.
The composition of the human microbiome has been linked to a large number of different diseases. In this context, classification methodologies based on machine learning have been a promising tool for diagnostic purposes from metagenomics data. The link between microbial population composition and host phenotypes has usually been studied using taxonomic profiles represented by relative abundances of microbial species. In this study, we show that it is the presence rather than the relative abundance of microbial taxa that matters most for maximizing classification accuracy. This is accomplished by conducting a meta-analysis on more than 4,000 shotgun metagenomes from 25 case-control studies, in which the original relative abundance data are degraded to presence/absence profiles. The findings are also extended to 16S rRNA data and advance the research field in building prediction models directly from human microbiome data.
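As a rough illustration of what degrading relative abundances to presence/absence profiles can look like, the sketch below binarizes a single sample. The function name, the taxon names, the abundance values, and the default threshold of zero are all illustrative assumptions, not details taken from the study.

```python
def to_presence_absence(profile, threshold=0.0):
    """Map each taxon's relative abundance to 1 (present) or 0 (absent).

    `threshold` is a hypothetical detection cutoff; with the default,
    any non-zero abundance counts as present.
    """
    return {taxon: int(abundance > threshold)
            for taxon, abundance in profile.items()}

# Made-up example sample (relative abundances).
sample = {
    "Bacteroides_fragilis": 0.42,
    "Escherichia_coli": 0.0,
    "Faecalibacterium_prausnitzii": 0.0003,
}
binary = to_presence_absence(sample)
```

The resulting 0/1 vector can be fed to the same classifier as the original abundance profile, which is what makes the two representations directly comparable.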
13
Chen X, Liu L, Zhang W, Yang J, Wong KC. Human host status inference from temporal microbiome changes via recurrent neural networks. Brief Bioinform 2021; 22:6307015. [PMID: 34151933 DOI: 10.1093/bib/bbab223] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 03/24/2021] [Revised: 04/21/2021] [Accepted: 04/21/2021] [Indexed: 01/04/2023]
Abstract
With the rapid increase in sequencing data, inferring human host status (e.g. healthy or sick) from microbiome data has become an important issue. Existing studies are mostly based on single-point microbiome composition, and host status is rarely predicted from longitudinal microbiome data. However, single-point-based methods cannot capture the dynamic patterns between temporal changes and host status. It therefore remains challenging to build good predictive models and to scale them to different microbiome contexts. Moreover, existing methods mainly target disease prediction and seldom investigate other host statuses. To fill this gap, we propose a comprehensive deep learning-based framework that uses longitudinal microbiome data as input to infer the human host status. Specifically, the framework is composed of specific data preparation strategies and a recurrent neural network tailored for longitudinal microbiome data. In experiments, we evaluated the proposed method on both semi-synthetic and real datasets based on different sequencing technologies and metagenomic contexts. The results indicate that our method achieves robust performance compared to other baseline and state-of-the-art classifiers and provides a significant reduction in prediction time.
Affiliation(s)
- Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Lingjing Liu
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Weitong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Jianyi Yang
- School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
14
Jasner Y, Belogolovski A, Ben-Itzhak M, Koren O, Louzoun Y. Microbiome Preprocessing Machine Learning Pipeline. Front Immunol 2021; 12:677870. [PMID: 34220823 PMCID: PMC8250139 DOI: 10.3389/fimmu.2021.677870] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 03/08/2021] [Accepted: 05/07/2021] [Indexed: 11/13/2022]
Abstract
Background: 16S sequencing results are often used for Machine Learning (ML) tasks. 16S gene sequences are represented as feature counts, which are associated with taxonomic representation. Raw feature counts may not be the optimal representation for ML. Methods: We checked multiple preprocessing steps and tested the optimal combination for 16S sequencing-based classification tasks. We computed the contribution of each step to the accuracy as measured by the Area Under Curve (AUC) of the classification. Results: We show that the log of the feature counts is much more informative than the relative counts. We further show that merging features associated with the same taxonomy at a given level, through a dimension reduction step for each group of bacteria, improves the AUC. Finally, we show that z-scoring has a very limited effect on the results. Conclusions: The preprocessing of microbiome 16S data is crucial for optimal microbiome-based Machine Learning. These preprocessing steps are integrated into the MIPMLP (Microbiome Preprocessing Machine Learning Pipeline), which is available as a stand-alone version at https://github.com/louzounlab/microbiome/tree/master/Preprocess or as a service at http://mip-mlp.math.biu.ac.il/Home. Both contain the code and standard test sets.
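The two preprocessing ideas in this abstract, merging features that share a taxonomy at a given level and log-transforming the counts, can be sketched as follows. This is not the MIPMLP code: the semicolon-separated lineage format, the use of a plain sum as the merge rule (the abstract describes a dimension reduction step per group), and the pseudocount of 1 are all assumptions made for illustration.

```python
import math
from collections import defaultdict

def merge_by_taxonomy(counts, level):
    """Sum counts of features whose lineage agrees on the first `level` ranks.

    Summation stands in for the per-group dimension reduction step; it is
    an assumed, simplified merge rule.
    """
    merged = defaultdict(float)
    for lineage, count in counts.items():
        key = ";".join(lineage.split(";")[:level])
        merged[key] += count
    return dict(merged)

def log_transform(counts, pseudocount=1.0):
    """Apply log10(count + pseudocount); the pseudocount keeps zeros finite."""
    return {taxon: math.log10(count + pseudocount)
            for taxon, count in counts.items()}

# Made-up feature counts keyed by hypothetical lineage strings.
raw = {
    "k__Bacteria;p__Firmicutes;g__A": 9,
    "k__Bacteria;p__Firmicutes;g__B": 90,
    "k__Bacteria;p__Bacteroidetes;g__C": 0,
}
phylum_counts = merge_by_taxonomy(raw, level=2)
phylum_log = log_transform(phylum_counts)
```

Merging before the log transform keeps the merged value interpretable as a total count for the group; the reverse order would sum logarithms instead.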
Affiliation(s)
- Yoel Jasner
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
- Omry Koren
- Azrieli Faculty of Medicine, Bar-Ilan University, Ramat Gan, Israel
- Yoram Louzoun
- Department of Mathematics, Bar-Ilan University, Ramat Gan, Israel
15
Anyaso-Samuel S, Sachdeva A, Guha S, Datta S. Metagenomic Geolocation Prediction Using an Adaptive Ensemble Classifier. Front Genet 2021; 12:642282. [PMID: 33959149 PMCID: PMC8093763 DOI: 10.3389/fgene.2021.642282] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 12/15/2020] [Accepted: 03/18/2021] [Indexed: 11/13/2022]
Abstract
Microbiome samples harvested from urban environments can be informative in predicting the geographic location of unknown samples. The idea that different cities may have geographically disparate microbial signatures can be used to predict the geographical location based on city-specific microbiome samples. We implemented this idea by first using standard bioinformatics procedures to pre-process the raw metagenomics samples provided by the CAMDA organizers. We trained several component classifiers and a robust ensemble classifier with data generated from taxonomy-dependent and taxonomy-free approaches. We also implemented class weighting and an optimal oversampling technique to overcome the class imbalance in the primary data. In each instance, we observed that the component classifiers performed differently, whereas the ensemble classifier consistently yielded optimal performance. Finally, we predicted the source cities of mystery samples provided by the organizers. Our results highlight the unreliability of relying on a single classification algorithm to assign metagenomic samples to their source origins. By combining several component classifiers via the ensemble approach, we obtained classification results that were as good as the best-performing component classifier.
Affiliation(s)
- Samuel Anyaso-Samuel
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
- Archie Sachdeva
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
- Subharup Guha
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
- Somnath Datta
- Department of Biostatistics, University of Florida, Gainesville, FL, United States
16
Zeng T, Yu X, Chen Z. Applying artificial intelligence in the microbiome for gastrointestinal diseases: A review. J Gastroenterol Hepatol 2021; 36:832-840. [PMID: 33880762 DOI: 10.1111/jgh.15503] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Received: 01/09/2021] [Revised: 03/18/2021] [Accepted: 03/18/2021] [Indexed: 12/20/2022]
Abstract
Gut bacteria have long been recognized for their important roles in the occurrence and progression of gastrointestinal diseases such as colorectal cancer, and the ever-increasing amounts of microbiome data, combined with other high-quality clinical and imaging datasets, are leading the study of gastrointestinal diseases into an era of biomedical big data. The "omics" technologies used for microbiome analysis continuously evolve, and machine learning and artificial intelligence technologies are key to extracting the relevant information from microbiome data. This review provides a focused summary of recent research and applications of microbiome big data and discusses the use of artificial intelligence to combat gastrointestinal diseases.
Affiliation(s)
- Tao Zeng
- CAS Key Laboratory of Computational Biology, Bio-Med Big Data Center, Shanghai Institute of Nutrition and Health, University of Chinese Academy of Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Systems Biology, Shanghai Institute of Biochemistry and Cell Biology, Center for Excellence in Molecular Cell Science, Chinese Academy of Sciences, Shanghai, China
- Xiangtian Yu
- Clinical Research Center, Shanghai Jiao Tong University Affiliated Sixth People's Hospital, Shanghai, China
- Zhangran Chen
- Institute for Microbial Ecology, School of Medicine, Xiamen University, Xiamen, China
17
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920 PMCID: PMC7962872 DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 159] [Impact Index Per Article: 39.8] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022]
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
- Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
- Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
- Magali Berland
- Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
- Aleksandra Gruca
- Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
- Jasminka Hasic
- University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
- Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
- Mikhail Kolev
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
- Marta B. Lopes
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
- Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
- Victor Moreno
- Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Barcelona, Spain
- Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
- Irina Naskinova
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
- Elin Org
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Inês Paciência
- EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
- Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
- Blaz Stres
- Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
- Baiba Vilne
- Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
- Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
- Eftim Zdravevski
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
- Marcus J. Claesson
- School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
- Isabel Moreno-Indias
- Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
- Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
- Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
18
Iadanza E, Fabbri R, Bašić-ČiČak D, Amedei A, Telalovic JH. Gut microbiota and artificial intelligence approaches: A scoping review. HEALTH AND TECHNOLOGY 2020; 10:1343-1358. [DOI: 10.1007/s12553-020-00486-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 09/25/2020] [Accepted: 10/01/2020] [Indexed: 12/19/2022]
Abstract
This article aims to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases. The association between microbiota and diseases, together with its clinical relevance, is still difficult to interpret. Advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians process and interpret these massive data sets. Two research groups were involved in this scoping review, working in two different areas of Europe: Florence and Sarajevo. The papers included in the review describe the use of ML or DL methods applied to the study of human gut microbiota. In total, 1109 papers were considered in this study. After elimination, a final set of 16 articles was included in the scoping review. Different AI techniques were applied in the reviewed papers. Eleven papers evaluated only ML algorithms (ranging from one to eight algorithms applied to one dataset). The remaining five papers examined both ML and DL algorithms. The most frequently applied ML algorithm was Random Forest, which also exhibited the best performance.
19
Beyene SS, Ling T, Ristevski B, Chen M. A novel riboswitch classification based on imbalanced sequences achieved by machine learning. PLoS Comput Biol 2020; 16:e1007760. [PMID: 32687488 PMCID: PMC7392346 DOI: 10.1371/journal.pcbi.1007760] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 02/26/2020] [Revised: 07/30/2020] [Accepted: 05/13/2020] [Indexed: 11/24/2022]
Abstract
Riboswitch, a part of regulatory mRNA (50-250 nt in length), has two main classes: aptamer and expression platform. One of the main challenges in riboswitch classification is imbalanced data, a circumstance in which the number of sequences in one group is very small compared to the others. Such circumstances lead a classifier to ignore the minority group and emphasize the majority ones, which results in a skewed classification. We considered sixteen riboswitch families that contain imbalanced sequences, in accord with recent riboswitch classification work. The sequences were split into training and test sets using a newly developed pipeline. From the 5460 k-mers (k from 1 to 6) produced, 156 features were selected using the CfsSubsetEval and BestFirst functions in WEKA 3.8. Statistical tests showed a significant difference between balanced and imbalanced sequences (p < 0.05). In addition, each algorithm showed a significant difference in sensitivity, specificity, accuracy, and macro F-score between the two groups (p < 0.05). Several k-mers clustered in the heat map were found to correspond to motifs at different positions, such as interior loops, terminal loops, and helices. They were validated to have a biological function, and some are riboswitch motifs. The analysis highlights the importance of addressing majority bias and overfitting. The presented results are a generalized evaluation of both balanced and imbalanced models, which implies their ability to classify novel riboswitches. The Python source code is available at https://github.com/Seasonsling/riboswitch.
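The figure of 5460 k-mers follows from a four-letter alphabet: 4 + 16 + 64 + 256 + 1024 + 4096 = 5460 for k from 1 to 6. The sketch below enumerates that feature space and counts overlapping k-mer occurrences in one sequence. It is not the authors' pipeline; the RNA alphabet `ACGU`, the function names, and the choice to count overlapping windows are assumptions.

```python
from itertools import product

ALPHABET = "ACGU"  # assumed RNA alphabet; the source data may use T instead of U

def all_kmers(max_k=6):
    """Enumerate every k-mer over ALPHABET for k = 1..max_k."""
    return ["".join(letters)
            for k in range(1, max_k + 1)
            for letters in product(ALPHABET, repeat=k)]

def kmer_frequencies(sequence, max_k=6):
    """Count overlapping occurrences of each k-mer in `sequence`."""
    counts = {kmer: 0 for kmer in all_kmers(max_k)}
    for k in range(1, max_k + 1):
        for start in range(len(sequence) - k + 1):
            window = sequence[start:start + k]
            if window in counts:  # skips windows with ambiguous bases such as N
                counts[window] += 1
    return counts

features = kmer_frequencies("ACGUAC")
```

Each sequence then becomes a fixed-length vector of 5460 counts, which is the kind of representation a feature selection step can reduce to a smaller subset.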
Affiliation(s)
- Solomon Shiferaw Beyene
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- Tianyi Ling
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
- School of Medicine, Zhejiang University, Hangzhou, Zhejiang, China
- Blagoj Ristevski
- Faculty of Information and Communication Technologies, Bitola, St. Kliment Ohridski University Bitola, ul. Partizanska Bitola, Republic of North Macedonia
- Ming Chen
- Department of Bioinformatics, College of Life Sciences, Zhejiang University, Hangzhou, China
20
Combination of Ensembles of Regularized Regression Models with Resampling-Based Lasso Feature Selection in High Dimensional Data. MATHEMATICS 2020. [DOI: 10.3390/math8010110] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Indexed: 01/29/2023]
Abstract
In high-dimensional data, the performance of a classifier depends largely on the selection of important features. Most individual classifiers combined with existing feature selection (FS) methods do not perform well for highly correlated data. Obtaining important features using an FS method and selecting the best-performing classifier is a challenging task in high-throughput data. In this article, we propose a combination of resampling-based least absolute shrinkage and selection operator (LASSO) feature selection (RLFS) and ensembles of regularized regression models (ERRM) capable of handling data with high correlation structures. The ERRM boosts the prediction accuracy with the top-ranked features obtained from RLFS. The RLFS uses the lasso penalty with the sure independence screening (SIS) condition to select the top k ranked features. The ERRM includes five individual penalty-based classifiers: LASSO, adaptive LASSO (ALASSO), elastic net (ENET), smoothly clipped absolute deviation (SCAD), and minimax concave penalty (MCP). It is built on the idea of bagging and rank aggregation. Through simulation studies and an application to smokers' cancer gene expression data, we demonstrated that the proposed combination of ERRM with RLFS achieved superior performance in terms of accuracy and geometric mean.
21
Abbas M, Matta J, Le T, Bensmail H, Obafemi-Ajayi T, Honavar V, EL-Manzalawy Y. Biomarker discovery in inflammatory bowel diseases using network-based feature selection. PLoS One 2019; 14:e0225382. [PMID: 31756219 PMCID: PMC6874333 DOI: 10.1371/journal.pone.0225382] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/19/2019] [Accepted: 11/04/2019] [Indexed: 12/20/2022]
Abstract
Reliable identification of inflammatory biomarkers from metagenomics data is a promising direction for developing non-invasive, cost-effective, and rapid clinical tests for early diagnosis of IBD. We present an integrative approach to Network-Based Biomarker Discovery (NBBD) that integrates network analysis methods for prioritizing potential biomarkers and machine learning techniques for assessing the discriminative power of the prioritized biomarkers. Using a large dataset of new-onset pediatric IBD metagenomics biopsy samples, we compare the performance of Random Forest (RF) classifiers trained on features selected using a representative set of traditional feature selection methods against the NBBD framework, configured using five different tools for inferring networks from metagenomics data and nine different methods for prioritizing biomarkers, as well as a hybrid approach combining the best traditional and NBBD-based feature selection. We also examine how the performance of the predictive models for IBD diagnosis varies as a function of the size of the data used for biomarker identification. Our results show that (i) NBBD is competitive with some of the state-of-the-art feature selection methods, including Random Forest Feature Importance (RFFI) scores; and (ii) NBBD is especially effective in reliably identifying IBD biomarkers when the number of data samples available for biomarker discovery is small.
Affiliation(s)
- Mostafa Abbas
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- John Matta
- Department of Computer Science, Southern Illinois University Edwardsville, Edwardsville, IL, United States of America
- Thanh Le
- Engineering Program, Missouri State University, Springfield, MO, United States of America
- Halima Bensmail
- Qatar Computing Research Institute, Hamad Bin Khalifa University, Doha, Qatar
- Tayo Obafemi-Ajayi
- Engineering Program, Missouri State University, Springfield, MO, United States of America
- Vasant Honavar
- College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, United States of America
- Yasser EL-Manzalawy
- College of Information Sciences and Technology, Pennsylvania State University, University Park, PA, United States of America
- Geisinger Health System, Danville, PA, United States of America
22
LaPierre N, Ju CJT, Zhou G, Wang W. MetaPheno: A critical evaluation of deep learning and machine learning in metagenome-based disease prediction. Methods 2019; 166:74-82. [PMID: 30885720 PMCID: PMC6708502 DOI: 10.1016/j.ymeth.2019.03.003] [Citation(s) in RCA: 59] [Impact Index Per Article: 9.8] [Received: 12/01/2018] [Revised: 02/14/2019] [Accepted: 03/04/2019] [Indexed: 01/21/2023]
Abstract
The human microbiome plays a number of critical roles, impacting almost every aspect of human health and well-being. Conditions in the microbiome have been linked to a number of significant diseases. Additionally, revolutions in sequencing technology have led to a rapid increase in publicly available sequencing data. Consequently, there have been growing efforts to predict disease status from metagenomic sequencing data, with a proliferation of new approaches in the last few years. Some of these efforts have explored utilizing a powerful form of machine learning called deep learning, which has been applied successfully in several biological domains. Here, we review some of these methods and the algorithms that they are based on, with a particular focus on deep learning methods. We also perform a deeper analysis of Type 2 Diabetes and obesity datasets that have eluded improved results, using a variety of machine learning and feature extraction methods. We conclude by offering perspectives on study design considerations that may impact results and future directions the field can take to improve results and offer more valuable conclusions. The scripts and extracted features for the analyses conducted in this paper are available via GitHub: https://github.com/nlapier2/metapheno.
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Chelsea J-T Ju
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Guangyu Zhou
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
- Wei Wang
- Department of Computer Science, University of California at Los Angeles, Los Angeles, CA 90095, USA
23
Zhou YH, Gallins P. A Review and Tutorial of Machine Learning Methods for Microbiome Host Trait Prediction. Front Genet 2019; 10:579. [PMID: 31293616 PMCID: PMC6603228 DOI: 10.3389/fgene.2019.00579] [Citation(s) in RCA: 105] [Impact Index Per Article: 17.5] [Received: 01/13/2019] [Accepted: 06/04/2019] [Indexed: 12/19/2022]
Abstract
With the growing importance of microbiome research, there is increasing evidence that host variation in microbial communities is associated with overall host health. Advancement in genetic sequencing methods for microbiomes has coincided with improvements in machine learning, with important implications for disease risk prediction in humans. One aspect specific to microbiome prediction is the use of taxonomy-informed feature selection. In this review for non-experts, we explore the most commonly used machine learning methods, and evaluate their prediction accuracy as applied to microbiome host trait prediction. Methods are described at an introductory level, and R/Python code for the analyses is provided.
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Sciences, North Carolina State University, Raleigh, NC, United States
- Paul Gallins
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC, United States
24
Two-Stage Classification with SIS Using a New Filter Ranking Method in High Throughput Data. MATHEMATICS 2019. [DOI: 10.3390/math7060493] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/18/2022]
Abstract
Over the last decade, high-dimensional data have received considerable attention in bioinformatics. These data increase the likelihood of detecting the most promising novel information. However, they bring limitations in high-performance computing and overfitting issues. To overcome these issues, alternative strategies need to be explored for detecting the truly important features. A two-stage approach, consisting of filtering and variable selection steps, has been receiving attention. Filtering methods fall into two categories: individual ranking and feature subset selection. The former ignores joint correlation among features, and the latter faces the computing time of an NP-hard problem. Therefore, we proposed a new filter ranking method (PF) using the elastic net penalty with sure independence screening (SIS), based on a resampling technique, to overcome these issues. Through extensive simulation studies, we demonstrated that SIS-LASSO, SIS-MCP, and SIS-SCAD with the proposed filtering method achieved superior performance not only in accuracy, AUROC, and geometric mean but also in true positive detection, compared to those with the marginal maximum likelihood ranking method (MMLR). In addition, we applied the method to colon and lung cancer gene expression data to investigate the classification performance and the power to detect true genes associated with colon and lung cancer.
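For orientation, the screening idea behind SIS can be sketched in its simplest correlation-ranking form: score each feature by the absolute marginal correlation with the response and keep the top few. This is not the proposed PF filter, which uses the elastic net penalty with resampling, nor the MMLR baseline; the function names and toy data are illustrative assumptions.

```python
import math

def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)

def sis_screen(features, response, keep):
    """Rank features by |marginal correlation| with the response; return the top `keep` names."""
    ranked = sorted(features,
                    key=lambda name: abs(pearson(features[name], response)),
                    reverse=True)
    return ranked[:keep]

# Made-up expression values for three hypothetical genes across four samples.
toy_features = {
    "g1": [1, 2, 3, 4],
    "g2": [1, -1, 1, -1],
    "g3": [4, 3, 2, 1.5],
}
toy_response = [0, 0, 1, 1]
selected = sis_screen(toy_features, toy_response, keep=2)
```

Because each feature is scored on its own, a screen of this kind ignores joint correlation among features, which is the limitation of individual ranking that the abstract points out.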

25
Maltez Thomas A, Prata Lima F, Maria Silva Moura L, Maria da Silva A, Dias-Neto E, Setubal JC. Comparative Metagenomics. Methods Mol Biol 2018; 1704:243-260. [PMID: 29277868 DOI: 10.1007/978-1-4939-7463-4_8] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 10/28/2022]
Abstract
Thanks in large part to newer, better, and cheaper DNA sequencing technologies, an enormous number of metagenomic sequence datasets have been and continue to be generated, covering a huge variety of environmental niches, including several different human body sites. Comparing these metagenomes and identifying their commonalities and differences is a challenging task, due not only to the large amounts of data, but also because there are several methodological considerations that need to be taken into account to ensure an appropriate and sound comparison between datasets. In this chapter, we describe current techniques aimed at comparing metagenomes generated by 16S ribosomal RNA and shotgun DNA sequencing, emphasizing methodological issues that arise in these comparative studies. We provide a detailed case study to illustrate some of these techniques using data from the Human Microbiome Project comparing the microbial communities from ten buccal mucosa samples with ten tongue dorsum samples in terms of alpha diversity, beta diversity, and their taxonomic and functional profiles.
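The alpha and beta diversity comparison mentioned in this abstract can be sketched with two standard measures: the Shannon index for within-sample (alpha) diversity and Bray-Curtis dissimilarity for between-sample (beta) diversity. The chapter does not state which indices its case study uses, so this choice is an assumption, and the count vectors below are invented for illustration rather than taken from the Human Microbiome Project.

```python
import math

def shannon(counts):
    """Shannon alpha diversity H = -sum(p_i * ln p_i) over taxa with nonzero counts."""
    total = sum(counts)
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)

def bray_curtis(a, b):
    """Bray-Curtis dissimilarity between two count vectors over the same taxa:
    sum of absolute differences divided by the sum of all counts."""
    return sum(abs(x - y) for x, y in zip(a, b)) / (sum(a) + sum(b))

# Made-up counts for four taxa in two hypothetical samples.
sample_a = [50, 30, 20, 0]
sample_b = [25, 25, 25, 25]

alpha_a = shannon(sample_a)            # lower: one taxon absent, uneven counts
alpha_b = shannon(sample_b)            # ln(4): four taxa, perfectly even
beta_ab = bray_curtis(sample_a, sample_b)

print(alpha_a < alpha_b, beta_ab)
```

Both functions assume the two samples are already expressed over the same ordered set of taxa; real comparisons also have to deal with uneven sequencing depth, which is one of the methodological issues the chapter emphasizes.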
Affiliation(s)
- Andrew Maltez Thomas
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, SP, Brazil; Medical Genomics Laboratory, CIPE/A.C. Camargo Cancer Center, São Paulo, SP, Brazil
- Felipe Prata Lima
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, SP, Brazil; Instituto Federal de Alagoas, Maceió, Alagoas, Brazil
- Livia Maria Silva Moura
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, SP, Brazil
- Aline Maria da Silva
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, São Paulo, SP, Brazil
- Emmanuel Dias-Neto
- Medical Genomics Laboratory, CIPE/A.C. Camargo Cancer Center, São Paulo, SP, Brazil; Lab. of Neurosciences (LIM-27) Alzira Denise Hertzog Silva, Institute of Psychiatry, Faculdade de Medicina, Universidade de São Paulo (USP), São Paulo, SP, Brazil
- João C Setubal
- Department of Biochemistry, Institute of Chemistry, University of São Paulo, Av. Prof. Lineu Prestes, 748 room 909, 05508-000, São Paulo, SP, Brazil

26
Oudah M, Henschel A. Taxonomy-aware feature engineering for microbiome classification. BMC Bioinformatics 2018; 19:227. [PMID: 29907097 PMCID: PMC6003080 DOI: 10.1186/s12859-018-2205-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Received: 03/12/2017] [Accepted: 05/15/2018] [Indexed: 12/17/2022]
Abstract
Background: What is a healthy microbiome? The pursuit of this and many related questions, especially in light of the recently recognized microbial component in a wide range of diseases has sparked a surge in metagenomic studies. They are often not simply attributable to a single pathogen but rather are the result of complex ecological processes. Relatedly, the increasing DNA sequencing depth and number of samples in metagenomic case-control studies enabled the applicability of powerful statistical methods, e.g. Machine Learning approaches. For the latter, the feature space is typically shaped by the relative abundances of operational taxonomic units, as determined by cost-effective phylogenetic marker gene profiles. While a substantial body of microbiome/microbiota research involves unsupervised and supervised Machine Learning, very little attention has been put on feature selection and engineering.
Results: We here propose the first algorithm to exploit phylogenetic hierarchy (i.e. an all-encompassing taxonomy) in feature engineering for microbiota classification. The rationale is to exploit the often mono- or oligophyletic distribution of relevant (but hidden) traits by virtue of taxonomic abstraction. The algorithm is embedded in a comprehensive microbiota classification pipeline, which we applied to a diverse range of datasets, distinguishing healthy from diseased microbiota samples.
Conclusion: We demonstrate substantial improvements over the state-of-the-art microbiota classification tools in terms of classification accuracy, regardless of the actual Machine Learning technique while using drastically reduced feature spaces. Moreover, generalized features bear great explanatory value: they provide a concise description of conditions and thus help to provide pathophysiological insights. Indeed, the automatically and reproducibly derived features are consistent with previously published domain expert analyses.
Electronic supplementary material: The online version of this article (10.1186/s12859-018-2205-3) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Mai Oudah
- Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates; New York University Abu Dhabi, Abu Dhabi, United Arab Emirates
- Andreas Henschel
- Khalifa University of Science and Technology, Abu Dhabi, United Arab Emirates

27
Ching T, Himmelstein DS, Beaulieu-Jones BK, Kalinin AA, Do BT, Way GP, Ferrero E, Agapow PM, Zietz M, Hoffman MM, Xie W, Rosen GL, Lengerich BJ, Israeli J, Lanchantin J, Woloszynek S, Carpenter AE, Shrikumar A, Xu J, Cofer EM, Lavender CA, Turaga SC, Alexandari AM, Lu Z, Harris DJ, DeCaprio D, Qi Y, Kundaje A, Peng Y, Wiley LK, Segler MHS, Boca SM, Swamidass SJ, Huang A, Gitter A, Greene CS. Opportunities and obstacles for deep learning in biology and medicine. J R Soc Interface 2018; 15:20170387. [PMID: 29618526 PMCID: PMC5938574 DOI: 10.1098/rsif.2017.0387] [Citation(s) in RCA: 905] [Impact Index Per Article: 129.3] [Received: 05/26/2017] [Accepted: 03/07/2018] [Indexed: 11/12/2022]
Abstract
Deep learning describes a class of machine learning algorithms that are capable of combining raw inputs into layers of intermediate features. These algorithms have recently shown impressive results across a variety of domains. Biology and medicine are data-rich disciplines, but the data are complex and often ill-understood. Hence, deep learning techniques may be particularly well suited to solve problems of these fields. We examine applications of deep learning to a variety of biomedical problems (patient classification, fundamental biological processes and treatment of patients) and discuss whether deep learning will be able to transform these tasks or if the biomedical sphere poses unique challenges. Following from an extensive literature review, we find that deep learning has yet to revolutionize biomedicine or definitively resolve any of the most pressing challenges in the field, but promising advances have been made on the prior state of the art. Even though improvements over previous baselines have been modest in general, the recent progress indicates that deep learning methods will provide valuable means for speeding up or aiding human investigation. Though progress has been made linking a specific neural network's prediction to input features, understanding how users should interpret these models to make testable hypotheses about the system under study remains an open challenge. Furthermore, the limited amount of labelled data for training presents problems in some domains, as do legal and privacy constraints on work with sensitive health records. Nonetheless, we foresee deep learning enabling changes at both bench and bedside with the potential to transform several areas of biology and medicine.
Affiliation(s)
- Travers Ching
- Molecular Biosciences and Bioengineering Graduate Program, University of Hawaii at Manoa, Honolulu, HI, USA
- Daniel S Himmelstein
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Brett K Beaulieu-Jones
- Genomics and Computational Biology Graduate Group, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Alexandr A Kalinin
- Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor, MI, USA
- Gregory P Way
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Enrico Ferrero
- Computational Biology and Stats, Target Sciences, GlaxoSmithKline, Stevenage, UK
- Michael Zietz
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Michael M Hoffman
- Princess Margaret Cancer Centre, Toronto, Ontario, Canada
- Department of Medical Biophysics, University of Toronto, Toronto, Ontario, Canada
- Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
- Wei Xie
- Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN, USA
- Gail L Rosen
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Benjamin J Lengerich
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA
- Johnny Israeli
- Biophysics Program, Stanford University, Stanford, CA, USA
- Jack Lanchantin
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
- Stephen Woloszynek
- Ecological and Evolutionary Signal-processing and Informatics Laboratory, Department of Electrical and Computer Engineering, Drexel University, Philadelphia, PA, USA
- Anne E Carpenter
- Imaging Platform, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Avanti Shrikumar
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Jinbo Xu
- Toyota Technological Institute at Chicago, Chicago, IL, USA
- Evan M Cofer
- Department of Computer Science, Trinity University, San Antonio, TX, USA
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ, USA
- Christopher A Lavender
- Integrative Bioinformatics, National Institute of Environmental Health Sciences, National Institutes of Health, Research Triangle Park, NC, USA
- Srinivas C Turaga
- Howard Hughes Medical Institute, Janelia Research Campus, Ashburn, VA, USA
- Amr M Alexandari
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Zhiyong Lu
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- David J Harris
- Department of Wildlife Ecology and Conservation, University of Florida, Gainesville, FL, USA
- Yanjun Qi
- Department of Computer Science, University of Virginia, Charlottesville, VA, USA
- Anshul Kundaje
- Department of Computer Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Yifan Peng
- National Center for Biotechnology Information and National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Laura K Wiley
- Division of Biomedical Informatics and Personalized Medicine, University of Colorado School of Medicine, Aurora, CO, USA
- Marwin H S Segler
- Institute of Organic Chemistry, Westfälische Wilhelms-Universität Münster, Münster, Germany
- Simina M Boca
- Innovation Center for Biomedical Informatics, Georgetown University Medical Center, Washington, DC, USA
- S Joshua Swamidass
- Department of Pathology and Immunology, Washington University in Saint Louis, St Louis, MO, USA
- Austin Huang
- Department of Medicine, Brown University, Providence, RI, USA
- Anthony Gitter
- Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI, USA
- Morgridge Institute for Research, Madison, WI, USA
- Casey S Greene
- Department of Systems Pharmacology and Translational Therapeutics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA

28
Huang BFF, Boutros PC. The parameter sensitivity of random forests. BMC Bioinformatics 2016; 17:331. [PMID: 27586051 PMCID: PMC5009551 DOI: 10.1186/s12859-016-1228-x] [Citation(s) in RCA: 47] [Impact Index Per Article: 5.2] [Received: 12/15/2015] [Accepted: 08/26/2016] [Indexed: 02/07/2023]
Abstract
Background: The Random Forest (RF) algorithm for supervised machine learning is an ensemble learning method widely used in science and many other fields. Its popularity has been increasing, but relatively few studies address the parameter selection process: a critical step in model fitting. Due to numerous assertions regarding the performance reliability of the default parameters, many RF models are fit using these values. However, there has not yet been a thorough examination of the parameter-sensitivity of RFs in computational genomic studies. We address this gap here.
Results: We examined the effects of parameter selection on classification performance using the RF machine learning algorithm on two biological datasets with distinct p/n ratios: sequencing summary statistics (low p/n) and microarray-derived data (high p/n). Here, p refers to the number of variables and n to the number of samples. Our findings demonstrate that parameterization is highly correlated with prediction accuracy and variable importance measures (VIMs). Further, we demonstrate that different parameters are critical in tuning different datasets, and that parameter-optimization significantly enhances upon the default parameters.
Conclusions: Parameter performance demonstrated wide variability on both low and high p/n data. Therefore, there is significant benefit to be gained by model tuning RFs away from their default parameter settings.
Electronic supplementary material: The online version of this article (doi:10.1186/s12859-016-1228-x) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Barbara F F Huang
- Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Canada
- Paul C Boutros
- Informatics and Bio-computing Program, Ontario Institute for Cancer Research, Toronto, Canada; Department of Medical Biophysics, University of Toronto, Toronto, Canada; Department of Pharmacology and Toxicology, University of Toronto, Toronto, Canada; MaRS Centre, 661 University Avenue, Suite 510, Toronto, Ontario, M5G 0A3, Canada

29
Pasolli E, Truong DT, Malik F, Waldron L, Segata N. Machine Learning Meta-analysis of Large Metagenomic Datasets: Tools and Biological Insights. PLoS Comput Biol 2016; 12:e1004977. [PMID: 27400279 PMCID: PMC4939962 DOI: 10.1371/journal.pcbi.1004977] [Citation(s) in RCA: 345] [Impact Index Per Article: 38.3] [Received: 11/25/2015] [Accepted: 05/11/2016] [Indexed: 12/12/2022]
Abstract
Shotgun metagenomic analysis of the human associated microbiome provides a rich set of microbial features for prediction and biomarker discovery in the context of human diseases and health conditions. However, the use of such high-resolution microbial features presents new challenges, and validated computational tools for learning tasks are lacking. Moreover, classification rules have scarcely been validated in independent studies, posing questions about the generality and generalization of disease-predictive models across cohorts. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and for quantitative assessment of the strength of potential microbiome-phenotype associations. We develop a computational framework for prediction tasks using quantitative microbiome profiles, including species-level relative abundances and presence of strain-specific markers. A comprehensive meta-analysis, with particular emphasis on generalization across cohorts, was performed in a collection of 2424 publicly available metagenomic samples from eight large-scale studies. Cross-validation revealed good disease-prediction capabilities, which were in general improved by feature selection and use of strain-specific markers instead of species-level taxonomic abundance. In cross-study analysis, models transferred between studies were in some cases less accurate than models tested by within-study cross-validation. Interestingly, the addition of healthy (control) samples from other studies to training sets improved disease prediction capabilities. Some microbial species (most notably Streptococcus anginosus) seem to characterize general dysbiotic states of the microbiome rather than connections with a specific disease. Our results in modelling features of the “healthy” microbiome can be considered a first step toward defining general microbial dysbiosis. 
The software framework, microbiome profiles, and metadata for thousands of samples are publicly available at http://segatalab.cibio.unitn.it/tools/metaml.
The human microbiome, the entire set of microbial organisms associated with the human host, interacts closely with host immune and metabolic functions and is crucial for human health. Significant advances in the characterization of the microbiome associated with healthy and diseased individuals have been obtained through next-generation DNA sequencing technologies, which permit accurate estimation of microbial communities directly from uncultured human-associated samples (e.g., stool). In particular, shotgun metagenomics provides data at unprecedented species- and strain-level resolution. Several large-scale metagenomic disease-associated datasets are also becoming available, and disease-predictive models built on metagenomic signatures have been proposed. However, the generalization of resulting prediction models on different cohorts and diseases has not been validated. In this paper, we comprehensively assess approaches to metagenomics-based prediction tasks and for quantitative assessment of microbiome-phenotype associations. We consider 2424 samples from eight studies and six different diseases to assess the independent prediction accuracy of models built on shotgun metagenomic data and to compare strategies for practical use of the microbiome as a prediction tool.
Affiliation(s)
- Edoardo Pasolli
- Centre for Integrative Biology, University of Trento, Trento, Italy
- Duy Tin Truong
- Centre for Integrative Biology, University of Trento, Trento, Italy
- Faizan Malik
- Graduate School of Public Health and Health Policy, City University of New York, New York, New York, United States of America
- Levi Waldron
- Graduate School of Public Health and Health Policy, City University of New York, New York, New York, United States of America
- Nicola Segata
- Centre for Integrative Biology, University of Trento, Trento, Italy