1
Yi X, He Y, Gao S, Li M. A review of the application of deep learning in obesity: From early prediction aid to advanced management assistance. Diabetes Metab Syndr 2024; 18:103000. [PMID: 38604060 DOI: 10.1016/j.dsx.2024.103000] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/08/2022] [Revised: 01/23/2024] [Accepted: 03/29/2024] [Indexed: 04/13/2024]
Abstract
BACKGROUND AND AIMS: Obesity is a chronic disease that can cause severe metabolic disorders. Machine learning (ML) techniques, especially deep learning (DL), have proven useful in obesity research. However, there is a dearth of systematic reviews of DL applications in obesity. This article aims to summarize the current trend of DL usage in obesity research. METHODS: An extensive literature review was carried out across multiple databases, including PubMed, Embase, Web of Science, Scopus, and Medline, to collate relevant studies published from January 2018 to September 2023. The focus was on research detailing the application of DL in the context of obesity. We distilled critical insights pertaining to the learning models used, encompassing their development, principal results, and foundational methodologies. RESULTS: Our analysis culminated in the synthesis of new knowledge regarding the application of DL in the context of obesity. In total, 40 research articles were included. These studies fall into three categories: obesity prediction (n = 16); obesity management (n = 13); and body fat estimation (n = 11). CONCLUSIONS: This is the first review to examine DL applications in obesity. It reveals DL's superiority in obesity prediction over traditional ML methods, showing promise for multi-omics research. DL also innovates in obesity management through diet, fitness, and environmental analyses. Additionally, DL improves body fat estimation, offering affordable and precise monitoring tools. The study is registered with PROSPERO (ID: CRD42023475159).
Affiliation(s)
- Xinghao Yi
  - Department of Endocrinology, NHC Key Laboratory of Endocrinology, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100730, China
- Yangzhige He
  - Department of Medical Research Center, Peking Union Medical College Hospital, Chinese Academy of Medical Science & Peking Union Medical College, Beijing 100730, China
- Shan Gao
  - Department of Endocrinology, Xuan Wu Hospital, Capital Medical University, Beijing 10053, China
- Ming Li
  - Department of Endocrinology, NHC Key Laboratory of Endocrinology, Peking Union Medical College Hospital, Peking Union Medical College and Chinese Academy of Medical Sciences, Beijing 100730, China
2
Roy G, Prifti E, Belda E, Zucker JD. Deep learning methods in metagenomics: a review. Microb Genom 2024; 10. [PMID: 38630611 DOI: 10.1099/mgen.0.001231] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/19/2024] Open
Abstract
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Affiliation(s)
- Gaspar Roy
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l'hopital, 75013 Paris, France
- Eugeni Belda
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l'hopital, 75013 Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l'hopital, 75013 Paris, France
3
Sharma D, Lou W, Xu W. phylaGAN: data augmentation through conditional GANs and autoencoders for improving disease prediction accuracy using microbiome data. Bioinformatics 2024; 40:btae161. [PMID: 38569898 DOI: 10.1093/bioinformatics/btae161] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Revised: 02/18/2024] [Accepted: 04/01/2024] [Indexed: 04/05/2024]
Abstract
MOTIVATION: Research is improving our understanding of how the microbiome interacts with the human body and its impact on human health. Existing machine learning methods have shown great potential in discriminating healthy from diseased microbiome states. However, machine learning-based prediction using microbiome data faces challenges such as small sample sizes, imbalance between cases and controls, and the high cost of collecting large numbers of samples. To address these challenges, we propose a deep learning framework, phylaGAN, to augment existing datasets with generated microbiome data using a combination of a conditional generative adversarial network (C-GAN) and an autoencoder. Conditional generative adversarial networks train two models against each other to compute larger simulated datasets that are representative of the original dataset. The autoencoder maps the original and the generated samples onto a common subspace to make the prediction more accurate. RESULTS: Extensive evaluation and predictive analysis were conducted on two datasets, a T2D study and a cirrhosis study, showing an improvement in mean AUC with data augmentation of 11% and 5%, respectively. External validation on a cohort with a smaller sample size, classifying between obese and lean subjects, provided an improvement in mean AUC close to 32% when augmented through phylaGAN compared with using the original cohort. Our findings not only indicate that generative adversarial networks can create samples that mimic the original data across various diversity metrics, but also highlight the potential of enhancing disease prediction through machine learning models trained on synthetic data. AVAILABILITY AND IMPLEMENTATION: https://github.com/divya031090/phylaGAN.
Affiliation(s)
- Divya Sharma
  - Biostatistics Department, Princess Margaret Cancer Center, University Health Network, Toronto, ON, M5G2C4, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, M5T3M7, Canada
- Wendy Lou
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, M5T3M7, Canada
- Wei Xu
  - Biostatistics Department, Princess Margaret Cancer Center, University Health Network, Toronto, ON, M5G2C4, Canada
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON, M5T3M7, Canada
4
Asher EE, Bashan A. Model-free prediction of microbiome compositions. Microbiome 2024; 12:17. [PMID: 38303006 PMCID: PMC10832217 DOI: 10.1186/s40168-023-01721-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/25/2022] [Accepted: 11/15/2023] [Indexed: 02/03/2024]
Abstract
BACKGROUND The recent recognition of the importance of the microbiome to the host's health and well-being has yielded efforts to develop therapies that aim to shift the microbiome from a disease-associated state to a healthier one. Direct manipulation techniques of the species' assemblage are currently available, e.g., using probiotics or narrow-spectrum antibiotics to introduce or eliminate specific taxa. However, predicting the species' abundances at the new state remains a challenge, mainly due to the difficulties of deciphering the delicate underlying network of ecological interactions or constructing a predictive model for such complex ecosystems. RESULTS Here, we propose a model-free method to predict the species' abundances at the new steady state based on their presence/absence configuration by utilizing a multi-dimensional k-nearest-neighbors (kNN) regression algorithm. By analyzing data from numeric simulations of ecological dynamics, we show that our predictions, which consider the presence/absence of all species holistically, outperform both the null model that uses the statistics of each species independently and a predictive neural network model. We analyze real metagenomic data of human-associated microbial communities and find that by relying on a small number of "neighboring" samples, i.e., samples with similar species assemblage, the kNN predicts the species abundance better than the whole-cohort average. By studying both real metagenomic and simulated data, we show that the predictability of our method is tightly related to the dissimilarity-overlap relationship of the training data. CONCLUSIONS Our results demonstrate how model-free methods can prove useful in predicting microbial communities and may facilitate the development of microbial-based therapies. Video Abstract.
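The kNN idea in this abstract (predict abundances at a new state from samples with a similar presence/absence configuration) can be illustrated with a minimal sketch. This is not the authors' implementation: the Jaccard distance, the unweighted neighbor average, the renormalization step, and the names `jaccard_distance` and `knn_predict_abundance` are all assumptions made for illustration.

```python
import numpy as np


def jaccard_distance(a, b):
    """Jaccard distance between two boolean presence/absence vectors."""
    union = np.logical_or(a, b).sum()
    if union == 0:
        return 0.0
    return 1.0 - np.logical_and(a, b).sum() / union


def knn_predict_abundance(train_abund, query_presence, k=3):
    """Predict relative abundances for a given presence/absence pattern.

    train_abund: (n_samples, n_species) relative abundances of training samples.
    query_presence: (n_species,) boolean vector of species present in the new state.
    Assumption: neighbors are chosen by Jaccard distance on presence/absence
    and their abundance profiles are averaged with equal weights.
    """
    train_abund = np.asarray(train_abund, dtype=float)
    query_presence = np.asarray(query_presence, dtype=bool)
    train_presence = train_abund > 0

    dists = np.array([jaccard_distance(p, query_presence) for p in train_presence])
    nearest = np.argsort(dists, kind="stable")[:k]

    pred = train_abund[nearest].mean(axis=0)
    pred[~query_presence] = 0.0  # absent species stay at zero
    total = pred.sum()
    return pred / total if total > 0 else pred


# Toy example: four training samples, three species.
train = [[0.5, 0.5, 0.0],
         [0.6, 0.4, 0.0],
         [0.0, 0.2, 0.8],
         [0.0, 0.3, 0.7]]
print(knn_predict_abundance(train, [True, True, False], k=2))
```

The whole-cohort average mentioned in the abstract corresponds to setting `k` to the number of training samples, which makes the comparison between a few "neighboring" samples and the full cohort easy to reproduce on toy data.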
Affiliation(s)
- Eitan E Asher
  - Physics Department, Bar-Ilan University, Ramat-Gan, Israel
- Amir Bashan
  - Physics Department, Bar-Ilan University, Ramat-Gan, Israel
5
Curry KD, Yu FB, Vance SE, Segarra S, Bhaya D, Chikhi R, Rocha EP, Treangen TJ. Reference-free Structural Variant Detection in Microbiomes via Long-read Coassembly Graphs. bioRxiv 2024:2024.01.25.577285. [PMID: 38352454 PMCID: PMC10862772 DOI: 10.1101/2024.01.25.577285] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/20/2024]
Abstract
Bacterial genome dynamics are vital for understanding the mechanisms underlying microbial adaptation, growth, and their broader impact on host phenotype. Structural variants (SVs), genomic alterations of 10 base pairs or more, play a pivotal role in driving evolutionary processes and maintaining genomic heterogeneity within bacterial populations. While SV detection in isolate genomes is relatively straightforward, metagenomes present broader challenges due to the absence of clear reference genomes and the presence of mixed strains. In response, our proposed method, rhea, forgoes reference genomes and metagenome-assembled genomes (MAGs) and instead uses a single metagenome coassembly graph constructed from all samples in a series. The log fold change in graph coverage between subsequent samples is then calculated to call SVs that are thriving or declining throughout the series. We show that rhea outperforms existing methods for SV and horizontal gene transfer (HGT) detection in two simulated mock metagenomes, which is particularly noticeable as the simulated reads diverge from reference genomes and strain diversity increases. We additionally demonstrate use cases for rhea on series metagenomic data of environmental and fermented food microbiomes to detect specific sequence alterations between subsequent time and temperature samples, suggesting host advantage. Our approach leverages raw read patterns rather than references or MAGs to include all sequencing reads in the analysis, and thus provides versatility in studying SVs across diverse and poorly characterized microbial communities for more comprehensive insights into microbial genome dynamics.
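The coverage comparison this abstract describes, a log fold change between subsequent samples in a series, can be sketched as below. This is only an illustration of the arithmetic, not rhea's code: the log base, the pseudocount, the threshold, and the function names `log_fold_changes` and `flag_changes` are assumptions, and the abstract does not state how coverage is normalized or how calls are made.

```python
import numpy as np


def log_fold_changes(coverage, pseudocount=1.0):
    """Log2 fold change in coverage between consecutive samples.

    coverage: (n_samples, n_nodes) array, rows ordered along the series.
    Returns an (n_samples - 1, n_nodes) array.
    Assumption: a pseudocount avoids division by zero for uncovered nodes.
    """
    cov = np.asarray(coverage, dtype=float) + pseudocount
    return np.log2(cov[1:] / cov[:-1])


def flag_changes(lfc, threshold=1.0):
    """Mark graph nodes as increasing (+1), decreasing (-1) or stable (0)."""
    return np.where(lfc >= threshold, 1, np.where(lfc <= -threshold, -1, 0))


# Toy series: two samples, three graph nodes.
coverage = [[3, 7, 15],
            [7, 7, 3]]
lfc = log_fold_changes(coverage)
print(lfc)                # log2 ratios of (coverage + 1)
print(flag_changes(lfc))  # which nodes rise or fall between the samples
```

Working on ratios rather than raw differences keeps a doubling from 4 to 8 comparable to a doubling from 400 to 800, which is one common reason to use a log fold change here.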
Affiliation(s)
- Kristen D. Curry
  - Rice University, Department of Computer Science, Houston, TX 77005, United States
  - Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Microbial Evolutionary Genomics, 75015 Paris, France
- Summer E. Vance
  - University of California, Berkeley, Department of Environmental Science, Policy, and Management, Berkeley, CA 94720, United States
- Santiago Segarra
  - Rice University, Department of Electrical and Computer Engineering, Houston, TX 77005, United States
- Devaki Bhaya
  - Carnegie Institution for Science, Department of Plant Biology, Stanford, CA 94305, United States
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, Sequence Bioinformatics unit, 75015 Paris, France
- Eduardo P.C. Rocha
  - Institut Pasteur, Université Paris Cité, CNRS, UMR3525, Microbial Evolutionary Genomics, 75015 Paris, France
- Todd J. Treangen
  - Rice University, Department of Computer Science, Houston, TX 77005, United States
6
Muller E, Shiryan I, Borenstein E. Multi-omic integration of microbiome data for identifying disease-associated modules. bioRxiv 2024:2023.07.03.547607. [PMID: 37461534 PMCID: PMC10349976 DOI: 10.1101/2023.07.03.547607] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/27/2023]
Abstract
The human gut microbiome is a complex ecosystem with profound implications for health and disease. This recognition has led to a surge in multi-omic microbiome studies, employing various molecular assays to elucidate the microbiome's role in diseases across multiple functional layers. However, despite the clear value of these multi-omic datasets, rigorous integrative analysis of such data poses significant challenges, hindering a comprehensive understanding of microbiome-disease interactions. Perhaps most notably, multiple approaches, including univariate and multivariate analyses, as well as machine learning, have been applied to such data to identify disease-associated markers, namely, specific features (e.g., species, pathways, metabolites) that are significantly altered in disease state. These methods, however, often yield extensive lists of features associated with the disease without effectively capturing the multi-layered structure of multi-omic data or offering clear, interpretable hypotheses about underlying microbiome-disease mechanisms. Here, we address this challenge by introducing MintTea - an intermediate integration-based method for analyzing multi-omic microbiome data. MintTea combines a canonical correlation analysis (CCA) extension, consensus analysis, and an evaluation protocol to robustly identify disease-associated multi-omic modules. Each such module consists of a set of features from the various omics that both shift in concord, and collectively associate with the disease. Applying MintTea to diverse case-control cohorts with multi-omic data, we show that this framework is able to capture modules with high predictive power for disease, significant cross-omic correlations, and alignment with known microbiome-disease associations. 
For example, analyzing samples from a metabolic syndrome (MS) study, we found a MS-associated module comprising of a highly correlated cluster of serum glutamate- and TCA cycle-related metabolites, as well as bacterial species previously implicated in insulin resistance. In another cohort, we identified a module associated with late-stage colorectal cancer, featuring Peptostreptococcus and Gemella species and several fecal amino acids, in agreement with these species' reported role in the metabolism of these amino acids and their coordinated increase in abundance during disease development. Finally, comparing modules identified in different datasets, we detected multiple significant overlaps, suggesting common interactions between microbiome features. Combined, this work serves as a proof of concept for the potential benefits of advanced integration methods in generating integrated multi-omic hypotheses underlying microbiome-disease interactions and a promising avenue for researchers seeking systems-level insights into coherent mechanisms governing microbiome-related diseases.
7
Liao H, Shang J, Sun Y. GDmicro: classifying host disease status with GCN and deep adaptation network based on the human gut microbiome data. Bioinformatics 2023; 39:btad747. [PMID: 38085234 PMCID: PMC10749762 DOI: 10.1093/bioinformatics/btad747] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/12/2023] [Revised: 11/16/2023] [Accepted: 12/11/2023] [Indexed: 12/27/2023] Open
Abstract
MOTIVATION With advances in metagenomic sequencing technologies, there are accumulating studies revealing the associations between the human gut microbiome and some human diseases. These associations shed light on using gut microbiome data to distinguish case and control samples of a specific disease, which is also called host disease status classification. Importantly, using learning-based models to distinguish the disease and control samples is expected to identify important biomarkers more accurately than abundance-based statistical analysis. However, available tools have not fully addressed two challenges associated with this task: limited labeled microbiome data and decreased accuracy in cross-studies. The confounding factors, such as the diet, technical biases in sample collection/sequencing across different studies/cohorts often jeopardize the generalization of the learning model. RESULTS To address these challenges, we develop a new tool GDmicro, which combines semi-supervised learning and domain adaptation to achieve a more generalized model using limited labeled samples. We evaluated GDmicro on human gut microbiome data from 11 cohorts covering 5 different diseases. The results show that GDmicro has better performance and robustness than state-of-the-art tools. In particular, it improves the AUC from 0.783 to 0.949 in identifying inflammatory bowel disease. Furthermore, GDmicro can identify potential biomarkers with greater accuracy than abundance-based statistical analysis methods. It also reveals the contribution of these biomarkers to the host's disease status. AVAILABILITY AND IMPLEMENTATION https://github.com/liaoherui/GDmicro.
Affiliation(s)
- Herui Liao
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), 518057, China
- Jiayu Shang
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), 518057, China
- Yanni Sun
  - Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), 518057, China
8
Hossain PS, Kim K, Uddin J, Samad MA, Choi K. Enhancing Taxonomic Categorization of DNA Sequences with Deep Learning: A Multi-Label Approach. Bioengineering (Basel) 2023; 10:1293. [PMID: 38002417 PMCID: PMC10669241 DOI: 10.3390/bioengineering10111293] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/09/2023] [Revised: 11/02/2023] [Accepted: 11/05/2023] [Indexed: 11/26/2023] Open
Abstract
The application of deep learning for taxonomic categorization of DNA sequences is investigated in this study. Two deep learning architectures, namely the Stacked Convolutional Autoencoder (SCAE) with Multilabel Extreme Learning Machine (MLELM) and the Variational Convolutional Autoencoder (VCAE) with MLELM, have been proposed. These designs provide precise feature maps for individual and inter-label interactions within DNA sequences, capturing their spatial and temporal properties. The collected features are subsequently fed into MLELM networks, which yield soft classification scores and hard labels. The proposed algorithms underwent thorough training and testing on unsupervised data, whereby one or more labels were concurrently taken into account. The introduction of the clade label resulted in improved accuracy for both models compared to the class or genus labels, probably owing to the occurrence of large clusters of similar nucleotides inside a DNA strand. In all circumstances, the VCAE-MLELM model consistently outperformed the SCAE-MLELM model. The best accuracy attained by the VCAE-MLELM model when the clade and family labels were combined was 94%. However, accuracy ratings for single-label categorization using either approach were less than 65%. The approach's effectiveness is based on MLELM networks, which record connected patterns across classes for accurate label categorization. This study advances deep learning in biological taxonomy by emphasizing the significance of combining numerous labels for increased classification accuracy.
Affiliation(s)
- Kyungsup Kim
  - Department of Computer Engineering, Chungnam National University, Yuseong-gu, Daejeon 34134, Republic of Korea
- Jia Uddin
  - Artificial Intelligence and Big Data Department, Endicott College, Woosong University, Daejeon 34606, Republic of Korea
- Md Abdus Samad
  - Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si 38541, Gyeongsangbuk-do, Republic of Korea
- Kwonhue Choi
  - Department of Information and Communication Engineering, Yeungnam University, Gyeongsan-si 38541, Gyeongsangbuk-do, Republic of Korea
9
Liu Y, Zhang YZ, Imoto S. Microbial Gene Ontology informed deep neural network for microbe functionality discovery in human diseases. PLoS One 2023; 18:e0290307. [PMID: 37603579 PMCID: PMC10441785 DOI: 10.1371/journal.pone.0290307] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 02/27/2023] [Accepted: 08/04/2023] [Indexed: 08/23/2023] Open
Abstract
The human microbiome plays a crucial role in human health and is associated with a number of human diseases. Determining microbiome functional roles in human diseases remains a biological challenge due to the high dimensionality of metagenome gene features. Moreover, existing models offer limited biological interpretability, leaving the functional role of microbes in human diseases unexplored. Here we propose a neural network-based model that incorporates the Gene Ontology (GO) relationship network to discover microbe functionality in human diseases. We use four benchmark datasets, covering diabetes, liver cirrhosis, inflammatory bowel disease, and colorectal cancer, to explore microbe functionality in these diseases. Our model discovered and visualized novel candidate microbiome genes and their functions by calculating an importance score for each gene and GO term in the network. Furthermore, we demonstrate that our model achieves competitive performance in predicting disease compared with other models that are not informed by Gene Ontology. The discovered candidate microbiome genes and their functions provide novel insights into microbe functional contribution.
Affiliation(s)
- Yunjie Liu
  - Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Yao-zhong Zhang
  - Institute of Medical Science, The University of Tokyo, Tokyo, Japan
- Seiya Imoto
  - Institute of Medical Science, The University of Tokyo, Tokyo, Japan
10
Venkatachala Appa Swamy M, Periyasamy J, Thangavel M, Khan SB, Almusharraf A, Santhanam P, Ramaraj V, Elsisi M. Design and Development of IoT and Deep Ensemble Learning Based Model for Disease Monitoring and Prediction. Diagnostics (Basel) 2023; 13:1942. [PMID: 37296794 DOI: 10.3390/diagnostics13111942] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/07/2023] [Revised: 05/04/2023] [Accepted: 05/11/2023] [Indexed: 06/12/2023] Open
Abstract
With the rapidly increasing reliance on advances in IoT, we persist towards pushing technology to new heights. From ordering food online to gene editing-based personalized healthcare, disruptive technologies like ML and AI continue to grow beyond our wildest dreams. Early detection and treatment through AI-assisted diagnostic models have outperformed human intelligence. In many cases, these tools can act upon the structured data containing probable symptoms, offer medication schedules based on the appropriate code related to diagnosis conventions, and predict adverse drug effects, if any, in accordance with medications. Utilizing AI and IoT in healthcare has facilitated innumerable benefits like minimizing cost, reducing hospital-obtained infections, decreasing mortality and morbidity etc. DL algorithms have opened up several frontiers by contributing towards healthcare opportunities through their ability to understand and learn from different levels of demonstration and generalization, which is significant in data analysis and interpretation. In contrast to ML which relies more on structured, labeled data and domain expertise to facilitate feature extractions, DL employs human-like cognitive abilities to extract hidden relationships and patterns from uncategorized data. Through the efficient application of DL techniques on the medical dataset, precise prediction, and classification of infectious/rare diseases, avoiding surgeries that can be preventable, minimization of over-dosage of harmful contrast agents for scans and biopsies can be reduced to a greater extent in future. Our study is focused on deploying ensemble deep learning algorithms and IoT devices to design and develop a diagnostic model that can effectively analyze medical Big Data and diagnose diseases by identifying abnormalities in early stages through medical images provided as input. 
This AI-assisted diagnostic model based on Ensemble Deep learning aims to be a valuable tool for healthcare systems and patients through its ability to diagnose diseases in the initial stages and present valuable insights to facilitate personalized treatment by aggregating the prediction of each base model and generating a final prediction.
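The aggregation step mentioned at the end of this abstract (combining the prediction of each base model into a final prediction) can be illustrated with soft voting. The abstract does not say which aggregation rule the authors use, so the averaging of class probabilities, the optional weights, and the name `soft_vote` are assumptions chosen for a minimal example.

```python
import numpy as np


def soft_vote(prob_list, weights=None):
    """Aggregate per-model class probabilities into a final prediction.

    prob_list: list of (n_samples, n_classes) arrays, one per base model.
    weights: optional per-model weights (hypothetical; equal if None).
    Returns the averaged probabilities and the predicted class indices.
    """
    probs = np.stack([np.asarray(p, dtype=float) for p in prob_list])
    avg = np.average(probs, axis=0, weights=weights)
    return avg, avg.argmax(axis=1)


# Toy example: three base models, two samples, two classes.
model_a = [[0.9, 0.1], [0.4, 0.6]]
model_b = [[0.7, 0.3], [0.6, 0.4]]
model_c = [[0.8, 0.2], [0.2, 0.8]]
avg, labels = soft_vote([model_a, model_b, model_c])
print(avg)
print(labels)
```

Majority voting on hard labels or a trained meta-learner (stacking) are common alternatives; soft voting is used here only because it is the shortest to show.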
Affiliation(s)
- Jayalakshmi Periyasamy
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Muthamilselvan Thangavel
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Surbhi B Khan
  - Department of Electrical and Computer Engineering, Lebanese American University, Byblos 13-5053, Lebanon
  - Department of Data Science, School of Science, Engineering and Environment, University of Salford, Manchester M5 4WT, UK
- Ahlam Almusharraf
  - Department of Business Administration, College of Business and Administration, Princess Nourah bint Abdulrahman University, P.O. Box 84428, Riyadh 11671, Saudi Arabia
- Prasanna Santhanam
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Vijayan Ramaraj
  - School of Information Technology and Engineering, Vellore Institute of Technology, Vellore 632014, Tamil Nadu, India
- Mahmoud Elsisi
  - Department of Electrical Engineering, National Kaohsiung University of Science and Technology, Kaohsiung City 807618, Taiwan
  - Department of Electrical Engineering, Faculty of Engineering (Shoubra), Benha University, 108 Shoubra St., Cairo P.O. Box 11241, Egypt
11
Hallsworth JE, Udaondo Z, Pedrós‐Alió C, Höfer J, Benison KC, Lloyd KG, Cordero RJB, de Campos CBL, Yakimov MM, Amils R. Scientific novelty beyond the experiment. Microb Biotechnol 2023; 16:1131-1173. [PMID: 36786388 PMCID: PMC10221578 DOI: 10.1111/1751-7915.14222] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/20/2022] [Revised: 01/09/2023] [Accepted: 01/11/2023] [Indexed: 02/15/2023] Open
Abstract
Practical experiments drive important scientific discoveries in biology, but theory-based research studies also contribute novel (sometimes paradigm-changing) findings. Here, we appraise the roles of theory-based approaches focusing on the experiment-dominated wet-biology research areas of microbial growth and survival, cell physiology, host-pathogen interactions, and competitive or symbiotic interactions. Additional examples relate to analyses of genome-sequence data, climate change and planetary health, habitability, and astrobiology. We assess the importance of thought at each step of the research process; the roles of natural philosophy, and inconsistencies in logic and language, as drivers of scientific progress; the value of thought experiments; the use and limitations of artificial intelligence technologies, including their potential for interdisciplinary and transdisciplinary research; and other instances when theory is the most-direct and most-scientifically robust route to scientific novelty including the development of techniques for practical experimentation or fieldwork. We highlight the intrinsic need for human engagement in scientific innovation, an issue pertinent to the ongoing controversy over papers authored using/authored by artificial intelligence (such as the large language model/chatbot ChatGPT). Other issues discussed are the way in which aspects of language can bias thinking towards the spatial rather than the temporal (and how this biased thinking can lead to skewed scientific terminology); receptivity to research that is non-mainstream; and the importance of theory-based science in education and epistemology. Whereas we briefly highlight classic works (those by Oakes Ames, Francis H.C. Crick and James D. Watson, Charles R. Darwin, Albert Einstein, James E. Lovelock, Lynn Margulis, Gilbert Ryle, Erwin R.J.A. Schrödinger, Alan M. Turing, and others), the focus is on microbiology studies that are more-recent, discussing these in the context of the scientific process and the types of scientific novelty that they represent. These include several studies carried out during the 2020 to 2022 lockdowns of the COVID-19 pandemic when access to research laboratories was disallowed (or limited). We interviewed the authors of some of the featured microbiology-related papers and (although we ourselves are involved in laboratory experiments and practical fieldwork) also drew from our own research experiences showing that such studies can not only produce new scientific findings but can also transcend barriers between disciplines, act counter to scientific reductionism, integrate biological data across different timescales and levels of complexity, and circumvent constraints imposed by practical techniques. In relation to urgent research needs, we believe that climate change and other global challenges may require approaches beyond the experiment.
Affiliation(s)
- John E. Hallsworth
  - Institute for Global Food Security, School of Biological Sciences, Queen's University Belfast, Belfast, UK
- Zulema Udaondo
  - Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, Arkansas, USA
- Carlos Pedrós‐Alió
  - Department of Systems Biology, Centro Nacional de Biotecnología (CSIC), Madrid, Spain
- Juan Höfer
  - Escuela de Ciencias del Mar, Pontificia Universidad Católica de Valparaíso, Valparaíso, Chile
- Kathleen C. Benison
  - Department of Geology and Geography, West Virginia University, Morgantown, West Virginia, USA
- Karen G. Lloyd
  - Microbiology Department, University of Tennessee, Knoxville, Tennessee, USA
- Radamés J. B. Cordero
  - Department of Molecular Microbiology and Immunology, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland, USA
- Claudia B. L. de Campos
  - Institute of Science and Technology, Universidade Federal de Sao Paulo (UNIFESP), São José dos Campos, SP, Brazil
- Ricardo Amils
  - Department of Molecular Biology, Centro de Biología Molecular Severo Ochoa (CSIC‐UAM), Nicolás Cabrera n° 1, Universidad Autónoma de Madrid, Madrid, Spain
  - Department of Planetology and Habitability, Centro de Astrobiología (INTA‐CSIC), Torrejón de Ardoz, Spain
12
Fung DLX, Li X, Leung CK, Hu P. A self-knowledge distillation-driven CNN-LSTM model for predicting disease outcomes using longitudinal microbiome data. Bioinform Adv 2023; 3:vbad059. [PMID: 37228387 PMCID: PMC10203376 DOI: 10.1093/bioadv/vbad059] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Revised: 04/03/2023] [Accepted: 05/01/2023] [Indexed: 05/27/2023]
Abstract
Motivation: The human microbiome is complex and highly dynamic. Dynamic patterns of the microbiome can capture more information than single-time-point inference because they contain information on temporal changes. However, this dynamic information can be hard to capture: longitudinal data are difficult to obtain and often contain a large volume of missing values, which, together with heterogeneity, makes the data analysis challenging. Results: We propose an efficient hybrid deep learning architecture, a convolutional neural network-long short-term memory (CNN-LSTM) model, combined with self-knowledge distillation to create highly accurate models that analyze longitudinal microbiome profiles to predict disease outcomes. Using our proposed models, we analyzed datasets from the Predicting Response to Standardized Pediatric Colitis Therapy (PROTECT) study and the DIABIMMUNE study. We showed a significant improvement in the area under the receiver operating characteristic curve scores, achieving 0.889 and 0.798 on the PROTECT and DIABIMMUNE studies, respectively, compared with state-of-the-art temporal deep learning models. Our findings provide an effective artificial intelligence-based tool to predict disease outcomes using longitudinal microbiome profiles collected from patients. Availability and implementation: The data and source code can be accessed at https://github.com/darylfung96/UC-disease-TL.
Affiliation(s)
- Daryl L X Fung
  - Department of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
- Xu Li
  - Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
- Carson K Leung
  - Department of Computer Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
13
Khachatryan L, Xiang Y, Ivanov A, Glaab E, Graham G, Granata I, Giordano M, Maddalena L, Piccirillo M, Manipur I, Baruzzo G, Cappellato M, Avot B, Stan A, Battey J, Lo Sasso G, Boue S, Ivanov NV, Peitsch MC, Hoeng J, Falquet L, Di Camillo B, Guarracino MR, Ulyantsev V, Sierro N, Poussin C. Results and lessons learned from the sbv IMPROVER metagenomics diagnostics for inflammatory bowel disease challenge. Sci Rep 2023; 13:6303. [PMID: 37072468 PMCID: PMC10113391 DOI: 10.1038/s41598-023-33050-0] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 11/29/2022] [Accepted: 04/06/2023] [Indexed: 05/03/2023]
Abstract
A growing body of evidence links gut microbiota changes with inflammatory bowel disease (IBD), raising the potential benefit of exploiting metagenomics data for non-invasive IBD diagnostics. The sbv IMPROVER metagenomics diagnosis for inflammatory bowel disease challenge investigated computational metagenomics methods for discriminating IBD and nonIBD subjects. Participants in this challenge were given independent training and test metagenomics data from IBD and nonIBD subjects, which could be either raw read data (sub-challenge 1, SC1) or processed Taxonomy- and Function-based profiles (sub-challenge 2, SC2). A total of 81 anonymized submissions were received between September 2019 and March 2020. Most participants' predictions performed better than random predictions in classifying IBD versus nonIBD, Ulcerative Colitis (UC) versus nonIBD, and Crohn's Disease (CD) versus nonIBD. However, discrimination between UC and CD remains challenging, with classification quality similar to that of the set of random predictions. We analyzed the class prediction accuracy, the metagenomics features used by the teams, and the computational methods used. These results will be openly shared with the scientific community to help advance IBD research and illustrate the application of a range of computational methodologies for effective metagenomic classification.
Affiliation(s)
- Lusine Khachatryan
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Yang Xiang
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Artem Ivanov
  - ITMO University, St. Petersburg, Russian Federation
- Enrico Glaab
  - University of Luxembourg, Luxembourg, Luxembourg
- Adrian Stan
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- James Battey
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Giuseppe Lo Sasso
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Stephanie Boue
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Nikolai V Ivanov
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Manuel C Peitsch
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Julia Hoeng
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Nicolas Sierro
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
- Carine Poussin
  - PMI R&D, Philip Morris Products S.A., Quai Jeanrenaud 5, 2000, Neuchâtel, Switzerland
14
Boodaghidizaji M, Jungles T, Chen T, Zhang B, Landay A, Keshavarzian A, Hamaker B, Ardekani A. Machine learning based gut microbiota pattern and response to fiber as a diagnostic tool for chronic inflammatory diseases. bioRxiv 2023:2023.03.27.534466. [PMID: 37034781 PMCID: PMC10081192 DOI: 10.1101/2023.03.27.534466] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/19/2023]
Abstract
Gut microbiota has been implicated in the pathogenesis of multiple gastrointestinal (GI) and systemic metabolic and inflammatory disorders, in which disrupted gut microbiota composition and function (dysbiosis) has been found in multiple studies. Thus, human microbiome data has the potential to be a great source of information for the diagnosis and characterization (phenotypes, disease course, therapeutic response) of diseases with a dysbiotic microbiota community. However, multiple attempts to leverage gut microbiota taxonomic data for diagnosis and disease characterization have failed because of significant inter-individual variability of the microbiota community and overlap of disrupted microbiota communities among multiple diseases. One potential approach is to look at the microbiota community pattern and its response to microbiota modifiers such as dietary fiber in different disease states. This approach is now feasible thanks to the availability of machine learning, which is able to identify hidden patterns in the human microbiome and predict diseases. Accordingly, the aim of our study was to test the hypothesis that machine learning algorithms can distinguish stool microbiota pattern and microbiota response to fiber between diseases where overlapping dysbiotic microbiota have been previously reported. Here, we have applied machine learning algorithms to distinguish between Parkinson's disease, Crohn's disease (CD), ulcerative colitis (UC), human immunodeficiency virus (HIV), and healthy control (HC) subjects in the presence and absence of fiber treatments. We have shown that machine learning algorithms can classify diseases with accuracy as high as 95%. Furthermore, machine learning methods applied to the microbiome data to predict UC vs CD led to prediction accuracy as high as 90%.
15
Liang C, Wagstaff J, Aharony N, Schmit V, Manheim D. Managing the Transition to Widespread Metagenomic Monitoring: Policy Considerations for Future Biosurveillance. Health Secur 2023; 21:34-45. [PMID: 36629860 PMCID: PMC9940815 DOI: 10.1089/hs.2022.0029] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 01/12/2023]
Abstract
The technological possibilities and future public health importance of metagenomic sequencing have received extensive attention, but there has been little discussion about the policy and regulatory issues that need to be addressed if metagenomic sequencing is adopted as a key technology for biosurveillance. In this article, we introduce metagenomic monitoring as a possible path to eventually replacing current infectious disease monitoring models. Many key enablers are technological, whereas others are not. We therefore highlight key policy challenges and implementation questions that need to be addressed for "widespread metagenomic monitoring" to be possible. Policymakers must address pitfalls like fragmentation of the technological base, private capture of benefits, privacy concerns, the usefulness of the system during nonpandemic times, and how the future systems will enable better response. If these challenges are addressed, the technological and public health promise of metagenomic sequencing can be realized.
Affiliation(s)
- Chelsea Liang
  - Chelsea Liang is an Independent Researcher, University of New South Wales, School of Biotechnology and Biomolecular Sciences, Sydney, Australia
- James Wagstaff
  - James Wagstaff, PhD, is a Research Fellow, Future of Humanity Institute, University of Oxford, Oxford, UK
- Noga Aharony
  - Noga Aharony, MS, is a PhD Student, Department of Systems Biology, Columbia University, New York, NY
- Virginia Schmit
  - Virginia Schmit, PhD, is Director of Research, 1DaySooner, DE, and a Policy Specialist, National Institute of Allergy and Infectious Diseases, Bethesda, MD
- David Manheim
  - David Manheim, PhD, is Head of Policy and Research, ALTER, Rehovot, Israel; Lead Researcher, 1DaySooner, Claymont, DE; Visiting Researcher, Humanities and Arts Department, Technion – Israel Institute of Technology, Haifa, Israel. Address correspondence to: David B. Manheim, 8734 First Avenue, Silver Spring, MD 20910
16
Yang L, Wang S, Altman RB. POPDx: an automated framework for patient phenotyping across 392 246 individuals in the UK Biobank study. J Am Med Inform Assoc 2023; 30:245-255. [PMID: 36469791 PMCID: PMC9846671 DOI: 10.1093/jamia/ocac226] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 08/24/2022] [Revised: 10/19/2022] [Accepted: 11/18/2022] [Indexed: 12/12/2022]
Abstract
OBJECTIVE For the UK Biobank, standardized phenotype codes are associated with patients who have been hospitalized but are missing for many patients who have been treated exclusively in an outpatient setting. We describe a method for phenotype recognition that imputes phenotype codes for all UK Biobank participants. MATERIALS AND METHODS POPDx (Population-based Objective Phenotyping by Deep Extrapolation) is a bilinear machine learning framework for simultaneously estimating the probabilities of 1538 phenotype codes. We extracted phenotypic and health-related information of 392 246 individuals from the UK Biobank for POPDx development and evaluation. A total of 12 803 ICD-10 diagnosis codes of the patients were converted to 1538 phecodes as gold standard labels. The POPDx framework was evaluated and compared to other available methods on automated multiphenotype recognition. RESULTS POPDx can predict phenotypes that are rare or even unobserved in training. We demonstrate substantial improvement of automated multiphenotype recognition across 22 disease categories, and its application in identifying key epidemiological features associated with each phenotype. CONCLUSIONS POPDx helps provide well-defined cohorts for downstream studies. It is a general-purpose method that can be applied to other biobanks with diverse but incomplete data.
Affiliation(s)
- Lu Yang
  - Department of Bioengineering, Stanford University, Stanford, CA, USA
- Sheng Wang
  - Paul G. Allen School of Computer Science and Engineering, University of Washington, Seattle, WA, USA
- Russ B Altman
  - Department of Bioengineering, Stanford University, Stanford, CA, USA; Department of Genetics, Stanford University, Stanford, CA, USA; Department of Medicine, Stanford University, Stanford, CA, USA
17
Loganathan T, Priya Doss C G. The influence of machine learning technologies in gut microbiome research and cancer studies - A review. Life Sci 2022; 311:121118. [DOI: 10.1016/j.lfs.2022.121118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/27/2022] [Revised: 10/19/2022] [Accepted: 10/19/2022] [Indexed: 11/18/2022]
18
Hernández Medina R, Kutuzova S, Nielsen KN, Johansen J, Hansen LH, Nielsen M, Rasmussen S. Machine learning and deep learning applications in microbiome research. ISME Commun 2022; 2:98. [PMID: 37938690 PMCID: PMC9723725 DOI: 10.1038/s43705-022-00182-9] [Citation(s) in RCA: 33] [Impact Index Per Article: 16.5] [Received: 03/24/2022] [Revised: 09/12/2022] [Accepted: 09/16/2022] [Indexed: 05/27/2023]
Abstract
The many microbial communities around us form interactive and dynamic ecosystems called microbiomes. Though concealed from the naked eye, microbiomes govern and influence macroscopic systems including human health, plant resilience, and biogeochemical cycling. Such feats have attracted interest from the scientific community, which has recently turned to machine learning and deep learning methods to interrogate the microbiome and elucidate the relationships between its composition and function. Here, we provide an overview of how the latest microbiome studies harness the inductive prowess of artificial intelligence methods. We start by highlighting that microbiome data - being compositional, sparse, and high-dimensional - necessitates special treatment. We then introduce traditional and novel methods and discuss their strengths and applications. Finally, we discuss the outlook of machine and deep learning pipelines, focusing on bottlenecks and considerations to address them.
Affiliation(s)
- Ricardo Hernández Medina
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Svetlana Kutuzova
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
  - Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
- Knud Nor Nielsen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
  - Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
- Joachim Johansen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
- Lars Hestbjerg Hansen
  - Department of Plant and Environmental Sciences, University of Copenhagen, DK-1871, Frederiksberg, Denmark
- Mads Nielsen
  - Department of Computer Science, University of Copenhagen, DK-2100, Copenhagen Ø, Denmark
- Simon Rasmussen
  - Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, DK-2200, Copenhagen N, Denmark
19
Wen LY, Wang X, Min F. Cost-sensitive microbial data augmentation through matrix factorization. Appl Intell 2022. [DOI: 10.1007/s10489-022-04187-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/24/2022]
20
Díez López C, Montiel González D, Vidaki A, Kayser M. Prediction of Smoking Habits From Class-Imbalanced Saliva Microbiome Data Using Data Augmentation and Machine Learning. Front Microbiol 2022; 13:886201. [PMID: 35928158 PMCID: PMC9343866 DOI: 10.3389/fmicb.2022.886201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/28/2022] [Accepted: 06/21/2022] [Indexed: 11/24/2022]
Abstract
Human microbiome research is moving from characterization and association studies to translational applications in medical research, clinical diagnostics, and others. One of these applications is the prediction of human traits, where machine learning (ML) methods are often employed, but face practical challenges. Class imbalance in available microbiome data is one of the major problems, which, if unaccounted for, leads to spurious prediction accuracies and limits the classifier's generalization. Here, we investigated the predictability of smoking habits from class-imbalanced saliva microbiome data by combining data augmentation techniques to account for class imbalance with ML methods for prediction. We collected publicly available saliva 16S rRNA gene sequencing data and smoking habit metadata demonstrating a serious class imbalance problem, i.e., 175 current vs. 1,070 non-current smokers. Three data augmentation techniques (synthetic minority over-sampling technique, adaptive synthetic, and tree-based associative data augmentation) were applied together with seven ML methods: logistic regression, k-nearest neighbors, support vector machine with linear and radial kernels, decision trees, random forest, and extreme gradient boosting. K-fold nested cross-validation was used with the different augmented data types and baseline non-augmented data to validate the prediction outcome. Combining data augmentation with ML generally outperformed baseline methods in our dataset. The final prediction model combined tree-based associative data augmentation and support vector machine with linear kernel, and achieved a classification performance expressed as Matthews correlation coefficient of 0.36 and AUC of 0.81. Our method successfully addresses the problem of class imbalance in microbiome data for reliable prediction of smoking habits.
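As an illustration of why the abstract above reports the Matthews correlation coefficient (MCC) alongside AUC, the following minimal sketch compares accuracy and MCC for a trivial classifier on a dataset with the same class sizes (175 current vs. 1,070 non-current smokers). The helper names and the always-majority classifier are assumptions made for illustration; they are not the authors' pipeline.

```python
import math


def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels coded as 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, tn, fp, fn


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label matches the true label."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(y_true, y_pred):
    """Matthews correlation coefficient; defined as 0.0 when the denominator is zero."""
    tp, tn, fp, fn = confusion_counts(y_true, y_pred)
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Class sizes taken from the abstract: 1 = current smoker, 0 = non-current.
y_true = [1] * 175 + [0] * 1070

# Hypothetical baseline that always predicts the majority class.
always_majority = [0] * len(y_true)

print(f"accuracy = {accuracy(y_true, always_majority):.3f}")  # high despite no skill
print(f"MCC      = {mcc(y_true, always_majority):.3f}")       # 0.0: no skill
```

The baseline reaches roughly 86% accuracy without identifying a single current smoker, whereas its MCC is zero, which is why an imbalance-aware metric is the more informative choice here.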
Affiliation(s)
- Manfred Kayser
  - Department of Genetic Identification, Erasmus MC University Medical Center Rotterdam, Rotterdam, Netherlands
21
Zhou X, Chen L, Liu HX. Applications of Machine Learning Models to Predict and Prevent Obesity: A Mini-Review. Front Nutr 2022; 9:933130. [PMID: 35866076 PMCID: PMC9294383 DOI: 10.3389/fnut.2022.933130] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/30/2022] [Accepted: 05/19/2022] [Indexed: 11/28/2022]
Abstract
Research on obesity and related diseases has received attention from government policymakers; interventions targeting nutrient intake, dietary patterns, and physical activity are deployed globally. An urgent issue now is how we can improve the efficiency of obesity research and obesity interventions. Currently, machine learning (ML) methods are widely applied in obesity-related studies to detect obesity biomarkers or to discover intervention strategies that optimize weight loss results. In addition, open-source availability of these algorithms is necessary to check the reproducibility of the research results. Furthermore, appropriate application of these algorithms could greatly improve the efficiency of similar studies by other researchers. Here, we present a mini-review of several open-source ML algorithms, platforms, and related databases that are of particular interest or can be applied in the field of obesity research. We focus on ML algorithms adopted in nutrition, environmental and social factors, genetics or genomics, and the microbiome.
Affiliation(s)
- Xiaobei Zhou
  - Health Sciences Institute, China Medical University, Shenyang, China
  - Liaoning Key Laboratory of Obesity and Glucose/Lipid Associated Metabolic Diseases, China Medical University, Shenyang, China
- Lei Chen
  - Health Sciences Institute, China Medical University, Shenyang, China
  - Liaoning Key Laboratory of Obesity and Glucose/Lipid Associated Metabolic Diseases, China Medical University, Shenyang, China
  - Institute of Life Sciences, China Medical University, Shenyang, China
- Hui-Xin Liu
  - Health Sciences Institute, China Medical University, Shenyang, China
  - Liaoning Key Laboratory of Obesity and Glucose/Lipid Associated Metabolic Diseases, China Medical University, Shenyang, China
  - Institute of Life Sciences, China Medical University, Shenyang, China
  - *Correspondence: Hui-Xin Liu
22
Li B, Zhong D, Qiao J, Jiang X. GNPI: Graph normalization to integrate phylogenetic information for metagenomic host phenotype prediction. Methods 2022; 205:11-17. [PMID: 35636652 DOI: 10.1016/j.ymeth.2022.05.007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2022] [Revised: 05/17/2022] [Accepted: 05/26/2022] [Indexed: 11/24/2022]
Abstract
Microorganisms play important roles in our lives, especially in metabolism and disease. Determining the probability that a person suffers from a specific disease, and the severity of that disease, from microbial genes is crucial for understanding the relationship between microbes and diseases. Previous studies could extract the topological information of phylogenetic trees and integrate it into metagenomic datasets, thus enabling classifiers to learn more information from limited datasets and improving the performance of the models. In this paper, we propose the GNPI model to better learn the structure of phylogenetic trees. GNPI maintains the original vector format of metagenomic datasets, whereas previous research had to change the input form to matrices. The vector-like form of the input data can be easily adopted in baseline machine learning models and is also suitable for deep learning models. Datasets processed with GNPI help enhance the accuracy of machine learning and deep learning models on three different datasets. GNPI is an interpretable data processing method for host phenotype prediction and other bioinformatics tasks.
Affiliation(s)
- Bojing Li
  - Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China
- Duo Zhong
  - Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China
- Jimei Qiao
  - Mathematics and Science College, Shanghai Normal University, Shanghai, China
- Xingpeng Jiang
  - Hubei Key Laboratory of Artificial Intelligence and Smart Learning, Central China Normal University, Wuhan, China; School of Computer, Central China Normal University, Wuhan, China; National Language Resources Monitoring & Research Center for Network Media, Central China Normal University, Wuhan, China
23
Bakir-Gungor B, Hacılar H, Jabeer A, Nalbantoglu OU, Aran O, Yousef M. Inflammatory bowel disease biomarkers of human gut microbiota selected via different feature selection methods. PeerJ 2022; 10:e13205. [PMID: 35497193 PMCID: PMC9048649 DOI: 10.7717/peerj.13205] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 02/03/2021] [Accepted: 03/10/2022] [Indexed: 01/12/2023]
Abstract
The tremendous boost in next generation sequencing and in the "omics" technologies makes it possible to characterize the human gut microbiome, the collective genomes of the microbial community that resides in our gastrointestinal tract. Although some of these microorganisms are considered to be essential regulators of our immune system, the alteration of the complexity and eubiotic state of microbiota might promote autoimmune and inflammatory disorders such as diabetes, rheumatoid arthritis, inflammatory bowel diseases (IBD), obesity, and carcinogenesis. IBD, comprising Crohn's disease and ulcerative colitis, is a gut-related, multifactorial disease with an unknown etiology. IBD presents defects in the detection and control of the gut microbiota, associated with unbalanced immune reactions, genetic mutations that confer susceptibility to the disease, and complex environmental conditions such as westernized lifestyle. Although some existing studies attempt to unveil the composition and functional capacity of the gut microbiome in relation to IBD, a comprehensive picture of the gut microbiome in IBD patients is far from complete. Because of the complexity of metagenomic studies, state-of-the-art machine learning techniques have become popular for addressing a wide range of questions in metagenomic data analysis. In this regard, using an IBD-associated metagenomics dataset, this study utilizes both supervised and unsupervised machine learning algorithms (i) to generate a classification model that aids IBD diagnosis, (ii) to discover IBD-associated biomarkers, and (iii) to discover subgroups of IBD patients using k-means and hierarchical clustering approaches.
To deal with the high dimensionality of features, we applied robust feature selection algorithms such as Conditional Mutual Information Maximization (CMIM), Fast Correlation Based Filter (FCBF), min redundancy max relevance (mRMR), Select K Best (SKB), Information Gain (IG) and Extreme Gradient Boosting (XGBoost). In our experiments with 100-fold Monte Carlo cross-validation (MCCV), XGBoost, IG, and SKB methods showed a considerable effect in terms of minimizing the microbiota used for the diagnosis of IBD and thus reducing the cost and time. We observed that compared to Decision Tree, Support Vector Machine, Logitboost, Adaboost, and stacking ensemble classifiers, our Random Forest classifier resulted in better performance measures for the classification of IBD. Our findings revealed potential microbiome-mediated mechanisms of IBD and these findings might be useful for the development of microbiome-based diagnostics.
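The 100-fold Monte Carlo cross-validation (MCCV) mentioned above can be sketched as repeated random train/test partitioning. This is a minimal standard-library illustration, not the authors' code: the function name `monte_carlo_splits`, the 20% test fraction, and the seed are assumptions, since the abstract states only the number of repetitions.

```python
import random


def monte_carlo_splits(n_samples, n_repeats=100, test_fraction=0.2, seed=0):
    """Yield (train_indices, test_indices) for repeated random hold-out splits.

    Unlike k-fold cross-validation, each repetition draws a fresh random
    partition, so a sample may land in the test set in several repetitions
    or in none. The test fraction is an assumed value for illustration.
    """
    rng = random.Random(seed)
    n_test = max(1, round(n_samples * test_fraction))
    indices = list(range(n_samples))
    for _ in range(n_repeats):
        shuffled = indices[:]
        rng.shuffle(shuffled)
        yield sorted(shuffled[n_test:]), sorted(shuffled[:n_test])


# Hypothetical usage: 50 samples, 100 repetitions as in the abstract.
splits = list(monte_carlo_splits(n_samples=50, n_repeats=100))
train_idx, test_idx = splits[0]
print(len(splits), len(train_idx), len(test_idx))
```

In practice a classifier would be fitted on each training partition and scored on the matching test partition, with the performance measure averaged over all repetitions.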
Affiliation(s)
- Burcu Bakir-Gungor
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Hilal Hacılar
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Amhar Jabeer
  - Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
- Oya Aran
  - TETAM, Bogazici University, Istanbul, Turkey
- Malik Yousef
  - Zefat Academic College, Zefat, Israel; Galilee Digital Health Research Center, Zefat Academic College, Zefat, Israel
24
Giliberti R, Cavaliere S, Mauriello IE, Ercolini D, Pasolli E. Host phenotype classification from human microbiome data is mainly driven by the presence of microbial taxa. PLoS Comput Biol 2022; 18:e1010066. [PMID: 35446845 DOI: 10.1371/journal.pcbi.1010066] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 10/13/2021] [Revised: 05/03/2022] [Accepted: 03/29/2022] [Indexed: 12/14/2022]
Abstract
Machine learning-based classification approaches are widely used to predict host phenotypes from microbiome data. Classifiers are typically employed by considering operational taxonomic units or relative abundance profiles as input features. Such types of data are intrinsically sparse, which opens the opportunity to make predictions from the presence/absence rather than the relative abundance of microbial taxa. This also poses the question whether it is the presence rather than the abundance of particular taxa to be relevant for discrimination purposes, an aspect that has been so far overlooked in the literature. In this paper, we aim at filling this gap by performing a meta-analysis on 4,128 publicly available metagenomes associated with multiple case-control studies. At species-level taxonomic resolution, we show that it is the presence rather than the relative abundance of specific microbial taxa to be important when building classification models. Such findings are robust to the choice of the classifier and confirmed by statistical tests applied to identifying differentially abundant/present taxa. Results are further confirmed at coarser taxonomic resolutions and validated on 4,026 additional 16S rRNA samples coming from 30 public case-control studies.
The composition of the human microbiome has been linked to a large number of different diseases. In this context, classification methodologies based on machine learning approaches have represented a promising tool for diagnostic purposes from metagenomics data. The link between microbial population composition and host phenotypes has been usually performed by considering taxonomic profiles represented by relative abundances of microbial species. In this study, we show that it is more the presence rather than the relative abundance of microbial taxa to be relevant to maximize classification accuracy. This is accomplished by conducting a meta-analysis on more than 4,000 shotgun metagenomes coming from 25 case-control studies and in which original relative abundance data are degraded to presence/absence profiles. Findings are also extended to 16S rRNA data and advance the research field in building prediction models directly from human microbiome data.
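The step of degrading relative abundances to presence/absence profiles can be illustrated with a minimal Python sketch. The function name, the taxa, the abundance values, and the `threshold` parameter are hypothetical choices made for illustration; the abstract does not state which detection threshold the authors applied.

```python
def to_presence_absence(profile, threshold=0.0):
    """Map each taxon's relative abundance to 1 (present) or 0 (absent).

    `threshold` is an assumed detection cutoff: a taxon counts as present
    only if its abundance is strictly greater than this value.
    """
    return {taxon: int(abundance > threshold) for taxon, abundance in profile.items()}


# Hypothetical species-level relative abundances for one sample.
sample = {
    "Bacteroides_vulgatus": 0.42,
    "Escherichia_coli": 0.0,
    "Faecalibacterium_prausnitzii": 0.57,
    "Akkermansia_muciniphila": 0.01,
}

# Binary profile that a classifier would receive instead of the abundances.
binary = to_presence_absence(sample)
```

A stricter cutoff, for example `to_presence_absence(sample, threshold=0.05)`, would also mark low-abundance taxa as absent; which cutoff is appropriate depends on the profiling tool and is not specified here.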
|
25
|
Rashid J, Batool S, Kim J, Wasif Nisar M, Hussain A, Juneja S, Kushwaha R. An Augmented Artificial Intelligence Approach for Chronic Diseases Prediction. Front Public Health 2022; 10:860396. [PMID: 35433587 PMCID: PMC9008324 DOI: 10.3389/fpubh.2022.860396] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 01/22/2022] [Accepted: 02/22/2022] [Indexed: 12/23/2022]
Abstract
Chronic diseases are increasing in prevalence and mortality worldwide. Early diagnosis has therefore become an important research area to enhance patient survival rates. Several research studies have reported classification approaches for specific disease prediction. In this paper, we propose a novel augmented artificial intelligence approach using an artificial neural network (ANN) with particle swarm optimization (PSO) to predict five prevalent chronic diseases including breast cancer, diabetes, heart attack, hepatitis, and kidney disease. Seven classification algorithms are compared to evaluate the proposed model's prediction performance. The ANN prediction model constructed with a PSO based feature extraction approach outperforms other state-of-the-art classification approaches when evaluated with accuracy. Our proposed approach gave the highest accuracy of 99.67%, with the PSO. However, the classification model's performance is found to depend on the attributes of data used for classification. Our results are compared with various chronic disease datasets and shown to outperform other benchmark approaches. In addition, our optimized ANN processing is shown to require less time compared to random forest (RF), deep learning and support vector machine (SVM) based methods. Our study could play a role for early diagnosis of chronic diseases in hospitals, including through development of online diagnosis systems.
Affiliation(s)
- Junaid Rashid
  - Department of Computer Science and Engineering, Kongju National University, Cheonan, South Korea
- Saba Batool
  - Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Jungeun Kim
  - Department of Computer Science and Engineering, Kongju National University, Cheonan, South Korea
  - *Correspondence: Jungeun Kim
- Muhammad Wasif Nisar
  - Department of Computer Science, COMSATS University Islamabad, Islamabad, Pakistan
- Amir Hussain
  - Data Science and Cyber Analytics Research Group, Edinburgh Napier University, Edinburgh, United Kingdom
- Sapna Juneja
  - Department of Computer Science, KIET Group of Institutions, Ghaziabad, India
- Riti Kushwaha
  - Department of Computer Science, Bennett University, Greater Noida, India
|
26
|
Abstract
Microbes can form complex communities that perform critical functions in maintaining the integrity of their environment or their hosts' well-being. Rationally managing these microbial communities requires improving our ability to predict how different species assemblages affect the final species composition of the community. However, making such a prediction remains challenging because of our limited knowledge of the diverse physical, biochemical, and ecological processes governing microbial dynamics. To overcome this challenge, we present a deep learning framework that automatically learns the map between species assemblages and community compositions from training data only, without knowing any of the above processes. First, we systematically validate our framework using synthetic data generated by classical population dynamics models. Then, we apply our framework to data from in vitro and in vivo microbial communities, including ocean and soil microbiota, Drosophila melanogaster gut microbiota, and human gut and oral microbiota. We find that our framework learns to perform accurate out-of-sample predictions of complex community compositions from a small number of training samples. Our results demonstrate how deep learning can enable us to understand better and potentially manage complex microbial communities.
Affiliation(s)
- Sebastian Michel-Mata
  - Center for Applied Physics and Advanced Technology, Universidad Nacional Autónoma de México, Juriquilla 76230, México
  - Department of Ecology and Evolutionary Biology, Princeton University, Princeton, NJ 08544, USA
- Xu-Wen Wang
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA
- Yang-Yu Liu
  - Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts 02115, USA
  - Correspondence: Yang-Yu Liu; Marco Tulio Angulo
- Marco Tulio Angulo
  - CONACyT - Institute of Mathematics, Universidad Nacional Autónoma de México, Juriquilla 76230, México
  - Correspondence: Yang-Yu Liu; Marco Tulio Angulo
|
27
|
Curry KD, Nute MG, Treangen TJ. It takes guts to learn: machine learning techniques for disease detection from the gut microbiome. Emerg Top Life Sci 2021; 5:815-827. [PMID: 34779841 PMCID: PMC8786294 DOI: 10.1042/etls20210213] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 08/17/2021] [Revised: 09/29/2021] [Accepted: 10/06/2021] [Indexed: 02/01/2023]
Abstract
Associations between the human gut microbiome and expression of host illness have been noted in a variety of conditions ranging from gastrointestinal dysfunctions to neurological deficits. Machine learning (ML) methods have generated promising results for disease prediction from gut metagenomic information for diseases including liver cirrhosis and irritable bowel disease, but have lacked efficacy when predicting other illnesses. Here, we review current ML methods designed for disease classification from microbiome data. We highlight the computational challenges these methods have effectively overcome and discuss the biological components that have been overlooked to offer perspectives on future work in this area.
Affiliation(s)
- Kristen D. Curry
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
- Michael G. Nute
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
- Todd J. Treangen
  - Department of Computer Science, Rice University, Houston, TX 77005, USA
|
28
|
Narayana JK, Mac Aogáin M, Goh WWB, Xia K, Tsaneva-Atanasova K, Chotirmall SH. Mathematical-based microbiome analytics for clinical translation. Comput Struct Biotechnol J 2021; 19:6272-6281. [PMID: 34900137 PMCID: PMC8637001 DOI: 10.1016/j.csbj.2021.11.029] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 09/25/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/20/2022]
Abstract
Traditionally, human microbiology has been strongly built on the laboratory focused culture of microbes isolated from human specimens in patients with acute or chronic infection. These approaches primarily view human disease through the lens of a single species and its relevant clinical setting however such approaches fail to account for the surrounding environment and wide microbial diversity that exists in vivo. Given the emergence of next generation sequencing technologies and advancing bioinformatic pipelines, researchers now have unprecedented capabilities to characterise the human microbiome in terms of its taxonomy, function, antibiotic resistance and even bacteriophages. Despite this, an analysis of microbial communities has largely been restricted to ordination, ecological measures, and discriminant taxa analysis. This is predominantly due to a lack of suitable computational tools to facilitate microbiome analytics. In this review, we first evaluate the key concerns related to the inherent structure of microbiome datasets which include its compositionality and batch effects. We describe the available and emerging analytical techniques including integrative analysis, machine learning, microbial association networks, topological data analysis (TDA) and mathematical modelling. We also present how these methods may translate to clinical settings including tools for implementation. Mathematical based analytics for microbiome analysis represents a promising avenue for clinical translation across a range of acute and chronic disease states.
Affiliation(s)
- Jayanth Kumar Narayana
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
- Micheál Mac Aogáin
  - Biochemical Genetics Laboratory, Department of Biochemistry, St. James’s Hospital, Dublin, Ireland
  - Clinical Biochemistry Unit, School of Medicine, Trinity College Dublin, Dublin, Ireland
- Wilson Wen Bin Goh
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - School of Biological Sciences, Nanyang Technological University, Singapore, Singapore
- Kelin Xia
  - Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University, Singapore, Singapore
- Krasimira Tsaneva-Atanasova
  - Department of Mathematics & Living Systems Institute, College of Engineering, Mathematics and Physical Sciences, University of Exeter, Exeter EX4 4QF, UK
- Sanjay H. Chotirmall
  - Lee Kong Chian School of Medicine, Nanyang Technological University, Singapore, Singapore
  - Department of Respiratory and Critical Care Medicine, Tan Tock Seng Hospital, Singapore
|
29
|
Abstract
Unraveling the association between microbiome and plant phenotype can illustrate the effect of microbiome on host and then guide the agriculture management. Adequate identification of species and appropriate choice of models are two challenges in microbiome data analysis. Computational models of microbiome data could help in association analysis between the microbiome and plant host. The deep learning methods have been widely used to learn the microbiome data due to their powerful strength of handling the complex, sparse, noisy, and high-dimensional data. Here, we review the analytic strategies in the microbiome data analysis and describe the applications of deep learning models for plant–microbiome correlation studies. We also introduce the application cases of different models in plant–microbiome correlation analysis and discuss how to adapt the models on the critical steps in data processing. From the aspect of data processing manner, model structure, and operating principle, most deep learning models are suitable for the plant microbiome data analysis. The ability of feature representation and pattern recognition is the advantage of deep learning methods in modeling and interpretation for association analysis. Based on published computational experiments, the convolutional neural network and graph neural networks could be recommended for plant microbiome analysis.
Affiliation(s)
- Zhiyu Deng
  - Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China; University of Chinese Academy of Sciences, Beijing, China
- Jinming Zhang
  - Department of Infectious Diseases, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Junya Li
  - Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China; University of Chinese Academy of Sciences, Beijing, China
- Xiujun Zhang
  - Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China; Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
|
30
|
Gao J, Zhang X, Tian L, Liu Y, Wang J, Li Z, Hu X. MTGNN: Multi-Task Graph Neural Network based few-shot learning for disease similarity measurement. Methods 2021; 198:88-95. [PMID: 34700014 DOI: 10.1016/j.ymeth.2021.10.005] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 04/05/2021] [Revised: 10/16/2021] [Accepted: 10/18/2021] [Indexed: 11/24/2022]
Abstract
Similar diseases are usually caused by molecular origins or similar phenotypes. Confirming the relationship between diseases can help researchers gain a deep insight of the pathogenic mechanisms of emerging complex diseases, and improve the corresponding diagnoses and treatment. Therefore, similar diseases are considerably important in biology and pathology. However, the insufficient number of labelled similar disease pairs cannot support the optimal training of the models. In this paper, we propose a Multi-Task Graph Neural Network (MTGNN) framework to measure disease similarity by few-shot learning. To tackle the problem of insufficient number of labelled similar disease pairs, we design the multi-task optimization strategy to train the graph neural network for disease similarity task (lack of labelled training data) by introducing link prediction task (sufficient labelled training data). The similarity between diseases can then be obtained by measuring the distance between disease embeddings in high-dimensional space learning from the double tasks. The experiment results evaluate the performance of MTGNN and illustrate its advantages over previous methods on few labeled training dataset.
Affiliation(s)
- Jianliang Gao
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Xiangchi Zhang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Ling Tian
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Yuxin Liu
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Jianxin Wang
  - School of Computer Science and Engineering, Central South University, Changsha 410083, China
- Zhao Li
  - Alibaba Group, Hangzhou 310000, China
- Xiaohua Hu
  - College of Computing & Informatics, Drexel University, Philadelphia, PA 19104, USA
|
31
|
Zhao Z, Woloszynek S, Agbavor F, Mell JC, Sokhansanj BA, Rosen GL. Learning, visualizing and exploring 16S rRNA structure using an attention-based deep neural network. PLoS Comput Biol 2021; 17:e1009345. [PMID: 34550967 PMCID: PMC8496832 DOI: 10.1371/journal.pcbi.1009345] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 09/14/2020] [Revised: 10/07/2021] [Accepted: 08/12/2021] [Indexed: 01/04/2023]
Abstract
Recurrent neural networks with memory and attention mechanisms are widely used in natural language processing because they can capture short and long term sequential information for diverse tasks. We propose an integrated deep learning model for microbial DNA sequence data, which exploits convolutional neural networks, recurrent neural networks, and attention mechanisms to predict taxonomic classifications and sample-associated attributes, such as the relationship between the microbiome and host phenotype, on the read/sequence level. In this paper, we develop this novel deep learning approach and evaluate its application to amplicon sequences. We apply our approach to short DNA reads and full sequences of 16S ribosomal RNA (rRNA) marker genes, which identify the heterogeneity of a microbial community sample. We demonstrate that our implementation of a novel attention-based deep network architecture, Read2Pheno, achieves read-level phenotypic prediction. Training Read2Pheno models will encode sequences (reads) into dense, meaningful representations: learned embedded vectors output from the intermediate layer of the network model, which can provide biological insight when visualized. The attention layer of Read2Pheno models can also automatically identify nucleotide regions in reads/sequences which are particularly informative for classification. As such, this novel approach can avoid pre/post-processing and manual interpretation required with conventional approaches to microbiome sequence classification. We further show, as proof-of-concept, that aggregating read-level information can robustly predict microbial community properties, host phenotype, and taxonomic classification, with performance at least comparable to conventional approaches. An implementation of the attention-based deep learning network is available at https://github.com/EESI/sequence_attention (a python package) and https://github.com/EESI/seq2att (a command line tool).
Affiliation(s)
- Zhengqiao Zhao
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Stephen Woloszynek
  - Beth Israel Deaconess Medical Center, Boston, Massachusetts, United States of America
  - Harvard Medical School, Boston, Massachusetts, United States of America
- Felix Agbavor
  - School of Biomedical Engineering, Science and Health Systems, Drexel University, Philadelphia, Pennsylvania, United States of America
- Joshua Chang Mell
  - College of Medicine, Drexel University, Philadelphia, Pennsylvania, United States of America
- Bahrad A. Sokhansanj
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
- Gail L. Rosen
  - Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Department of Electrical and Computer Engineering, College of Engineering, Drexel University, Philadelphia, Pennsylvania, United States of America
|
32
|
Sun Q, Peng Y, Liu J. A reference-free approach for cell type classification with scRNA-seq. iScience 2021; 24:102855. [PMID: 34381979 PMCID: PMC8335627 DOI: 10.1016/j.isci.2021.102855] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/21/2021] [Revised: 05/07/2021] [Accepted: 07/08/2021] [Indexed: 11/29/2022]
Abstract
Single-cell RNA sequencing (scRNA-seq) has become a revolutionary technology to characterize cells under different biological conditions. Unlike bulk RNA-seq, gene expression from scRNA-seq is highly sparse due to limited sequencing depth per cell. This is worsened by tossing away a significant portion of reads that attribute to gene quantification. To overcome data sparsity and fully utilize original reads, we propose scSimClassify, a reference-free and alignment-free approach to classify cell types with k-mer level features. The compressed k-mer groups (CKGs), identified by the simhash method, contain k-mers with similar abundance profiles and serve as the cells’ features. Our experiments demonstrate that CKG features lend themselves to better performance than gene expression features in scRNA-seq classification accuracy in the majority of experimental cases. Because CKGs are derived from raw reads without alignment to reference genome, scSimClassify offers an effective alternative to existing methods especially when reference genome is incomplete or insufficient to represent subject genomes.
- Compressed k-mer groups (CKGs) are used to classify cell types without references
- CKGs are competitive to gene expression features for cell type classification
- CKGs are associated with genes sharing gene specific k-mers
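The idea of grouping k-mers with similar abundance profiles by simhash can be sketched roughly as follows. This is not the scSimClassify implementation: it assumes a random-hyperplane simhash over per-cell abundance vectors, and the function names, the number of bits, the seed, and the toy k-mer profiles are all hypothetical.

```python
import random


def simhash_signature(vector, hyperplanes):
    """One bit per random hyperplane: 1 if the vector lies on its non-negative side."""
    return tuple(
        int(sum(v * h for v, h in zip(vector, plane)) >= 0)
        for plane in hyperplanes
    )


def group_kmers(profiles, n_bits=8, seed=0):
    """Group k-mers whose abundance profiles share the same simhash signature.

    `profiles` maps each k-mer to its abundance across cells (equal-length lists).
    """
    rng = random.Random(seed)
    dim = len(next(iter(profiles.values())))
    hyperplanes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_bits)]
    groups = {}
    for kmer, vector in profiles.items():
        groups.setdefault(simhash_signature(vector, hyperplanes), []).append(kmer)
    return groups


# Toy abundance profiles over four cells (illustrative values only).
profiles = {
    "ACGT": [4, 0, 2, 0],
    "CGTA": [8, 0, 4, 0],  # proportional to ACGT, so it receives the same signature
    "TTGA": [0, 5, 0, 3],
}
groups = group_kmers(profiles)
```

Because the signature depends only on the direction of the abundance vector, k-mers with proportional profiles always fall into the same group; each group could then be summarized into a single feature per cell.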
Affiliation(s)
- Qi Sun
  - Department of Computer Science, University of Kentucky, Lexington, KY, 40508, USA
- Yifan Peng
  - Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, USA
- Jinze Liu
  - Department of Biostatistics, Virginia Commonwealth University, Richmond, VA 23298, USA
|
33
|
Sharma D, Xu W. phyLoSTM: a novel deep learning model on disease prediction from longitudinal microbiome data. Bioinformatics 2021; 37:3707-3714. [PMID: 34213529 DOI: 10.1093/bioinformatics/btab482] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 03/19/2021] [Revised: 05/24/2021] [Accepted: 06/30/2021] [Indexed: 11/12/2022]
Abstract
MOTIVATION Research shows that human microbiome is highly dynamic on longitudinal timescales, changing dynamically with diet, or due to medical interventions. In this paper, we propose a novel deep learning framework "phyLoSTM", using a combination of Convolutional Neural Networks and Long Short Term Memory Networks (LSTM) for feature extraction and analysis of temporal dependency in longitudinal microbiome sequencing data along with host's environmental factors for disease prediction. Additional novelty in terms of handling variable timepoints in subjects through LSTMs, as well as, weight balancing between imbalanced cases and controls is proposed. RESULTS We simulated 100 datasets across multiple time points for model testing. To demonstrate the model's effectiveness, we also implemented this novel method into two real longitudinal human microbiome studies: (i) DIABIMMUNE three country cohort with food allergy outcomes (Milk, Egg, Peanut and Overall) (ii) DiGiulio study with preterm delivery as outcome. Extensive analysis and comparison of our approach yields encouraging performance with an AUC of 0.897 (increased by 5%) on simulated studies and AUCs of 0.762 (increased by 19%) and 0.713 (increased by 8%) on the two real longitudinal microbiome studies respectively, as compared to the next best performing method, Random Forest. The proposed methodology improves predictive accuracy on longitudinal human microbiome studies containing spatially correlated data, and evaluates the change of microbiome composition contributing to outcome prediction. AVAILABILITY AND IMPLEMENTATION https://github.com/divya031090/phyLoSTM.
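Weight balancing between imbalanced cases and controls can be illustrated with a short Python sketch. The abstract does not say which weighting scheme phyLoSTM uses; the inverse-frequency rule below (weight = n / (k × n_c), for n samples, k classes and n_c samples in class c) is a common convention assumed here, and the function name and labels are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical cohort with 8 controls and 2 cases.
labels = ["control"] * 8 + ["case"] * 2
weights = balanced_class_weights(labels)
```

With these weights, each misclassified case contributes four times as much to a weighted loss as a misclassified control, which offsets the 4:1 class imbalance.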
Affiliation(s)
- Divya Sharma
  - Princess Margaret Cancer Center, University Health Network, Toronto, Ontario, Canada
- Wei Xu
  - Princess Margaret Cancer Center, University Health Network, Toronto, Ontario, Canada; Division of Biostatistics, Dalla Lana School of Public Health, University of Toronto, Toronto, Ontario, Canada
|
34
|
García-Jiménez B, Muñoz J, Cabello S, Medina J, Wilkinson MD. Predicting microbiomes through a deep latent space. Bioinformatics 2021; 37:1444-1451. [PMID: 33289510 PMCID: PMC8208755 DOI: 10.1093/bioinformatics/btaa971] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 07/17/2020] [Revised: 10/21/2020] [Accepted: 11/06/2020] [Indexed: 12/28/2022]
Abstract
Motivation Microbial communities influence their environment by modifying the availability of compounds, such as nutrients or chemical elicitors. Knowing the microbial composition of a site is therefore relevant to improve productivity or health. However, sequencing facilities are not always available, or may be prohibitively expensive in some cases. Thus, it would be desirable to computationally predict the microbial composition from more accessible, easily-measured features. Results Integrating deep learning techniques with microbiome data, we propose an artificial neural network architecture based on heterogeneous autoencoders to condense the long vector of microbial abundance values into a deep latent space representation. Then, we design a model to predict the deep latent space and, consequently, to predict the complete microbial composition using environmental features as input. The performance of our system is examined using the rhizosphere microbiome of Maize. We reconstruct the microbial composition (717 taxa) from the deep latent space (10 values) with high fidelity (>0.9 Pearson correlation). We then successfully predict microbial composition from environmental variables, such as plant age, temperature or precipitation (0.73 Pearson correlation, 0.42 Bray–Curtis). We extend this to predict microbiome composition under hypothetical scenarios, such as future climate change conditions. Finally, via transfer learning, we predict microbial composition in a distinct scenario with only 100 sequences, and distinct environmental features. We propose that our deep latent space may assist microbiome-engineering strategies when technical or financial resources are limited, through predicting current or future microbiome compositions. Availability and implementation Software, results and data are available at https://github.com/jorgemf/DeepLatentMicrobiome Supplementary information Supplementary data are available at Bioinformatics online.
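The two evaluation measures quoted in this abstract, Pearson correlation and Bray–Curtis dissimilarity, can be computed between an observed and a predicted composition as in the sketch below. It uses the standard textbook definitions; whether the authors applied exactly these variants (for example, per sample or averaged over taxa) is an assumption, and the two four-taxon profiles are invented for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    norm_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    norm_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (norm_x * norm_y)


def bray_curtis(x, y):
    """Bray-Curtis dissimilarity: sum |x_i - y_i| / sum (x_i + y_i)."""
    return sum(abs(a - b) for a, b in zip(x, y)) / sum(a + b for a, b in zip(x, y))


# Hypothetical observed and predicted relative abundances for four taxa.
observed = [0.5, 0.3, 0.2, 0.0]
predicted = [0.4, 0.3, 0.2, 0.1]

r = pearson(observed, predicted)
bc = bray_curtis(observed, predicted)
```

A higher Pearson correlation and a lower Bray–Curtis dissimilarity both indicate a closer match; identical profiles give a correlation of 1 and a dissimilarity of 0.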
Affiliation(s)
- Beatriz García-Jiménez
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Jorge Muñoz
  - Serendeepia Research, 28905 Getafe (Madrid), Spain
- Sara Cabello
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Joaquín Medina
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain
- Mark D Wilkinson
  - Centro de Biotecnología y Genómica de Plantas (CBGP, UPM-INIA), Universidad Politécnica de Madrid (UPM) - Instituto Nacional de Investigación y Tecnología Agraria y Alimentaria (INIA), Campus de Montegancedo-UPM, 28223-Pozuelo de Alarcón, Madrid, Spain; Departamento de Biotecnología-Biología Vegetal, Escuela Técnica Superior de Ingeniería Agronómica, Alimentaria y de Biosistemas, Universidad Politécnica de Madrid (UPM), Madrid, Spain
|
35
|
Chen X, Liu L, Zhang W, Yang J, Wong KC. Human host status inference from temporal microbiome changes via recurrent neural networks. Brief Bioinform 2021; 22:6307015. [PMID: 34151933 DOI: 10.1093/bib/bbab223] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 03/24/2021] [Revised: 04/21/2021] [Accepted: 04/21/2021] [Indexed: 01/04/2023]
Abstract
With the rapid increase in sequencing data, human host status inference (e.g. healthy or sick) from microbiome data has become an important issue. Existing studies are mostly based on single-point microbiome composition, while it is rare that the host status is predicted from longitudinal microbiome data. However, single-point-based methods cannot capture the dynamic patterns between the temporal changes and host status. Therefore, it remains challenging to build good predictive models as well as scaling to different microbiome contexts. On the other hand, existing methods are mainly targeted for disease prediction and seldom investigate other host statuses. To fill the gap, we propose a comprehensive deep learning-based framework that utilizes longitudinal microbiome data as input to infer the human host status. Specifically, the framework is composed of specific data preparation strategies and a recurrent neural network tailored for longitudinal microbiome data. In experiments, we evaluated the proposed method on both semi-synthetic and real datasets based on different sequencing technologies and metagenomic contexts. The results indicate that our method achieves robust performance compared to other baseline and state-of-the-art classifiers and provides a significant reduction in prediction time.
Affiliation(s)
- Xingjian Chen
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Lingjing Liu
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Weitong Zhang
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
- Jianyi Yang
  - School of Mathematical Sciences, Nankai University, Tianjin, China
- Ka-Chun Wong
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong SAR
|
36
|
DiMucci D, Kon M, Segrè D. BowSaw: Inferring Higher-Order Trait Interactions Associated With Complex Biological Phenotypes. Front Mol Biosci 2021; 8:663532. [PMID: 34222331 PMCID: PMC8245782 DOI: 10.3389/fmolb.2021.663532] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/04/2021] [Accepted: 05/24/2021] [Indexed: 11/15/2022]
Abstract
Machine learning is helping the interpretation of biological complexity by enabling the inference and classification of cellular, organismal and ecological phenotypes based on large datasets, e.g., from genomic, transcriptomic and metagenomic analyses. A number of available algorithms can help search these datasets to uncover patterns associated with specific traits, including disease-related attributes. While, in many instances, treating an algorithm as a black box is sufficient, it is interesting to pursue an enhanced understanding of how system variables end up contributing to a specific output, as an avenue toward new mechanistic insight. Here we address this challenge through a suite of algorithms, named BowSaw, which takes advantage of the structure of a trained random forest algorithm to identify combinations of variables (“rules”) frequently used for classification. We first apply BowSaw to a simulated dataset and show that the algorithm can accurately recover the sets of variables used to generate the phenotypes through complex Boolean rules, even under challenging noise levels. We next apply our method to data from the integrative Human Microbiome Project and find previously unreported high-order combinations of microbial taxa putatively associated with Crohn’s disease. By leveraging the structure of trees within a random forest, BowSaw provides a new way of using decision trees to generate testable biological hypotheses.
Affiliation(s)
- Demetrius DiMucci
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States; Biological Design Center, Boston University, Boston, MA, United States
| | - Mark Kon
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States; Department of Mathematics and Statistics, Boston University, Boston, MA, United States
| | - Daniel Segrè
- Bioinformatics Graduate Program, Boston University, Boston, MA, United States; Biological Design Center, Boston University, Boston, MA, United States; Department of Biology, Boston University, Boston, MA, United States; Department of Biomedical Engineering, Boston University, Boston, MA, United States; Department of Physics, Boston University, Boston, MA, United States
| |
|
37
|
Khan K, Ramsahai E. Maintaining proper health records improves machine learning predictions for novel 2019-nCoV. BMC Med Inform Decis Mak 2021; 21:172. [PMID: 34044839 PMCID: PMC8159067 DOI: 10.1186/s12911-021-01537-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2020] [Accepted: 05/23/2021] [Indexed: 11/19/2022] Open
Abstract
Background An ongoing outbreak of a novel coronavirus (2019-nCoV) pneumonia continues to affect the whole world including major countries such as China, USA, Italy, France and the United Kingdom. We present outcome (‘recovered’, ‘isolated’ or ‘death’) risk estimates of 2019-nCoV over ‘early’ datasets. A major consideration is the likelihood of death for patients with 2019-nCoV. Method Accounting for the impact of the variations in the reporting rate of 2019-nCoV, we used machine learning techniques (AdaBoost, bagging, extra-trees, decision trees and k-nearest neighbour classifiers) on two 2019-nCoV datasets obtained from Kaggle on March 30, 2020. We used ‘country’, ‘age’ and ‘gender’ as features to predict outcome for both datasets. We included the patient’s ‘disease’ history (only present in the second dataset) to predict the outcome for the second dataset. Results The use of a patient’s ‘disease’ history improves the prediction of ‘death’ by more than sevenfold. The models ignoring a patient’s ‘disease’ history performed poorly in test predictions. Conclusion Our findings indicate the potential of using a patient’s ‘disease’ history as part of the feature set in machine learning techniques to improve 2019-nCoV predictions. This development can have a positive effect on predictive patient treatment and can result in easing currently overburdened healthcare systems worldwide, especially with the increasing prevalence of second and third wave re-infections in some countries.
Affiliation(s)
- Koffka Khan
- Department of Computing and Information Technology, The University of the West Indies, St. Augustine, Trinidad and Tobago.
| | - Emilie Ramsahai
- UWI School of Business & Applied Studies Ltd (UWI-ROYTEC), 136-138 Henry Street, 24105, Port of Spain, Trinidad and Tobago
| |
|
38
|
Wu S, Chen Y, Li Z, Li J, Zhao F, Su X. Towards multi-label classification: Next step of machine learning for microbiome research. Comput Struct Biotechnol J 2021; 19:2742-2749. [PMID: 34093989 PMCID: PMC8131981 DOI: 10.1016/j.csbj.2021.04.054] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 02/24/2021] [Revised: 04/21/2021] [Accepted: 04/22/2021] [Indexed: 11/22/2022] Open
Abstract
Machine learning (ML) has been widely used in microbiome research for biomarker selection and disease prediction. By training on microbial profiles of samples from patients and healthy controls, ML classifiers construct data models from community features that are highly correlated with the target diseases, so as to determine the status of new samples. To clearly understand the host-microbe interaction of specific diseases, previous studies have generally focused on well-designed cohorts, in which each sample was labeled with exactly one status type. In fact, however, an individual may be associated with multiple diseases simultaneously, which introduces additional variation in microbial patterns that interferes with status detection. More importantly, comorbidities or complications can be missed by regular ML models, limiting the practical application of microbiome techniques. In this review, we summarize the typical ML approaches of single-label classification for microbiome research, and demonstrate their limitations in multi-label disease detection using a real dataset. We then look ahead to a further step of ML towards multi-label classification that could potentially solve the aforementioned problem, including a series of promising strategies and key technical issues for applying multi-label classification in microbiome-based studies.
Affiliation(s)
- Shunyao Wu
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
| | - Yuzhu Chen
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
| | - Zhiruo Li
- School of Mathematics and Statistics, Qingdao University, Qingdao, Shandong 266071, China
| | - Jian Li
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
| | - Fengyang Zhao
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
| | - Xiaoquan Su
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
| |
|
39
|
Zhang W, Chen X, Wong KC. Noninvasive early diagnosis of intestinal diseases based on artificial intelligence in genomics and microbiome. J Gastroenterol Hepatol 2021; 36:823-831. [PMID: 33880763 DOI: 10.1111/jgh.15500] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 01/10/2021] [Revised: 03/15/2021] [Accepted: 03/17/2021] [Indexed: 12/15/2022]
Abstract
The maturing development in artificial intelligence (AI) and genomics has propelled advances in the study of intestinal diseases including intestinal cancer, inflammatory bowel disease (IBD), and irritable bowel syndrome (IBS). Colorectal cancer is the second most deadly and the third most common type of cancer in the world according to GLOBOCAN 2020 data, while the mechanisms behind IBD and IBS are still speculative. The conventional methods to identify colorectal cancer, IBD, and IBS are based on endoscopy or colonoscopy to identify lesions. However, these procedures are invasive, demanding, and time-consuming for early-stage intestinal diseases. To address those problems, new strategies based on blood and/or the human microbiome in the gut, colon, or even feces were developed; those methods took advantage of high-throughput sequencing and machine learning approaches. In this review, we summarize the recent research and methods to diagnose intestinal diseases with machine learning technologies based on cell-free DNA and microbiome data generated by amplicon sequencing or whole-genome sequencing. Those methods play an important role in not only intestinal disease diagnosis but also therapy development in the near future.
Affiliation(s)
- Weitong Zhang
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Xingjian Chen
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| | - Ka-Chun Wong
- Department of Computer Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR; Hong Kong Institute for Data Science, City University of Hong Kong, Kowloon Tong, Hong Kong SAR
| |
|
40
|
Wei ZG, Zhang XD, Cao M, Liu F, Qian Y, Zhang SW. Comparison of Methods for Picking the Operational Taxonomic Units From Amplicon Sequences. Front Microbiol 2021; 12:644012. [PMID: 33841367 PMCID: PMC8024490 DOI: 10.3389/fmicb.2021.644012] [Citation(s) in RCA: 14] [Impact Index Per Article: 4.7] [Received: 12/19/2020] [Accepted: 02/17/2021] [Indexed: 12/31/2022] Open
Abstract
With the advent of next-generation sequencing technology, it has become convenient and cost-efficient to thoroughly characterize the microbial diversity and taxonomic composition in various environmental samples. Millions of sequences can be generated, and how to utilize this enormous sequence resource has become a critical concern for microbial ecologists. One particular challenge is picking operational taxonomic units (OTUs) in 16S rRNA sequence analysis. Fortunately, this challenge can be directly addressed by sequence clustering, which attempts to group similar sequences. Therefore, numerous clustering methods have been proposed to help cluster 16S rRNA sequences into OTUs. However, each method has its own clustering mechanism, and different methods produce diverse outputs. Even a slight parameter change for the same method can generate distinct results, and how to choose an appropriate method has become a challenge for inexperienced users. A lot of time and resources can be wasted in selecting clustering tools and analyzing the clustering results. In this study, we review recent advances in clustering methods for OTU picking, focusing mainly on three aspects: (i) the principles of existing clustering algorithms, (ii) benchmark dataset construction for OTU picking and evaluation metrics, and (iii) the performance of different methods with various distance thresholds on benchmark datasets. This paper aims to help biological researchers select suitable clustering methods for analyzing their collected sequences and to help algorithm developers design more efficient sequence clustering methods.
Affiliation(s)
- Ze-Gang Wei
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
| | - Xiao-Dan Zhang
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
| | - Ming Cao
- Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
- School of Mathematics and Statistics, Shaanxi Xueqian Normal University, Xi’an, China
| | - Fei Liu
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
| | - Yu Qian
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
| | - Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
| |
|
41
|
Moreno-Indias I, Lahti L, Nedyalkova M, Elbere I, Roshchupkin G, Adilovic M, Aydemir O, Bakir-Gungor B, Santa Pau ECD, D’Elia D, Desai MS, Falquet L, Gundogdu A, Hron K, Klammsteiner T, Lopes MB, Marcos-Zambrano LJ, Marques C, Mason M, May P, Pašić L, Pio G, Pongor S, Promponas VJ, Przymus P, Saez-Rodriguez J, Sampri A, Shigdel R, Stres B, Suharoschi R, Truu J, Truică CO, Vilne B, Vlachakis D, Yilmaz E, Zeller G, Zomer AL, Gómez-Cabrero D, Claesson MJ. Statistical and Machine Learning Techniques in Human Microbiome Studies: Contemporary Challenges and Solutions. Front Microbiol 2021; 12:635781. [PMID: 33692771 PMCID: PMC7937616 DOI: 10.3389/fmicb.2021.635781] [Citation(s) in RCA: 39] [Impact Index Per Article: 13.0] [Received: 11/30/2020] [Accepted: 01/28/2021] [Indexed: 12/23/2022] Open
Abstract
The human microbiome has emerged as a central research topic in human biology and biomedicine. Current microbiome studies generate high-throughput omics data across different body sites, populations, and life stages. Many of the challenges in microbiome research are similar to those in other high-throughput studies; in addition, the quantitative analyses need to address the heterogeneity of data, specific statistical properties, and the remarkable variation in microbiome composition across individuals and body sites. This has led to a broad spectrum of statistical and machine learning challenges that range from study design, data processing, and standardization to analysis, modeling, cross-study comparison, prediction, data science ecosystems, and reproducible reporting. Although many statistical and machine learning approaches and tools have been developed, new techniques are needed to deal with emerging applications and the vast heterogeneity of microbiome data. We review and discuss emerging applications of statistical and machine learning techniques in human microbiome studies and introduce the COST Action CA18131 "ML4Microbiome" that brings together microbiome researchers and machine learning experts to address current challenges such as standardization of analysis pipelines for reproducibility of data analysis results, benchmarking, improvement, or development of existing and new tools and ontologies.
Affiliation(s)
- Isabel Moreno-Indias
- Instituto de Investigación Biomédica de Málaga (IBIMA), Unidad de Gestión Clínica de Endocrinología y Nutrición, Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
- Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
| | - Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
| | - Miroslava Nedyalkova
- Human Genetics and Disease Mechanisms, Latvian Biomedical Research and Study Centre, Riga, Latvia
| | - Ilze Elbere
- Latvian Biomedical Research and Study Centre, Riga, Latvia
| | | | - Muhamed Adilovic
- Department of Genetics and Bioengineering, International University of Sarajevo, Sarajevo, Bosnia and Herzegovina
| | - Onder Aydemir
- Department of Electrical and Electronics Engineering, Karadeniz Technical University, Trabzon, Turkey
| | - Burcu Bakir-Gungor
- Department of Computer Engineering, Abdullah Gul University, Kayseri, Turkey
| | | | - Domenica D’Elia
- Department for Biomedical Sciences, Institute for Biomedical Technologies, National Research Council, Bari, Italy
| | - Mahesh S. Desai
- Department of Infection and Immunity, Luxembourg Institute of Health, Esch-sur-Alzette, Luxembourg
- Odense Research Center for Anaphylaxis, Department of Dermatology and Allergy Center, Odense University Hospital, University of Southern Denmark, Odense, Denmark
| | - Laurent Falquet
- Department of Biology, University of Fribourg, Fribourg, Switzerland
- Swiss Institute of Bioinformatics, Lausanne, Switzerland
| | - Aycan Gundogdu
- Department of Microbiology and Clinical Microbiology, Faculty of Medicine, Erciyes University, Kayseri, Turkey
- Metagenomics Laboratory, Genome and Stem Cell Center (GenKök), Erciyes University, Kayseri, Turkey
| | - Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
| | | | - Marta B. Lopes
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
- Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
| | - Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | - Cláudia Marques
- CINTESIS, NOVA Medical School, NMS, Universidade Nova de Lisboa, Lisbon, Portugal
| | - Michael Mason
- Computational Oncology, Sage Bionetworks, Seattle, WA, United States
| | - Patrick May
- Bioinformatics Core, Luxembourg Centre for Systems Biomedicine, University of Luxembourg, Esch-sur-Alzette, Luxembourg
| | - Lejla Pašić
- Sarajevo Medical School, University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
| | - Gianvito Pio
- Department of Computer Science, University of Bari Aldo Moro, Bari, Italy
| | - Sándor Pongor
- Faculty of Information Technology and Bionics, Pázmány University, Budapest, Hungary
| | - Vasilis J. Promponas
- Bioinformatics Research Laboratory, Department of Biological Sciences, University of Cyprus, Nicosia, Cyprus
| | - Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
| | - Julio Saez-Rodriguez
- Institute of Computational Biomedicine, Heidelberg University, Faculty of Medicine and Heidelberg University Hospital, Heidelberg, Germany
| | - Alexia Sampri
- Division of Informatics, Imaging and Data Sciences, School of Health Sciences, University of Manchester, Manchester, United Kingdom
| | - Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Blaz Stres
- Jozef Stefan Institute, Ljubljana, Slovenia
- Biotechnical Faculty, University of Ljubljana, Ljubljana, Slovenia
- Faculty of Civil and Geodetic Engineering, University of Ljubljana, Ljubljana, Slovenia
| | - Ramona Suharoschi
- Molecular Nutrition and Proteomics Lab, Faculty of the Food Science and Technology, Institute of Life Sciences, University of Agricultural Sciences and Veterinary Medicine of Cluj-Napoca, Cluj-Napoca, Romania
| | - Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| | - Ciprian-Octavian Truică
- Department of Computer Science and Engineering, Faculty of Automatic Control and Computers, University Politehnica of Bucharest, Bucharest, Romania
| | - Baiba Vilne
- Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
| | - Dimitrios Vlachakis
- Laboratory of Genetics, Department of Biotechnology, School of Applied Biology and Biotechnology, Agricultural University of Athens, Athens, Greece
| | - Ercument Yilmaz
- Department of Computer Technologies, Karadeniz Technical University, Trabzon, Turkey
| | - Georg Zeller
- European Molecular Biology Laboratory, Structural and Computational Biology Unit, Heidelberg, Germany
| | - Aldert L. Zomer
- Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, Utrecht, Netherlands
| | - David Gómez-Cabrero
- Navarrabiomed, Complejo Hospitalario de Navarra (CHN), IdiSNA, Universidad Pública de Navarra (UPNA), Pamplona, Spain
| | - Marcus J. Claesson
- School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
| |
|
42
|
Marcos-Zambrano LJ, Karaduzovic-Hadziabdic K, Loncar Turukalo T, Przymus P, Trajkovik V, Aasmets O, Berland M, Gruca A, Hasic J, Hron K, Klammsteiner T, Kolev M, Lahti L, Lopes MB, Moreno V, Naskinova I, Org E, Paciência I, Papoutsoglou G, Shigdel R, Stres B, Vilne B, Yousef M, Zdravevski E, Tsamardinos I, Carrillo de Santa Pau E, Claesson MJ, Moreno-Indias I, Truu J. Applications of Machine Learning in Human Microbiome Studies: A Review on Feature Selection, Biomarker Identification, Disease Prediction and Treatment. Front Microbiol 2021; 12:634511. [PMID: 33737920 PMCID: PMC7962872 DOI: 10.3389/fmicb.2021.634511] [Citation(s) in RCA: 113] [Impact Index Per Article: 37.7] [Received: 11/27/2020] [Accepted: 02/01/2021] [Indexed: 12/19/2022] Open
Abstract
The number of microbiome-related studies has notably increased the availability of data on human microbiome composition and function. These studies provide the essential material to deeply explore host-microbiome associations and their relation to the development and progression of various complex diseases. Improved data-analytical tools are needed to exploit all information from these biological datasets, taking into account the peculiarities of microbiome data, i.e., compositional, heterogeneous and sparse nature of these datasets. The possibility of predicting host-phenotypes based on taxonomy-informed feature selection to establish an association between microbiome and predict disease states is beneficial for personalized medicine. In this regard, machine learning (ML) provides new insights into the development of models that can be used to predict outputs, such as classification and prediction in microbiology, infer host phenotypes to predict diseases and use microbial communities to stratify patients by their characterization of state-specific microbial signatures. Here we review the state-of-the-art ML methods and respective software applied in human microbiome studies, performed as part of the COST Action ML4Microbiome activities. This scoping review focuses on the application of ML in microbiome studies related to association and clinical use for diagnostics, prognostics, and therapeutics. Although the data presented here is more related to the bacterial community, many algorithms could be applied in general, regardless of the feature type. This literature and software review covering this broad topic is aligned with the scoping review methodology. 
The manual identification of data sources has been complemented with: (1) automated publication search through digital libraries of the three major publishers using natural language processing (NLP) Toolkit, and (2) an automated identification of relevant software repositories on GitHub and ranking of the related research papers relying on learning to rank approach.
Affiliation(s)
- Laura Judith Marcos-Zambrano
- Computational Biology Group, Precision Nutrition and Cancer Research Program, IMDEA Food Institute, Madrid, Spain
| | | | | | - Piotr Przymus
- Faculty of Mathematics and Computer Science, Nicolaus Copernicus University, Toruń, Poland
| | - Vladimir Trajkovik
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
| | - Oliver Aasmets
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
- Department of Biotechnology, Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| | - Magali Berland
- Université Paris-Saclay, INRAE, MGP, Jouy-en-Josas, France
| | - Aleksandra Gruca
- Department of Computer Networks and Systems, Silesian University of Technology, Gliwice, Poland
| | - Jasminka Hasic
- University Sarajevo School of Science and Technology, Sarajevo, Bosnia and Herzegovina
| | - Karel Hron
- Department of Mathematical Analysis and Applications of Mathematics, Palacký University, Olomouc, Czechia
| | | | - Mikhail Kolev
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
| | - Leo Lahti
- Department of Computing, University of Turku, Turku, Finland
| | - Marta B. Lopes
- NOVA Laboratory for Computer Science and Informatics (NOVA LINCS), FCT, UNL, Caparica, Portugal
- Centro de Matemática e Aplicações (CMA), FCT, UNL, Caparica, Portugal
| | - Victor Moreno
- Oncology Data Analytics Program, Catalan Institute of Oncology (ICO), Barcelona, Spain
- Colorectal Cancer Group, Institut de Recerca Biomedica de Bellvitge (IDIBELL), Barcelona, Spain
- Consortium for Biomedical Research in Epidemiology and Public Health (CIBERESP), Barcelona, Spain
- Department of Clinical Sciences, Faculty of Medicine, University of Barcelona, Barcelona, Spain
| | - Irina Naskinova
- South West University “Neofit Rilski”, Blagoevgrad, Bulgaria
| | - Elin Org
- Institute of Genomics, Estonian Genome Centre, University of Tartu, Tartu, Estonia
| | - Inês Paciência
- EPIUnit – Instituto de Saúde Pública da Universidade do Porto, Porto, Portugal
| | | | - Rajesh Shigdel
- Department of Clinical Science, University of Bergen, Bergen, Norway
| | - Blaz Stres
- Group for Microbiology and Microbial Biotechnology, Department of Animal Science, University of Ljubljana, Ljubljana, Slovenia
| | - Baiba Vilne
- Bioinformatics Research Unit, Riga Stradins University, Riga, Latvia
| | - Malik Yousef
- Department of Information Systems, Zefat Academic College, Zefat, Israel
- Galilee Digital Health Research Center (GDH), Zefat Academic College, Zefat, Israel
| | - Eftim Zdravevski
- Faculty of Computer Science and Engineering, Ss. Cyril and Methodius University, Skopje, North Macedonia
| | | | | | - Marcus J. Claesson
- School of Microbiology & APC Microbiome Ireland, University College Cork, Cork, Ireland
| | - Isabel Moreno-Indias
- Unidad de Gestión Clínica de Endocrinología y Nutrición, Instituto de Investigación Biomédica de Málaga (IBIMA), Hospital Clínico Universitario Virgen de la Victoria, Universidad de Málaga, Málaga, Spain
- Centro de Investigación Biomédica en Red de Fisiopatología de la Obesidad y la Nutrición (CIBEROBN), Instituto de Salud Carlos III, Madrid, Spain
| | - Jaak Truu
- Institute of Molecular and Cell Biology, University of Tartu, Tartu, Estonia
| |
|
43
|
Li W, Liu H, Cheng F, Li Y, Li S, Yan J. Artificial intelligence applications for oncological positron emission tomography imaging. Eur J Radiol 2020; 134:109448. [PMID: 33307463 DOI: 10.1016/j.ejrad.2020.109448] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 07/20/2020] [Revised: 10/07/2020] [Accepted: 11/26/2020] [Indexed: 12/16/2022]
Abstract
Positron emission tomography (PET), a functional and dynamic molecular imaging technique, is generally used to reveal tumors' biological behavior. Radiomics allows a high-throughput extraction of multiple features from images with artificial intelligence (AI) approaches and develops rapidly worldwide. Quantitative and objective features of medical images have been explored to recognize reliable biomarkers, with the development of PET radiomics. This paper will review the current clinical exploration of PET-based classical machine learning and deep learning methods, including disease diagnosis, the prediction of histological subtype, gene mutation status, tumor metastasis, tumor relapse, therapeutic side effects, therapeutic intervention and evaluation of prognosis. The applications of AI in oncology will be mainly discussed. The image-guided biopsy or surgery assisted by PET-based AI will be introduced as well. This paper aims to present the applications and methods of AI for PET imaging, which may offer important details for further clinical studies. Relevant precautions are put forward and future research directions are suggested.
Affiliation(s)
- Wanting Li
- Shanxi Medical University, Taiyuan 030009, PR China; Department of Nuclear Medicine, First Hospital of Shanxi Medical University, Taiyuan 030001, PR China; Collaborative Innovation Center for Molecular Imaging, Taiyuan 030001, PR China
| | - Haiyan Liu
- Department of Nuclear Medicine, First Hospital of Shanxi Medical University, Taiyuan 030001, PR China; Collaborative Innovation Center for Molecular Imaging, Taiyuan 030001, PR China; Cellular Physiology Key Laboratory of Ministry of Education, Translational Medicine Research Center, Shanxi Medical University, Taiyuan 030001, PR China
| | - Feng Cheng
- Shanxi Medical University, Taiyuan 030009, PR China
| | - Yanhua Li
- Shanxi Medical University, Taiyuan 030009, PR China
| | - Sijin Li
- Shanxi Medical University, Taiyuan 030009, PR China; Department of Nuclear Medicine, First Hospital of Shanxi Medical University, Taiyuan 030001, PR China; Collaborative Innovation Center for Molecular Imaging, Taiyuan 030001, PR China.
| | - Jiangwei Yan
- Shanxi Medical University, Taiyuan 030009, PR China.
| |
|
44
|
Iadanza E, Fabbri R, Bašić-čičak D, Amedei A, Telalovic JH. Gut microbiota and artificial intelligence approaches: A scoping review. Health Technol 2020; 10:1343-58. [DOI: 10.1007/s12553-020-00486-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 12/19/2022]
Abstract
This article aims to provide a thorough overview of the use of Artificial Intelligence (AI) techniques in studying the gut microbiota and its role in the diagnosis and treatment of some important diseases. The association between microbiota and diseases, together with its clinical relevance, is still difficult to interpret. The advances in AI techniques, such as Machine Learning (ML) and Deep Learning (DL), can help clinicians in processing and interpreting these massive data sets. Two research groups have been involved in this Scoping Review, working in two different areas of Europe: Florence and Sarajevo. The papers included in the review describe the use of ML or DL methods applied to the study of human gut microbiota. In total, 1109 papers were considered in this study. After elimination, a final set of 16 articles was considered in the scoping review. Different AI techniques were applied in the reviewed papers. Some papers applied ML, while others applied DL techniques. Eleven papers evaluated only ML algorithms (ranging from one to eight algorithms applied to one dataset). The remaining five papers examined both ML and DL algorithms. The most applied ML algorithm was Random Forest, and it also exhibited the best performance.
|
45
|
Cammarota G, Ianiro G, Ahern A, Carbone C, Temko A, Claesson MJ, Gasbarrini A, Tortora G. Gut microbiome, big data and machine learning to promote precision medicine for cancer. Nat Rev Gastroenterol Hepatol 2020; 17:635-648. [PMID: 32647386 DOI: 10.1038/s41575-020-0327-3] [Citation(s) in RCA: 135] [Impact Index Per Article: 33.8] [Accepted: 06/02/2020] [Indexed: 12/13/2022]
Abstract
The gut microbiome has been implicated in cancer in several ways, as specific microbial signatures are known to promote cancer development and influence safety, tolerability and efficacy of therapies. The 'omics' technologies used for microbiome analysis continuously evolve and, although much of the research is still at an early stage, large-scale datasets of ever increasing size and complexity are being produced. However, there are varying levels of difficulty in realizing the full potential of these new tools, which limit our ability to critically analyse much of the available data. In this Perspective, we provide a brief overview on the role of gut microbiome in cancer and focus on the need, role and limitations of a machine learning-driven approach to analyse large amounts of complex health-care information in the era of big data. We also discuss the potential application of microbiome-based big data aimed at promoting precision medicine in cancer.
Affiliation(s)
- Giovanni Cammarota
  - Gastroenterology Department, Fondazione Policlinico Universitario Agostino Gemelli-IRCCS, Università Cattolica del Sacro Cuore, Rome, Italy
- Gianluca Ianiro
  - Gastroenterology Department, Fondazione Policlinico Universitario Agostino Gemelli-IRCCS, Università Cattolica del Sacro Cuore, Rome, Italy
- Anna Ahern
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
- Carmine Carbone
  - Oncology Department, Fondazione Policlinico Universitario Agostino Gemelli-IRCCS, Università Cattolica del Sacro Cuore, Rome, Italy
- Andriy Temko
  - School of Engineering, University College Cork, Cork, Ireland; Qualcomm ML R&D, Cork, Ireland
- Marcus J Claesson
  - School of Microbiology and APC Microbiome Ireland, University College Cork, Cork, Ireland
- Antonio Gasbarrini
  - Gastroenterology Department, Fondazione Policlinico Universitario Agostino Gemelli-IRCCS, Università Cattolica del Sacro Cuore, Rome, Italy
- Giampaolo Tortora
  - Oncology Department, Fondazione Policlinico Universitario Agostino Gemelli-IRCCS, Università Cattolica del Sacro Cuore, Rome, Italy
|
46
|
Su X, Jing G, Zhang Y, Wu S. Method development for cross-study microbiome data mining: Challenges and opportunities. Comput Struct Biotechnol J 2020; 18:2075-2080. [PMID: 32802279 PMCID: PMC7419250 DOI: 10.1016/j.csbj.2020.07.020] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/04/2020] [Revised: 07/22/2020] [Accepted: 07/24/2020] [Indexed: 01/26/2023] Open
Abstract
During the past decade, a tremendous amount of microbiome sequencing data has been generated to study the dynamic associations between microbial profiles and environments. How to precisely and efficiently decipher large-scale microbiome data and further take advantage of it has become one of the most essential bottlenecks for microbiome research at present. In this mini-review, we focus on the three key steps of analyzing cross-study microbiome datasets: microbiome profiling, data integration and data mining. By introducing the current bioinformatics approaches and discussing their limitations, we explore the opportunities in the development of computational methods for the three steps, and propose promising solutions to multi-omics data analysis for comprehensive understanding and rapid investigation of the microbiome from different angles, which could potentially promote data-driven research by providing a broader view of the "microbiome data space".
Affiliation(s)
- Xiaoquan Su
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
  - Single-Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101, China
- Gongchao Jing
  - Single-Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101, China
- Yufeng Zhang
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
  - Single-Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101, China
- Shunyao Wu
  - College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071, China
|
47
|
Seneviratne CJ, Balan P, Suriyanarayanan T, Lakshmanan M, Lee DY, Rho M, Jakubovics N, Brandt B, Crielaard W, Zaura E. Oral microbiome-systemic link studies: perspectives on current limitations and future artificial intelligence-based approaches. Crit Rev Microbiol 2020; 46:288-299. [PMID: 32434436 DOI: 10.1080/1040841x.2020.1766414] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 12/16/2022]
Abstract
In the past decade, there has been a tremendous increase in studies on the link between oral microbiome and systemic diseases. However, variations in study design and confounding variables across studies often lead to inconsistent observations. In this narrative review, we have discussed the potential influence of study design and confounding variables on the current sequencing-based oral microbiome-systemic disease link studies. The current limitations of oral microbiome-systemic link studies on type 2 diabetes mellitus, rheumatoid arthritis, pregnancy, atherosclerosis, and pancreatic cancer are discussed in this review, followed by our perspective on how artificial intelligence (AI), particularly machine learning and deep learning approaches, can be employed for predicting systemic disease and host metadata from the oral microbiome. The application of AI for predicting systemic disease as well as host metadata requires the establishment of a global database repository with microbiome sequences and annotated host metadata. However, this task requires collective efforts from researchers working in the field of oral microbiome to establish more comprehensive datasets with appropriate host metadata. Development of AI-based models by incorporating consistent host metadata will allow prediction of systemic diseases with higher accuracies, bringing considerable clinical benefits.
Affiliation(s)
- Chaminda Jayampath Seneviratne
  - Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Preethi Balan
  - Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Tanujaa Suriyanarayanan
  - Singapore Oral Microbiomics Initiative (SOMI), National Dental Research Institute Singapore, National Dental Centre Singapore, Duke NUS Medical School, Singapore, Singapore
- Meiyappan Lakshmanan
  - Bioprocessing Technology Institute (BTI), A*STAR - Agency for Science, Technology and Research, Singapore, Singapore
- Dong-Yup Lee
  - Bioprocessing Technology Institute (BTI), A*STAR - Agency for Science, Technology and Research, Singapore, Singapore; School of Chemical Engineering, Sungkyunkwan University, Jongno-gu, Republic of Korea
- Mina Rho
  - Departments of Computer Science and Engineering & Biomedical Informatics, Hanyang University, Seoul, Korea
- Nicholas Jakubovics
  - Oral Biology, School of Dental Sciences, Newcastle University, Newcastle upon Tyne, UK
- Bernd Brandt
  - Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Wim Crielaard
  - Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
- Egija Zaura
  - Department of Preventive Dentistry, Academic Centre for Dentistry Amsterdam, University of Amsterdam and Vrije Universiteit Amsterdam, Amsterdam, The Netherlands
|
48
|
Reiman D, Metwally AA, Sun J, Dai Y. PopPhy-CNN: A Phylogenetic Tree Embedded Architecture for Convolutional Neural Networks to Predict Host Phenotype From Metagenomic Data. IEEE J Biomed Health Inform 2020; 24:2993-3001. [PMID: 32396115 DOI: 10.1109/jbhi.2020.2993761] [Citation(s) in RCA: 28] [Impact Index Per Article: 7.0] [Indexed: 11/05/2022]
Abstract
Accurate prediction of the host phenotype from a metagenomic sample and identification of the associated microbial markers are important in understanding potential host-microbiome interactions related to disease initiation and progression. We introduce PopPhy-CNN, a novel convolutional neural network (CNN) learning framework that effectively exploits phylogenetic structure in microbial taxa for host phenotype prediction. Our approach takes an input format of a 2D matrix representing the phylogenetic tree populated with the relative abundance of microbial taxa in a metagenomic sample. This conversion empowers CNNs to explore the spatial relationship of the taxonomic annotations on the tree and their quantitative characteristics in metagenomic data. We show the competitiveness of our model compared to other available methods using nine metagenomic datasets of moderate size for binary classification. With synthetic and biological datasets, we show the superior and robust performance of our model for multi-class classification. Furthermore, we design a novel scheme for feature extraction from the learned CNN models and demonstrate improved performance when the extracted features are used. PopPhy-CNN is a practical deep learning framework for the prediction of host phenotype with the ability to facilitate the retrieval of predictive microbial taxa.
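The tree-to-matrix conversion described in this abstract can be illustrated with a small sketch. This is not the PopPhy-CNN code: the toy tree, the taxon abundances, the function name `tree_to_matrix`, and the layout rule (one matrix row per tree depth, nodes placed left to right, zero padding on the right) are all assumptions chosen for illustration, and the authors' actual layout may differ.

```python
# Hypothetical toy phylogeny: parent -> list of children.
TREE = {
    "root": ["Firmicutes", "Bacteroidetes"],
    "Firmicutes": ["Lactobacillus", "Clostridium"],
    "Bacteroidetes": ["Bacteroides"],
}
# Hypothetical relative abundances observed at the leaves of one sample.
LEAF_ABUNDANCE = {"Lactobacillus": 0.2, "Clostridium": 0.3, "Bacteroides": 0.5}


def tree_to_matrix(root, tree, leaves):
    """Return a zero-padded 2D list with one row per tree depth.

    Assumption: an internal node's abundance is the sum of its children's.
    """
    abundance = {}

    def fill(node):
        children = tree.get(node, [])
        if children:
            abundance[node] = sum(fill(child) for child in children)
        else:
            abundance[node] = leaves.get(node, 0.0)
        return abundance[node]

    fill(root)

    # Walk the tree level by level; each level becomes one matrix row.
    rows = []
    level = [root]
    while level:
        rows.append([abundance[node] for node in level])
        level = [child for node in level for child in tree.get(node, [])]

    # Pad every row to the width of the widest level.
    width = max(len(row) for row in rows)
    return [row + [0.0] * (width - len(row)) for row in rows]


matrix = tree_to_matrix("root", TREE, LEAF_ABUNDANCE)
for row in matrix:
    print(row)
```

A matrix of this kind could then be fed to a 2D convolutional layer, so that neighbouring cells correspond to related taxa rather than to an arbitrary feature order.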
|
49
|
Khan S, Kelly L. Multiclass Disease Classification from Microbial Whole-Community Metagenomes. Pac Symp Biocomput 2020; 25:55-66. [PMID: 31797586 PMCID: PMC7120658] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/10/2022]
Abstract
The microbiome, the community of microorganisms living within an individual, is a promising avenue for developing non-invasive methods for disease screening and diagnosis. Here, we utilize 5643 aggregated, annotated whole-community metagenomes to implement the first multiclass microbiome disease classifier of this scale, able to discriminate between 18 different diseases and the healthy state. We compared three different machine learning models: random forests, deep neural nets, and a novel graph convolutional architecture which exploits the graph structure of phylogenetic trees as its input. We show that the graph convolutional model outperforms deep neural nets in terms of accuracy (achieving 75% average test-set accuracy), receiver-operator-characteristics (92.1% average area-under-ROC (AUC)), and precision-recall (50% average area-under-precision-recall (AUPR)). Additionally, the convolutional net's performance complements that of the random forest, showing a lower propensity for Type-I errors (false-positives) while the random forest makes fewer Type-II errors (false-negatives). Lastly, we are able to achieve over 90% average top-3 accuracy across all of our models. Together, these results indicate that there are predictive, disease-specific signatures across microbiomes that can be used for diagnostic purposes.
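The top-3 accuracy reported in this abstract counts a prediction as correct when the true class is among the three highest-scoring classes. A minimal sketch of that metric follows; the function name `top_k_accuracy` and the toy scores and labels are assumptions for illustration and do not come from the paper.

```python
def top_k_accuracy(scores, labels, k=3):
    """Fraction of samples whose true label is among the k highest-scoring classes.

    scores: one list of per-class scores per sample.
    labels: the true class index for each sample.
    """
    hits = 0
    for row, true_label in zip(scores, labels):
        # Class indices ordered from highest to lowest score.
        ranked = sorted(range(len(row)), key=lambda i: row[i], reverse=True)
        if true_label in ranked[:k]:
            hits += 1
    return hits / len(labels)


# Hypothetical scores for three samples over four classes.
toy_scores = [
    [0.10, 0.60, 0.20, 0.10],
    [0.40, 0.30, 0.20, 0.10],
    [0.05, 0.05, 0.10, 0.80],
]
toy_labels = [2, 3, 3]

print(top_k_accuracy(toy_scores, toy_labels, k=1))
print(top_k_accuracy(toy_scores, toy_labels, k=3))
```

With many classes, as in an 18-disease classifier, top-k accuracy is typically much higher than plain accuracy, which is consistent with the gap between the 75% and 90% figures quoted above.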
Affiliation(s)
- Saad Khan
  - Department of Systems & Computational Biology, Bronx, NY, USA
- Libusha Kelly
  - Department of Systems & Computational Biology, Bronx, NY, USA
  - Department of Microbiology & Immunology, Albert Einstein College of Medicine, Bronx, NY, USA
|
50
|
van den Bogert B, Boekhorst J, Pirovano W, May A. On the Role of Bioinformatics and Data Science in Industrial Microbiome Applications. Front Genet 2019; 10:721. [PMID: 31447883 PMCID: PMC6696986 DOI: 10.3389/fgene.2019.00721] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 04/15/2019] [Accepted: 07/09/2019] [Indexed: 01/08/2023] Open
Abstract
Advances in sequencing and computational biology have drastically increased our capability to explore the taxonomic and functional compositions of microbial communities that play crucial roles in industrial processes. Correspondingly, commercial interest has risen for applications where microbial communities make important contributions. These include food production, probiotics, cosmetics, and enzyme discovery. Other commercial applications include software that takes the user's gut microbiome data as one of its inputs and outputs evidence-based, automated, and personalized diet recommendations for balanced blood sugar levels. These applications pose several bioinformatic and data science challenges that range from requiring strain-level resolution in community profiles to the integration of large datasets for predictive machine learning purposes. In this perspective, we provide our insights on such challenges by touching upon several industrial areas, and briefly discuss advances and future directions of bioinformatics and data science in microbiome research.
Affiliation(s)
- Ali May
  - Research and Development Dept., BaseClear, Leiden, Netherlands
|