1
Gündüz HA, Mreches R, Moosbauer J, Robertson G, To XY, Franzosa EA, Huttenhower C, Rezaei M, McHardy AC, Bischl B, Münch PC, Binder M. Optimized model architectures for deep learning on genomic data. Commun Biol 2024; 7:516. [PMID: 38693292 PMCID: PMC11063068 DOI: 10.1038/s42003-024-06161-1] [Received: 02/08/2023] [Accepted: 04/08/2024] [Indexed: 05/03/2024]
Abstract
The success of deep learning in various applications depends on task-specific architecture design choices, including the types, hyperparameters, and number of layers. In computational biology, there is no consensus on the optimal architecture design, and decisions are often made using insights from more well-established fields such as computer vision. These may not consider the domain-specific characteristics of genome sequences, potentially limiting performance. Here, we present GenomeNet-Architect, a neural architecture design framework that automatically optimizes deep learning models for genome sequence data. It optimizes the overall layout of the architecture, with a search space specifically designed for genomics. Additionally, it optimizes hyperparameters of individual layers and the model training procedure. On a viral classification task, GenomeNet-Architect reduced the read-level misclassification rate by 19%, with 67% faster inference and 83% fewer parameters, and achieved similar contig-level accuracy with ~100 times fewer parameters compared to the best-performing deep learning baselines.
Affiliation(s)
- Hüseyin Anil Gündüz
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- René Mreches
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Julia Moosbauer
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Gary Robertson
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Xiao-Yin To
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Eric A Franzosa
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Curtis Huttenhower
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Mina Rezaei
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Alice C McHardy
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
  - German Centre for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
- Bernd Bischl
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Philipp C Münch
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
  - German Centre for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
- Martin Binder
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
2
Roy G, Prifti E, Belda E, Zucker JD. Deep learning methods in metagenomics: a review. Microb Genom 2024; 10. [PMID: 38630611 DOI: 10.1099/mgen.0.001231] [Indexed: 04/19/2024]
Abstract
The ever-decreasing cost of sequencing and the growing potential applications of metagenomics have led to an unprecedented surge in data generation. One of the most prevalent applications of metagenomics is the study of microbial environments, such as the human gut. The gut microbiome plays a crucial role in human health, providing vital information for patient diagnosis and prognosis. However, analysing metagenomic data remains challenging due to several factors, including reference catalogues, sparsity and compositionality. Deep learning (DL) enables novel and promising approaches that complement state-of-the-art microbiome pipelines. DL-based methods can address almost all aspects of microbiome analysis, including novel pathogen detection, sequence classification, patient stratification and disease prediction. Beyond generating predictive models, a key aspect of these methods is also their interpretability. This article reviews DL approaches in metagenomics, including convolutional networks, autoencoders and attention-based models. These methods aggregate contextualized data and pave the way for improved patient care and a better understanding of the microbiome's key role in our health.
Affiliation(s)
- Gaspar Roy
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
- Edi Prifti
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l'hopital, 75013 Paris, France
- Eugeni Belda
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l'hopital, 75013 Paris, France
- Jean-Daniel Zucker
  - IRD, Sorbonne University, UMMISCO, 32 avenue Henry Varagnat, Bondy Cedex, France
  - Sorbonne University, INSERM, Nutriomics, 91 bvd de l'hopital, 75013 Paris, France
3
Baddal B, Taner F, Uzun Ozsahin D. Harnessing of Artificial Intelligence for the Diagnosis and Prevention of Hospital-Acquired Infections: A Systematic Review. Diagnostics (Basel) 2024; 14:484. [PMID: 38472956 DOI: 10.3390/diagnostics14050484] [Received: 12/16/2023] [Revised: 01/23/2024] [Accepted: 02/19/2024] [Indexed: 03/14/2024]
Abstract
Healthcare-associated infections (HAIs) are the most common adverse events in healthcare and constitute a major global public health concern. Surveillance represents the foundation for the effective prevention and control of HAIs, yet conventional surveillance is costly and labor intensive. Artificial intelligence (AI) and machine learning (ML) have the potential to support the development of HAI surveillance algorithms for the understanding of HAI risk factors, the improvement of patient risk stratification as well as the prediction and timely detection and prevention of infections. AI-supported systems have so far been explored for clinical laboratory testing and imaging diagnosis, antimicrobial resistance profiling, antibiotic discovery and prediction-based clinical decision support tools in terms of HAIs. This review aims to provide a comprehensive summary of the current literature on AI applications in the field of HAIs and discuss the future potentials of this emerging technology in infection practice. Following the PRISMA guidelines, this study examined the articles in databases including PubMed and Scopus until November 2023, which were screened based on the inclusion and exclusion criteria, resulting in 162 included articles. By elucidating the advancements in the field, we aim to highlight the potential applications of AI in the field, report related issues and shortcomings and discuss the future directions.
Affiliation(s)
- Buket Baddal
  - Department of Medical Microbiology and Clinical Microbiology, Faculty of Medicine, Near East University, North Cyprus, Mersin 10, 99138 Nicosia, Turkey
  - DESAM Research Institute, Near East University, North Cyprus, Mersin 10, 99138 Nicosia, Turkey
- Ferdiye Taner
  - Department of Medical Microbiology and Clinical Microbiology, Faculty of Medicine, Near East University, North Cyprus, Mersin 10, 99138 Nicosia, Turkey
  - DESAM Research Institute, Near East University, North Cyprus, Mersin 10, 99138 Nicosia, Turkey
- Dilber Uzun Ozsahin
  - Department of Medical Diagnostic Imaging, College of Health Science, University of Sharjah, Sharjah 27272, United Arab Emirates
  - Research Institute for Medical and Health Sciences, University of Sharjah, Sharjah 27272, United Arab Emirates
  - Operational Research Centre in Healthcare, Near East University, North Cyprus, Mersin 10, 99138 Nicosia, Turkey
4
Wu Z, Guo Y, Hayakawa M, Yang W, Lu Y, Ma J, Li L, Li C, Liu Y, Niu J. Artificial intelligence-driven microbiome data analysis for estimation of postmortem interval and crime location. Front Microbiol 2024; 15:1334703. [PMID: 38314433 PMCID: PMC10834752 DOI: 10.3389/fmicb.2024.1334703] [Received: 11/07/2023] [Accepted: 01/08/2024] [Indexed: 02/06/2024]
Abstract
Microbial communities, demonstrating dynamic changes in cadavers and the surroundings, provide invaluable insights for forensic investigations. Conventional methodologies for microbiome sequencing data analysis face obstacles due to subjectivity and inefficiency. Artificial Intelligence (AI) presents an efficient and accurate tool, with the ability to autonomously process and analyze high-throughput data, and assimilate multi-omics data, encompassing metagenomics, transcriptomics, and proteomics. This facilitates accurate and efficient estimation of the postmortem interval (PMI), detection of crime location, and elucidation of microbial functionalities. This review presents an overview of microorganisms from cadavers and crime scenes, emphasizes the importance of microbiome, and summarizes the application of AI in high-throughput microbiome data processing in forensic microbiology.
Affiliation(s)
- Ze Wu
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Yaoxing Guo
  - Department of Dermatology, The First Hospital of China Medical University, Shenyang, China
  - Key Laboratory of Immunodermatology, Ministry of Education and NHC, Shenyang, China
  - National Joint Engineering Research Center for Theranostics of Immunological Skin Diseases, Shenyang, China
- Miren Hayakawa
  - Beijing Anzhen Hospital, Capital Medical University, Beijing, China
- Wei Yang
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Yansong Lu
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Jingyi Ma
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Linghui Li
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Chuntao Li
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Yingchun Liu
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
- Jun Niu
  - Department of Dermatology, General Hospital of Northern Theater Command, Shenyang, China
5
Arias PM, Butler J, Randhawa GS, Soltysiak MPM, Hill KA, Kari L. Environment and taxonomy shape the genomic signature of prokaryotic extremophiles. Sci Rep 2023; 13:16105. [PMID: 37752120 PMCID: PMC10522608 DOI: 10.1038/s41598-023-42518-y] [Received: 06/06/2023] [Accepted: 09/11/2023] [Indexed: 09/28/2023]
Abstract
This study provides comprehensive quantitative evidence suggesting that adaptations to extreme temperatures and pH imprint a discernible environmental component in the genomic signature of microbial extremophiles. Both supervised and unsupervised machine learning algorithms were used to analyze genomic signatures, each computed as the k-mer frequency vector of a 500 kbp DNA fragment arbitrarily selected to represent a genome. Computational experiments classified/clustered genomic signatures extracted from a curated dataset of [Formula: see text] extremophile (temperature, pH) bacteria and archaea genomes, at multiple scales of analysis, [Formula: see text]. The supervised learning resulted in high accuracies for taxonomic classifications at [Formula: see text], and medium to medium-high accuracies for environment category classifications of the same datasets at [Formula: see text]. For [Formula: see text], our findings were largely consistent with amino acid compositional biases and codon usage patterns in coding regions, previously attributed to extreme environment adaptations. The unsupervised learning of unlabelled sequences identified several exemplars of hyperthermophilic organisms with large similarities in their genomic signatures, in spite of belonging to different domains in the Tree of Life.
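The genomic signature described in this abstract, a k-mer frequency vector computed from a DNA fragment, can be illustrated with a minimal Python sketch. This is not the authors' code: the function name `kmer_frequency_vector`, the lexicographic ordering over `ACGT`, and the choice to skip k-mers containing other symbols are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def kmer_frequency_vector(seq, k):
    """Return relative k-mer frequencies of seq, ordered lexicographically over ACGT.

    Hypothetical helper for illustration; k-mers containing symbols other
    than A, C, G, T (for example N) are ignored.
    """
    seq = seq.upper()
    # All 4**k possible k-mers in a fixed order, so vectors are comparable.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    # Count every overlapping window of length k.
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


# Toy fragment; the study itself uses 500 kbp fragments.
signature = kmer_frequency_vector("ACGTACGTAC", 2)
```

Vectors computed this way have a fixed length of 4^k regardless of fragment length, which is what allows them to be fed to standard supervised or unsupervised learners.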
Affiliation(s)
- Pablo Millán Arias
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
- Joseph Butler
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Gurjit S Randhawa
  - School of Mathematical and Computational Sciences, University of Prince Edward Island, Charlottetown, PE, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
6
Gündüz HA, Binder M, To XY, Mreches R, Bischl B, McHardy AC, Münch PC, Rezaei M. A self-supervised deep learning method for data-efficient training in genomics. Commun Biol 2023; 6:928. [PMID: 37696966 PMCID: PMC10495322 DOI: 10.1038/s42003-023-05310-2] [Received: 02/08/2023] [Accepted: 09/01/2023] [Indexed: 09/13/2023]
Abstract
Deep learning in bioinformatics is often limited to problems where extensive amounts of labeled data are available for supervised classification. By exploiting unlabeled data, self-supervised learning techniques can improve the performance of machine learning models in the presence of limited labeled data. Although many self-supervised learning methods have been suggested before, they have failed to exploit the unique characteristics of genomic data. Therefore, we introduce Self-GenomeNet, a self-supervised learning technique that is custom-tailored for genomic data. Self-GenomeNet leverages reverse-complement sequences and effectively learns short- and long-term dependencies by predicting targets of different lengths. Self-GenomeNet performs better than other self-supervised methods in data-scarce genomic tasks and outperforms standard supervised training with ~10 times fewer labeled training data. Furthermore, the learned representations generalize well to new datasets and tasks. These findings suggest that Self-GenomeNet is well suited for large-scale, unlabeled genomic datasets and could substantially improve the performance of genomic models.
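For readers unfamiliar with the reverse-complement relationship mentioned in this abstract, the following Python sketch shows how the reverse complement of a DNA string is obtained. It is only an illustration of the sequence operation, not of Self-GenomeNet itself; the function name `reverse_complement` is an assumption.

```python
# Translation table mapping each base to its complement (N stays N).
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence (hypothetical helper)."""
    # Complement every base, then reverse the string.
    return seq.translate(_COMPLEMENT)[::-1]


rc = reverse_complement("ATGC")
```

Because both strands of a DNA molecule carry the same information, a sequence and its reverse complement can be treated as two views of the same sample.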
Affiliation(s)
- Hüseyin Anil Gündüz
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Martin Binder
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Xiao-Yin To
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- René Mreches
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Bernd Bischl
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
- Alice C McHardy
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Philipp C Münch
  - Department for Computational Biology of Infection Research, Helmholtz Center for Infection Research, 38124, Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
  - German Center for Infection Research (DZIF), partner site Hannover Braunschweig, Braunschweig, Germany
  - Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
- Mina Rezaei
  - Department of Statistics, LMU Munich, Munich, Germany
  - Munich Center for Machine Learning, Munich, Germany
7
Park H, Lim SJ, Cosme J, O'Connell K, Sandeep J, Gayanilo F, Cutter Jr. GR, Montes E, Nitikitpaiboon C, Fisher S, Moustahfid H, Thompson LR. Investigation of machine learning algorithms for taxonomic classification of marine metagenomes. Microbiol Spectr 2023; 11:e0523722. [PMID: 37695074 PMCID: PMC10580933 DOI: 10.1128/spectrum.05237-22] [Received: 12/21/2022] [Accepted: 06/30/2023] [Indexed: 09/12/2023]
Abstract
Microbial communities play key roles in ocean ecosystems through regulation of biogeochemical processes such as carbon and nutrient cycling, food web dynamics, and gut microbiomes of invertebrates, fish, reptiles, and mammals. Assessments of marine microbial diversity are therefore critical to understanding spatiotemporal variations in microbial community structure and function in ocean ecosystems. With recent advances in DNA shotgun sequencing for metagenome samples and computational analysis, it is now possible to access the taxonomic and genomic content of ocean microbial communities to study their structural patterns, diversity, and functional potential. However, existing taxonomic classification tools depend upon manually curated phylogenetic trees, which can create inaccuracies in metagenomes from less well-characterized communities, such as from ocean water. Herein, we explore the utility of deep learning tools (DeepMicrobes and a novel Residual Network architecture) that leverage natural language processing and convolutional neural network architectures to map input sequence data (k-mers) to output labels (taxonomic groups) without reliance on a curated taxonomic tree. We trained both models using metagenomic reads simulated from marine microbial genomes in the MarRef database. The performance of both models (accuracy, precision, and percent microbe predicted) was compared with the standard taxonomic classification tool Kraken2 using 10 complex metagenomic data sets simulated from MarRef. Our results demonstrate that time, compute power, and microbial genomic diversity still pose challenges for machine learning (ML). Moreover, our results suggest that high genome coverage and rectification of class imbalance are prerequisites for a well-trained model, and therefore should be a major consideration in future ML work.
IMPORTANCE: Taxonomic profiling of microbial communities is essential to model microbial interactions and inform habitat conservation. This work develops approaches in constructing training/testing data sets from publicly available marine metagenomes and evaluates the performance of machine learning (ML) approaches in read-based taxonomic classification of marine metagenomes. Predictions from two models are used to test accuracy in metagenomic classification and to guide improvements in ML approaches. Our study provides insights on the methods, results, and challenges of deep learning on marine microbial metagenomic data sets. Future machine learning approaches can be improved by rectifying genome coverage and class imbalance in the training data sets, developing alternative models, and increasing the accessibility of computational resources for model training and refinement.
Affiliation(s)
- Helen Park
  - Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China
  - EPSRC/BBSRC Future Biomanufacturing Research Hub, EPSRC Synthetic Biology Research Centre SYNBIOCHEM Manchester Institute of Biotechnology and School of Chemistry, The University of Manchester, Manchester, United Kingdom
- Shen Jean Lim
  - Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
  - Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
  - College of Marine Science, University of South Florida, St Petersburg, Florida, USA
- Kyle O'Connell
  - Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
  - Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Northwest, Washington, DC, USA
- Jilla Sandeep
  - Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
- Felimon Gayanilo
  - Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
- George R. Cutter Jr.
  - Southwest Fisheries Science Center, Antarctic Ecosystem Research Division, National Oceanic and Atmospheric Administration, La Jolla, California, USA
- Enrique Montes
  - Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
  - Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- Chotinan Nitikitpaiboon
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
- Sam Fisher
  - Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
- Hassan Moustahfid
  - NOAA/US Integrated Ocean Observing System (IOOS), Silver Spring, Maryland, USA
- Luke R. Thompson
  - Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
  - Northern Gulf Institute, Mississippi State University, Mississippi, USA
8
Zhao L, Walkowiak S, Fernando WGD. Artificial Intelligence: A Promising Tool in Exploring the Phytomicrobiome in Managing Disease and Promoting Plant Health. Plants (Basel) 2023; 12:1852. [PMID: 37176910 PMCID: PMC10180744 DOI: 10.3390/plants12091852] [Received: 03/06/2023] [Revised: 04/25/2023] [Accepted: 04/27/2023] [Indexed: 05/15/2023]
Abstract
There is increasing interest in harnessing the microbiome to improve cropping systems. With the availability of high-throughput and low-cost sequencing technologies, gathering microbiome data is becoming more routine. However, the analysis of microbiome data is challenged by the size and complexity of the data, and the incomplete nature of many microbiome databases. Further, to bring microbiome data value, it often needs to be analyzed in conjunction with other complex data that impact on crop health and disease management, such as plant genotype and environmental factors. Artificial intelligence (AI), boosted through deep learning (DL), has achieved significant breakthroughs and is a powerful tool for managing large complex datasets such as the interplay between the microbiome, crop plants, and their environment. In this review, we aim to provide readers with a brief introduction to AI techniques, and we introduce how AI has been applied to areas of microbiome sequencing taxonomy, the functional annotation for microbiome sequences, associating the microbiome community with host traits, designing synthetic communities, genomic selection, field phenotyping, and disease forecasting. At the end of this review, we proposed further efforts that are required to fully exploit the power of AI in studying phytomicrobiomes.
Affiliation(s)
- Liang Zhao
  - Department of Plant Science, University of Manitoba, Winnipeg, MB R3T 2N2, Canada
9
Cres CM, Tritt A, Bouchard KE, Zhang Y. DL-TODA: A Deep Learning Tool for Omics Data Analysis. Biomolecules 2023; 13:585. [PMID: 37189333 DOI: 10.3390/biom13040585] [Received: 12/08/2022] [Revised: 03/07/2023] [Accepted: 03/22/2023] [Indexed: 05/17/2023]
Abstract
Metagenomics is a technique for genome-wide profiling of microbiomes; this technique generates billions of DNA sequences called reads. Given the multiplication of metagenomic projects, computational tools are necessary to enable the efficient and accurate classification of metagenomic reads without needing to construct a reference database. The program DL-TODA presented here aims to classify metagenomic reads using a deep learning model trained on over 3000 bacterial species. A convolutional neural network architecture originally designed for computer vision was applied for the modeling of species-specific features. Using synthetic testing data simulated with 2454 genomes from 639 species, DL-TODA was shown to classify nearly 75% of the reads with high confidence. The classification accuracy of DL-TODA was over 0.98 at taxonomic ranks above the genus level, making it comparable with Kraken2 and Centrifuge, two state-of-the-art taxonomic classification tools. DL-TODA also achieved an accuracy of 0.97 at the species level, which is higher than 0.93 by Kraken2 and 0.85 by Centrifuge on the same test set. Application of DL-TODA to the human oral and cropland soil metagenomes further demonstrated its use in analyzing microbiomes from diverse environments. Compared to Centrifuge and Kraken2, DL-TODA predicted distinct relative abundance rankings and is less biased toward a single taxon.
Affiliation(s)
- Cecile M Cres
  - Department of Cell and Molecular Biology, College of the Environment and Life Sciences, University of Rhode Island, Kingston, RI 02881, USA
- Andrew Tritt
  - Lawrence Berkeley National Laboratory, Scientific Data Division, Berkeley, CA 94720, USA
  - Lawrence Berkeley National Laboratory, Applied Mathematics & Computational Research Division, Berkeley, CA 94720, USA
- Kristofer E Bouchard
  - Lawrence Berkeley National Laboratory, Scientific Data Division, Berkeley, CA 94720, USA
  - Lawrence Berkeley National Laboratory, Biological Systems & Engineering Division, Berkeley, CA 94720, USA
  - Redwood Center for Theoretical Neuroscience, Helen Wills Neuroscience Institute, University of California, Berkeley, CA 94720, USA
- Ying Zhang
  - Department of Cell and Molecular Biology, College of the Environment and Life Sciences, University of Rhode Island, Kingston, RI 02881, USA
10
CNN_FunBar: Advanced Learning Technique for Fungi ITS Region Classification. Genes (Basel) 2023; 14:634. [PMID: 36980906 PMCID: PMC10048311 DOI: 10.3390/genes14030634] [Received: 12/01/2022] [Revised: 12/28/2022] [Accepted: 01/09/2023] [Indexed: 03/06/2023]
Abstract
Fungal species identification from metagenomic data is a highly challenging task. Internal Transcribed Spacer (ITS) region is a potential DNA marker for fungi taxonomy prediction. Computational approaches, especially deep learning algorithms, are highly efficient for better pattern recognition and classification of large datasets compared to in silico techniques such as BLAST and machine learning methods. Here in this study, we present CNN_FunBar, a convolutional neural network-based approach for the classification of fungi ITS sequences from UNITE+INSDC reference datasets. Effects of convolution kernel size, filter numbers, k-mer size, degree of diversity and category-wise frequency of ITS sequences on classification performances of CNN models have been assessed at all taxonomic levels (species, genus, family, order, class and phylum). It is observed that CNN models can produce >93% average accuracy for classifying ITS sequences from balanced datasets with 500 sequences per category and 6-mer frequency features at all levels. The comparative study has revealed that CNN_FunBar can outperform machine learning-based algorithms (SVM, KNN, Naïve-Bayes and Random Forest) as well as existing fungal taxonomy prediction software (funbarRF, Mothur, RDP Classifier and SINTAX). The present study will be helpful for fungal taxonomy classification using large metagenomic datasets.
11
Abadi SAR, Mohammadi A, Koohi S. An automated ultra-fast, memory-efficient, and accurate method for viral genome classification. J Biomed Inform 2023; 139:104316. [PMID: 36781036 DOI: 10.1016/j.jbi.2023.104316] [Received: 12/03/2022] [Revised: 01/30/2023] [Accepted: 02/08/2023] [Indexed: 02/13/2023]
Abstract
The classification of different organisms into subtypes is one of the most important tools of organism studies, and among them, the classification of viruses itself has been the focus of many studies due to their use in virology and epidemiology. Many methods have been proposed to classify viruses, some of which are designed for a specific family of organisms and some of which are more general. But still, especially for certain categories such as Influenza and HIV, classification is facing performance challenges as well as processing and memory bottlenecks. In this way, we designed an automated classifier, called PC-mer, that is based on k-mer and physicochemical characteristics of nucleotides, which reduces the number of features about 2^k times compared to the alternative methods based on k-mer, and compared to integer and one-hot encoding methods, it is possible to keep the number of features constant despite the growth of the sequence length. In this way, it also increases the training speed by an average of 17.93 times. This improvement in processing complexity is provided while PC-mer can also improve the classifying performance for a variety of virus families.
Affiliation(s)
- Amirhossein Mohammadi
- No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
- Somayyeh Koohi
- No 717, Department of Computer Engineering, Sharif University of Technology, Tehran, Iran
12
Madival SD, Mishra DC, Sharma A, Kumar S, Maji AK, Budhlakoti N, Sinha D, Rai A. A Deep Clustering-based Novel Approach for Binning of Metagenomics Data. Curr Genomics 2022; 23:353-368. [PMID: 36778191 PMCID: PMC9878855 DOI: 10.2174/1389202923666220928150100] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2022] [Revised: 08/30/2022] [Accepted: 09/02/2022] [Indexed: 11/22/2022] [Open access]
Abstract
Background: One major challenge in binning metagenomics data is the limited availability of reference datasets, as only 1% of the total microbial population has been cultured so far. This has increased the importance of unsupervised methods for binning in the absence of any reference datasets.
Objective: To develop a deep clustering-based binning approach for metagenomics data and to evaluate the results with suitable measures.
Methods: In this study, a deep learning-based approach has been taken for binning metagenomics data. The results are validated on different datasets by considering features such as tetra-nucleotide frequency (TNF), hexa-nucleotide frequency (HNF) and GC content. A convolutional autoencoder is used for feature extraction, and the K-means clustering method is used for binning.
Results: In most cases, evaluation parameters such as the Silhouette index and Rand index are greater than 0.5 and 0.8, respectively, which indicates that the proposed approach gives satisfactory results. The performance of the developed approach is compared with current methods and tools using benchmarked low-complexity simulated and real metagenomic datasets. It is found to be better than unsupervised methods and on par with semi-supervised methods.
Conclusion: An unsupervised deep learning-based approach for binning has been proposed, and the developed method shows promising results for various datasets. This is a novel approach to the lack of reference data in metagenomic binning.
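As a small illustration of one of the evaluation measures named in this abstract, the following Python sketch computes the plain (unadjusted) Rand index between a reference assignment and a predicted binning. Whether the authors used this variant or an adjusted one is not stated here, and the function name `rand_index` is hypothetical.

```python
from itertools import combinations


def rand_index(labels_true, labels_pred):
    """Plain Rand index: fraction of item pairs on which two labelings agree.

    A pair "agrees" when both labelings put the two items in the same
    group, or both put them in different groups. Label names themselves
    do not matter, only the grouping.
    """
    if len(labels_true) != len(labels_pred):
        raise ValueError("label lists must have the same length")
    pairs = list(combinations(range(len(labels_true)), 2))
    if not pairs:
        return 1.0  # zero or one item: trivially identical partitions
    agreements = sum(
        (labels_true[i] == labels_true[j]) == (labels_pred[i] == labels_pred[j])
        for i, j in pairs
    )
    return agreements / len(pairs)


if __name__ == "__main__":
    reference = ["genomeA", "genomeA", "genomeB", "genomeB"]
    bins = [1, 1, 2, 2]
    print(rand_index(reference, bins))
```

This pairwise version is quadratic in the number of contigs, so it is only meant to make the definition concrete, not to scale to large datasets.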
Affiliation(s)
- Dwijesh Chandra Mishra
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi 110012, India (corresponding author)
- Anu Sharma
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi 110012, India
- Sanjeev Kumar
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi 110012, India
- Arpan Kumar Maji
- Division of Computer Applications, ICAR-IASRI, New Delhi 110012, India
- Neeraj Budhlakoti
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi 110012, India
- Dipro Sinha
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi 110012, India
- Anil Rai
- Division of Agriculture Bioinformatics, ICAR-IASRI, New Delhi 110012, India
13
Deciphering microbial gene function using natural language processing. Nat Commun 2022; 13:5731. [PMID: 36175448 PMCID: PMC9523054 DOI: 10.1038/s41467-022-33397-4] [Citation(s) in RCA: 12] [Impact Index Per Article: 6.0] [Received: 02/18/2022] [Accepted: 09/16/2022] [Indexed: 11/08/2022] [Open access]
Abstract
Revealing the function of uncharacterized genes is a fundamental challenge in an era of ever-increasing volumes of sequencing data. Here, we present a concept for tackling this challenge using deep learning methodologies adopted from natural language processing (NLP). We repurpose NLP algorithms to model "gene semantics" based on a biological corpus of more than 360 million microbial genes within their genomic context. We use the language models to predict functional categories for 56,617 genes and find that out of 1369 genes associated with recently discovered defense systems, 98% are inferred correctly. We then systematically evaluate the "discovery potential" of different functional categories, pinpointing those with the most genes yet to be characterized. Finally, we demonstrate our method's ability to discover systems associated with microbial interaction and defense. Our results highlight that combining microbial genomics and language models is a promising avenue for revealing gene functions in microbes.
14
Taxonomic classification of DNA sequences beyond sequence similarity using deep neural networks. Proc Natl Acad Sci U S A 2022; 119:e2122636119. [PMID: 36018838 PMCID: PMC9436379 DOI: 10.1073/pnas.2122636119] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Indexed: 11/18/2022] [Open access]
Abstract
Taxonomic classification, that is, the assignment to biological clades with shared ancestry, is a common task in genetics, mainly based on a genome similarity search of large genome databases. The classification quality depends heavily on the database, since representative relatives must be present. Many genomic sequences cannot be classified at all or only with a high misclassification rate. Here we present BERTax, a deep neural network program based on natural language processing to precisely classify the superkingdom and phylum of DNA sequences taxonomically without the need for a known representative relative from a database. We show BERTax to be at least on par with the state-of-the-art approaches when taxonomically similar species are part of the training data. For novel organisms, however, BERTax clearly outperforms any existing approach. Finally, we show that BERTax can also be combined with database approaches to further increase the prediction quality in almost all cases. Since BERTax is not based on similar entries in databases, it allows precise taxonomic classification of a broader range of genomic sequences, thus increasing the overall information gain.
15
Bai X, Ren J, Sun F. MLR-OOD: A Markov Chain Based Likelihood Ratio Method for Out-Of-Distribution Detection of Genomic Sequences. J Mol Biol 2022; 434:167586. [PMID: 35427634 PMCID: PMC10433695 DOI: 10.1016/j.jmb.2022.167586] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/16/2022] [Revised: 04/05/2022] [Accepted: 04/05/2022] [Indexed: 12/23/2022]
Abstract
Machine learning and deep learning models have been widely used for taxonomic classification of metagenomic sequences, and many studies have reported high classification accuracy. Such models are usually trained on sequences from several training classes in the hope of accurately classifying unknown sequences into these classes. However, when the classification models are deployed on real testing data sets, sequences that do not belong to any of the training classes may be present and are falsely assigned to one of the training classes with high confidence. Such sequences are referred to as out-of-distribution (OOD) sequences and are ubiquitous in metagenomic studies. To address this problem, we develop a deep generative model-based method, MLR-OOD, that measures the probability of a testing sequence belonging to OOD by the likelihood ratio of the maximum of the in-distribution (ID) class conditional likelihoods and the Markov chain likelihood of the testing sequence, which measures the sequence complexity. We compose three different microbial data sets consisting of bacterial, viral, and plasmid sequences for comprehensively benchmarking OOD detection methods. We show that MLR-OOD achieves state-of-the-art performance, demonstrating its generality across various types of microbial data sets. It is also shown that MLR-OOD is robust to GC content, which is a major confounding effect for OOD detection of genomic sequences. In conclusion, MLR-OOD will greatly reduce false positives caused by OOD sequences in metagenomic sequence classification.
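The likelihood-ratio idea described in this abstract can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the MLR-OOD implementation: the class-conditional log-likelihoods are taken as given numbers (in the paper they come from deep generative models), the Markov chain is fitted to the test sequence itself with add-one smoothing, and the names `markov_log_likelihood` and `ood_score`, the default order, and the sign convention are all hypothetical.

```python
import math
from collections import defaultdict


def markov_log_likelihood(seq, order=2, pseudocount=1.0):
    """Log-likelihood of seq under a Markov chain estimated from seq itself.

    Assumes an A/C/G/T alphabet (hence the factor 4 in the smoothing term).
    Low-complexity sequences receive a high likelihood under this model.
    """
    seq = seq.upper()
    counts = defaultdict(lambda: defaultdict(float))
    # First pass: count transitions context -> next base.
    for i in range(order, len(seq)):
        counts[seq[i - order:i]][seq[i]] += 1.0
    # Second pass: sum smoothed log transition probabilities.
    log_lik = 0.0
    for i in range(order, len(seq)):
        context = counts[seq[i - order:i]]
        total = sum(context.values()) + 4 * pseudocount
        log_lik += math.log((context[seq[i]] + pseudocount) / total)
    return log_lik


def ood_score(class_log_likelihoods, seq, order=2):
    """Log likelihood ratio: best in-distribution class vs. Markov baseline.

    In this sketch, lower scores suggest the sequence is more likely OOD.
    """
    return max(class_log_likelihoods) - markov_log_likelihood(seq, order)


if __name__ == "__main__":
    # Hypothetical log-likelihoods from two in-distribution class models.
    print(ood_score([-42.0, -37.5], "ACGTTGCAACGTTGACCGTA"))
```

Subtracting the Markov term is what corrects for sequence complexity: a repetitive sequence that every class model happens to score highly is also scored highly by the baseline, so its ratio stays low.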
Affiliation(s)
- Xin Bai
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
- Jie Ren
- Google Research, Brain Team, USA
- Fengzhu Sun
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, CA 90089, USA
16
Câmara GBM, Coutinho MGF, da Silva LMD, Gadelha WVDN, Torquato MF, Barbosa RDM, Fernandes MAC. Convolutional Neural Network Applied to SARS-CoV-2 Sequence Classification. Sensors (Basel) 2022; 22:5730. [PMID: 35957287 PMCID: PMC9371030 DOI: 10.3390/s22155730] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2022] [Revised: 07/28/2022] [Accepted: 07/28/2022] [Indexed: 06/15/2023]
Abstract
COVID-19, the illness caused by severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2), a virus of the Coronaviridae family with a single-stranded positive-sense RNA genome, has been spreading around the world and has been declared a pandemic by the World Health Organization. On 17 January 2022, there were more than 329 million cases, with more than 5.5 million deaths. Although COVID-19 has a low mortality rate, its high capacities for contamination, spread, and mutation worry the authorities, especially after the emergence of the Omicron variant, which has a high transmission capacity and can more easily infect even vaccinated people. Such outbreaks require elucidation of the taxonomic classification and origin of the virus (SARS-CoV-2) from the genomic sequence for strategic planning, containment, and treatment of the disease. Thus, this work proposes a high-accuracy technique to classify viruses and other organisms from a genome sequence using a deep learning convolutional neural network (CNN). Unlike other approaches in the literature, the proposed approach does not limit the length of the genome sequence. The results show that the proposal accurately distinguishes SARS-CoV-2 from the sequences of other viruses. The results were obtained from 1557 instances of SARS-CoV-2 from the National Center for Biotechnology Information (NCBI) and 14,684 different viruses from the Virus-Host DB. As a CNN has several changeable parameters, the tests were performed with forty-eight different architectures; the best of these had an accuracy of 91.94 ± 2.62% in classifying viruses into their realms correctly, in addition to 100% accuracy in classifying SARS-CoV-2 into its respective realm, Riboviria. For the subsequent classifications (family, genus, and subgenus), this accuracy increased, which shows that the proposed architecture may be viable for classifying the virus that causes COVID-19.
Affiliation(s)
- Gabriel B. M. Câmara
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Maria G. F. Coutinho
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Lucileide M. D. da Silva
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Federal Institute of Education, Science and Technology of Rio Grande do Norte, Paraiso, Santa Cruz 59200-000, RN, Brazil
- Walter V. do N. Gadelha
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Matheus F. Torquato
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Raquel de M. Barbosa
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Department of Pharmacy and Pharmaceutical Technology, University of Granada, 18071 Granada, Spain
- Marcelo A. C. Fernandes
- Bioinformatics Multidisciplinary Environment (BioME), Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Laboratory of Machine Learning and Intelligent Instrumentation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
- Department of Computer Engineering and Automation, Federal University of Rio Grande do Norte, Natal 59078-970, RN, Brazil
17
Jiang Y, Luo J, Huang D, Liu Y, Li DD. Machine Learning Advances in Microbiology: A Review of Methods and Applications. Front Microbiol 2022; 13:925454. [PMID: 35711777 PMCID: PMC9196628 DOI: 10.3389/fmicb.2022.925454] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/21/2022] [Accepted: 05/09/2022] [Indexed: 12/18/2022] [Open access]
Abstract
Microorganisms play an important role in natural material and elemental cycles. Many common and general biology research techniques rely on microorganisms. Machine learning has been gradually integrated with multiple fields of study. Machine learning, including deep learning, aims to use mathematical insights to optimize variational functions to aid microbiology using various types of available data to help humans organize and apply collective knowledge of various research objects in a systematic and scaled manner. Classification and prediction have become the main achievements in the development of microbial community research in the direction of computational biology. This review summarizes the application and development of machine learning and deep learning in the field of microbiology and shows and compares the advantages and disadvantages of different algorithm tools in four fields: microbiome and taxonomy, microbial ecology, pathogen and epidemiology, and drug discovery.
18
McElhinney JMWR, Catacutan MK, Mawart A, Hasan A, Dias J. Interfacing Machine Learning and Microbial Omics: A Promising Means to Address Environmental Challenges. Front Microbiol 2022; 13:851450. [PMID: 35547145 PMCID: PMC9083327 DOI: 10.3389/fmicb.2022.851450] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 01/09/2022] [Accepted: 03/14/2022] [Indexed: 11/13/2022] [Open access]
Abstract
Microbial communities are ubiquitous and carry an exceptionally broad metabolic capability. Upon environmental perturbation, microbes are also amongst the first natural responsive elements with perturbation-specific cues and markers. These communities are thereby uniquely positioned to inform on the status of environmental conditions. The advent of microbial omics has led to an unprecedented volume of complex microbiological data sets. Importantly, these data sets are rich in biological information with potential for predictive environmental classification and forecasting. However, the patterns in this information are often hidden amongst the inherent complexity of the data. There has been a continued rise in the development and adoption of machine learning (ML) and deep learning architectures for solving research challenges of this sort. Indeed, the interface between molecular microbial ecology and artificial intelligence (AI) appears to show considerable potential for significantly advancing environmental monitoring and management practices through their application. Here, we provide a primer for ML, highlight the notion of retaining biological sample information for supervised ML, discuss workflow considerations, and review the state of the art of the exciting, yet nascent, interdisciplinary field of ML-driven microbial ecology. Current limitations in this sphere of research are also addressed to frame a forward-looking perspective toward the realization of what we anticipate will become a pivotal toolkit for addressing environmental monitoring and management challenges in the years ahead.
Affiliation(s)
- James M. W. R. McElhinney
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Aurelie Mawart
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Ayesha Hasan
- Applied Genomics Laboratory, Center for Membranes and Advanced Water Technology, Khalifa University, Abu Dhabi, United Arab Emirates
- Department of Biomedical Engineering, Khalifa University, Abu Dhabi, United Arab Emirates
- Jorge Dias
- EECS, Center for Autonomous Robotic Systems, Khalifa University, Abu Dhabi, United Arab Emirates
19
WalkIm: Compact image-based encoding for high-performance classification of biological sequences using simple tuning-free CNNs. PLoS One 2022; 17:e0267106. [PMID: 35427371 PMCID: PMC9012348 DOI: 10.1371/journal.pone.0267106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/06/2021] [Accepted: 04/01/2022] [Indexed: 11/28/2022] [Open access]
Abstract
The classification of biological sequences is an open issue for a variety of data sets, such as viral and metagenomics sequences. Many studies therefore utilize neural network tools, the well-known methods in this field, and focus on designing customized network structures. However, few works focus on more effective factors, such as the input encoding method or the implementation technology, to address accuracy and efficiency issues in this area. In this work, we therefore propose an image-based encoding method, called WalkIm, whose adoption, even in a simple neural network, provides competitive accuracy and superior efficiency compared to existing classification methods (e.g. VGDC, CASTOR, and DLM-CNN) for a variety of biological sequences. When used to classify various data sets (i.e. virus whole-genome data, metagenomics read data, and metabarcoding data), WalkIm achieves the same performance as the existing methods, with no need for parameter initialization or network architecture adjustment for each data set. It is worth noting that even for highly mutated data sets, such as Coronaviruses, it achieves almost 100% accuracy in classifying the various types. In addition, WalkIm achieves high-speed convergence during network training, as well as a reduction in network complexity. The WalkIm method therefore enables us to run the classifying neural networks on a normal desktop system in a short time. Moreover, we addressed the compatibility of the WalkIm encoding method with free-space optical processing technology. Taking advantage of an optical implementation of convolutional layers, we illustrated that the training time can be reduced by up to 500 times. In addition to all the aforementioned advantages, this encoding method preserves the structure of the generated images under various sequence transformations, such as reverse complement, complement, and reverse.
20
Mathieu A, Leclercq M, Sanabria M, Perin O, Droit A. Machine Learning and Deep Learning Applications in Metagenomic Taxonomy and Functional Annotation. Front Microbiol 2022; 13:811495. [PMID: 35359727 PMCID: PMC8964132 DOI: 10.3389/fmicb.2022.811495] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Received: 11/08/2021] [Accepted: 02/02/2022] [Indexed: 12/12/2022] [Open access]
Abstract
Shotgun sequencing of environmental DNA (i.e., metagenomics) has revolutionized the field of environmental microbiology, allowing the characterization of all microorganisms in a sequencing experiment. To identify the microbes in terms of taxonomy and biological activity, the sequenced reads must necessarily be aligned on known microbial genomes/genes. However, current alignment methods are limited in terms of speed and can produce a significant number of false positives when detecting bacterial species or false negatives in specific cases (virus, plasmids, and gene detection). Moreover, recent advances in metagenomics have enabled the reconstruction of new genomes using de novo binning strategies, but these genomes, not yet fully characterized, are not used in classic approaches, whereas machine and deep learning methods can use them as models. In this article, we attempted to review the different methods and their efficiency to improve the annotation of metagenomic sequences. Deep learning models have reached the performance of the widely used k-mer alignment-based tools, with better accuracy in certain cases; however, they still must demonstrate their robustness across the variety of environmental samples and across the rapid expansion of accessible genomes in databases.
Affiliation(s)
- Alban Mathieu
- Computational Biology Laboratory, CHU de Québec - Université Laval Research Centre, Québec City, QC, Canada
- Mickael Leclercq
- Computational Biology Laboratory, CHU de Québec - Université Laval Research Centre, Québec City, QC, Canada
- Olivier Perin
- Digital Sciences Department, L'Oréal Advanced Research, Aulnay-sous-Bois, France
- Arnaud Droit
- Computational Biology Laboratory, CHU de Québec - Université Laval Research Centre, Québec City, QC, Canada
21
Efficient and Quality-Optimized Metagenomic Pipeline Designed for Taxonomic Classification in Routine Microbiological Clinical Tests. Microorganisms 2022; 10:711. [PMID: 35456762 PMCID: PMC9026403 DOI: 10.3390/microorganisms10040711] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 12/27/2021] [Revised: 03/09/2022] [Accepted: 03/23/2022] [Indexed: 01/26/2023] [Open access]
Abstract
Metagenomics analysis is now routinely used for clinical diagnosis in several diseases, and we need confidence in interpreting metagenomics analysis of microbiota. Particularly from the side of clinical microbiology, we consider that it would be a major milestone to further advance microbiota studies with an innovative and significant approach consisting of processing steps and quality assessment for interpreting metagenomics data used for diagnosis. Here, we propose a methodology for taxon identification and abundance assessment of shotgun sequencing data of microbes that are well fitted for clinical setup. Processing steps of quality controls have been developed in order (i) to avoid low-quality reads and sequences, (ii) to optimize abundance thresholds and profiles, (iii) to combine classifiers and reference databases for best classification of species and abundance profiles for both prokaryotic and eukaryotic sequences, and (iv) to introduce external positive control. We find that the best strategy is to use a pipeline composed of a combination of different but complementary classifiers such as Kraken2/Bracken and Kaiju. Such improved quality assessment will have a major impact on the robustness of biological and clinical conclusions drawn from metagenomic studies.
22
Herrera-García JA, Martinez M, Zamora-Tavares P, Vargas-Ponce O, Hernández-Sandoval L, Rodríguez-Zaragoza FA. Metabarcoding of the phytotelmata of Pseudalcantarea grandis (Bromeliaceae) from an arid zone. PeerJ 2022; 10:e12706. [PMID: 35127281 PMCID: PMC8801176 DOI: 10.7717/peerj.12706] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 02/17/2021] [Accepted: 12/07/2021] [Indexed: 01/07/2023] [Open access]
Abstract
Background: Pseudalcantarea grandis (Schltdl.) Pinzón & Barfuss is a tank bromeliad that grows on cliffs in the southernmost portion of the Chihuahuan desert. Phytotelmata are water bodies formed by plants that function as micro-ecosystems where bacteria, algae, protists, insects, fungi, and some vertebrates can develop. We hypothesized that the bacterial diversity contained in the phytotelma formed in a bromeliad from an arid zone would differ in sites with and without surrounding vegetation. Our study aimed to characterize the bacterial composition and putative metabolic functions in P. grandis phytotelmata collected in vegetated and non-vegetated sites.
Methods: Water from 10 individuals was sampled. Five individuals had abundant surrounding vegetation, and five had little or no vegetation. We extracted DNA and amplified seven hypervariable regions of the 16S gene (V2, V4, V8, V3-6, 7-9). Metabarcoding sequencing was performed on the Ion Torrent PGM platform. Taxonomic identity was assigned by the binning reads and coverage between hit and query from the reference database of at least 90%. Putative metabolic functions of the bacterial families were assigned mainly using the FAPROTAX database. The dominance patterns in each site were visualized with rank/abundance curves using the number of Operational Taxonomic Units (OTUs) per family. A percentage similarity analysis (SIMPER) was used to estimate dissimilarity between the sites. Relationships among bacterial families (identified by the dominance analysis and SIMPER), sites, and their respective putative functions were analyzed with shade plots.
Results: A total of 1.5 million useful bacterial sequences were obtained. Sequences were clustered into OTUs, and taxonomic assignment was conducted using BLAST in the Greengenes databases. Bacterial diversity was 23 phyla, 52 classes, 98 orders, 218 families, and 297 genera. Proteobacteria (37%), Actinobacteria (19%), and Firmicutes (15%) comprised the highest percentage (71%). There was a 68.3% similarity between the two sites at family level, with 149 families shared. Aerobic chemoheterotrophy and fermentation were the main metabolic functions in both sites, followed by ureolysis, nitrate reduction, aromatic compound degradation, and nitrogen fixation. The dominant bacteria shared most of the metabolic functions between sites. Some functions were recorded for one site only and were related to families with the lowest OTU richness. Bacterial diversity in the P. grandis tanks included dominant phyla and families present at low percentage that could be considered part of a rare biosphere. A rare biosphere can form genetic reservoirs, the local abundance of which depends on external abiotic and biotic factors, while their interactions could favor micro-ecosystem resilience and resistance.
Affiliation(s)
- Mahinda Martinez
- Universidad Autónoma de Querétaro, Querétaro, Mexico; Laboratorio Nacional de Identificación y Caracterización Vegetal, Querétaro, Mexico
- Pilar Zamora-Tavares
- Instituto de Botánica, Departamento de Botánica y Zoología, Centro Universitario de Ciencias Biológicas y Agropecuarias, Universidad de Guadalajara, Guadalajara, Jalisco, Mexico; Laboratorio Nacional de Identificación y Caracterización Vegetal, Guadalajara, Mexico
- Ofelia Vargas-Ponce
- Instituto de Botánica, Departamento de Botánica y Zoología, Centro Universitario de Ciencias Biológicas y Agropecuarias, Universidad de Guadalajara, Guadalajara, Jalisco, Mexico; Laboratorio Nacional de Identificación y Caracterización Vegetal, Guadalajara, Mexico
- Luis Hernández-Sandoval
- Universidad Autónoma de Querétaro, Querétaro, Mexico; Laboratorio Nacional de Identificación y Caracterización Vegetal, Querétaro, Mexico
- Fabián Alejandro Rodríguez-Zaragoza
- Laboratorio de Ecología Molecular, Microbiología y Taxonomía (LEMITAX), Departamento de Ecología, Centro Universitario de Ciencias Biológicas y Agropecuarias, Universidad de Guadalajara, Guadalajara, Jalisco, Mexico
23
Decoding gut microbiota by imaging analysis of fecal samples. iScience 2021; 24:103481. [PMID: 34927025 PMCID: PMC8652011 DOI: 10.1016/j.isci.2021.103481] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/23/2019] [Revised: 09/21/2021] [Accepted: 11/19/2021] [Indexed: 01/09/2023] [Open access]
Abstract
The gut microbiota plays a crucial role in maintaining health. Monitoring the complex dynamics of its microbial population is, therefore, important. Here, we present a deep convolution network that can characterize the dynamic changes in the gut microbiota using low-resolution images of fecal samples. Further, we demonstrate that the microbial relative abundances, quantified via 16S rRNA amplicon sequencing, can be quantitatively predicted by the neural network. Our approach provides a simple and inexpensive method of gut microbiota analysis.
Highlights:
- A deep convolution network classifies gut microbiota based on fecal sample images
- Image-based quantitative prediction of gut microbiota composition is demonstrated
- This result provides a simple and inexpensive method of gut microbiota analysis
24
Deng Z, Zhang J, Li J, Zhang X. Application of Deep Learning in Plant-Microbiota Association Analysis. Front Genet 2021; 12:697090. [PMID: 34691142 PMCID: PMC8531731 DOI: 10.3389/fgene.2021.697090] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 04/19/2021] [Accepted: 08/31/2021] [Indexed: 01/04/2023] [Open access]
Abstract
Unraveling the association between the microbiome and plant phenotype can illustrate the effect of the microbiome on the host and thereby guide agricultural management. Adequate identification of species and appropriate choice of models are two challenges in microbiome data analysis. Computational models of microbiome data could help in association analysis between the microbiome and the plant host. Deep learning methods have been widely used to learn from microbiome data owing to their strength in handling complex, sparse, noisy, and high-dimensional data. Here, we review the analytic strategies in microbiome data analysis and describe the applications of deep learning models for plant–microbiome correlation studies. We also introduce application cases of different models in plant–microbiome correlation analysis and discuss how to adapt the models to the critical steps in data processing. In terms of data processing, model structure, and operating principle, most deep learning models are suitable for plant microbiome data analysis. Their capacity for feature representation and pattern recognition is the main advantage of deep learning methods in modeling and interpretation for association analysis. Based on published computational experiments, convolutional neural networks and graph neural networks could be recommended for plant microbiome analysis.
Affiliation(s)
- Zhiyu Deng
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
- Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, China
- Jinming Zhang
- Department of Infectious Diseases, Ruijin Hospital, Shanghai Jiaotong University School of Medicine, Shanghai, China
- Junya Li
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
- Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
- University of Chinese Academy of Sciences, Beijing, China
- Xiujun Zhang
- Key Laboratory of Plant Germplasm Enhancement and Specialty Agriculture, Wuhan Botanical Garden, Chinese Academy of Sciences, Wuhan, China
- Center of Economic Botany, Core Botanical Gardens, Chinese Academy of Sciences, Wuhan, China
|
25
|
Gupta S, Aga D, Pruden A, Zhang L, Vikesland P. Data Analytics for Environmental Science and Engineering Research. Environmental Science & Technology 2021; 55:10895-10907. [PMID: 34338518 DOI: 10.1021/acs.est.1c01026] [Citation(s) in RCA: 17] [Impact Index Per Article: 5.7] [Indexed: 06/13/2023]
Abstract
The advent of new data acquisition and handling techniques has opened the door to alternative and more comprehensive approaches to environmental monitoring that will improve our capacity to understand and manage environmental systems. Researchers have recently begun using machine learning (ML) techniques to analyze complex environmental systems and their associated data. Herein, we provide an overview of data analytics frameworks suitable for various Environmental Science and Engineering (ESE) research applications. We present current applications of ML algorithms within the ESE domain using three representative case studies: (1) Metagenomic data analysis for characterizing and tracking antimicrobial resistance in the environment; (2) Nontarget analysis for environmental pollutant profiling; and (3) Detection of anomalies in continuous data generated by engineered water systems. We conclude by proposing a path to advance incorporation of data analytics approaches in ESE research and application.
Affiliation(s)
- Suraj Gupta
- The Interdisciplinary PhD Program in Genetics, Bioinformatics, and Computational Biology, Virginia Tech, Blacksburg, Virginia 24061, United States
- Diana Aga
- Department of Chemistry, University at Buffalo, The State University of New York, Buffalo, New York 14226, United States
- Amy Pruden
- Via Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia 24061, United States
- Peter Vikesland
- Via Department of Civil and Environmental Engineering, Virginia Tech, Blacksburg, Virginia 24061, United States
|
26
|
Young RB, Marcelino VR, Chonwerawong M, Gulliver EL, Forster SC. Key Technologies for Progressing Discovery of Microbiome-Based Medicines. Front Microbiol 2021; 12:685935. [PMID: 34239510 PMCID: PMC8258393 DOI: 10.3389/fmicb.2021.685935] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Received: 03/26/2021] [Accepted: 05/25/2021] [Indexed: 12/22/2022] Open
Abstract
A growing number of experimental and computational approaches are illuminating the “microbial dark matter” and uncovering the integral role of commensal microbes in human health. Through this work, it is now clear that the human microbiome presents great potential as a therapeutic target for a plethora of diseases, including inflammatory bowel disease, diabetes and obesity. The development of more efficacious and targeted treatments relies on identification of causal links between the microbiome and disease; with future progress dependent on effective links between state-of-the-art sequencing approaches, computational analyses and experimental assays. We argue determining causation is essential, which can be attained by generating hypotheses using multi-omic functional analyses and validating these hypotheses in complex, biologically relevant experimental models. In this review we discuss existing analysis and validation methods, and propose best-practice approaches required to enable the next phase of microbiome research.
Affiliation(s)
- Remy B Young
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC, Australia
- Infection and Immunity Program, Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Clayton, VIC, Australia
- Vanessa R Marcelino
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC, Australia
- Department of Molecular and Translational Sciences, Monash University, Clayton, VIC, Australia
- Michelle Chonwerawong
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC, Australia
- Department of Molecular and Translational Sciences, Monash University, Clayton, VIC, Australia
- Emily L Gulliver
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC, Australia
- Infection and Immunity Program, Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Clayton, VIC, Australia
- Department of Molecular and Translational Sciences, Monash University, Clayton, VIC, Australia
- Samuel C Forster
- Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC, Australia
- Infection and Immunity Program, Monash Biomedicine Discovery Institute and Department of Microbiology, Monash University, Clayton, VIC, Australia
- Department of Molecular and Translational Sciences, Monash University, Clayton, VIC, Australia
|
27
|
Ziemski M, Wisanwanichthan T, Bokulich NA, Kaehler BD. Beating Naive Bayes at Taxonomic Classification of 16S rRNA Gene Sequences. Front Microbiol 2021; 12:644487. [PMID: 34220738 PMCID: PMC8249850 DOI: 10.3389/fmicb.2021.644487] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 12/21/2020] [Accepted: 05/31/2021] [Indexed: 12/28/2022] Open
Abstract
Naive Bayes classifiers (NBC) have dominated the field of taxonomic classification of amplicon sequences for over a decade. Apart from having runtime requirements that allow them to be trained and used on modest laptops, they have persistently provided class-topping classification accuracy. In this work we compare NBC with random forest classifiers, neural network classifiers, and a perfect classifier that can only fail when different species have identical sequences, and find that in some practical scenarios there is little scope for improving on NBC for taxonomic classification of 16S rRNA gene sequences. Further improvements in taxonomy classification are unlikely to come from novel algorithms alone, and will need to leverage other technological innovations, such as ecological frequency information.
Affiliation(s)
- Michal Ziemski
- Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zürich, Zurich, Switzerland
- Nicholas A. Bokulich
- Laboratory of Food Systems Biotechnology, Institute of Food, Nutrition, and Health, ETH Zürich, Zurich, Switzerland
|
28
|
Survey of artificial intelligence approaches in the study of anthropogenic impacts on symbiotic organisms – a holistic view. Symbiosis 2021. [DOI: 10.1007/s13199-021-00778-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/21/2022]
|
29
|
Karagöz MA, Nalbantoglu OU. Taxonomic classification of metagenomic sequences from Relative Abundance Index profiles using deep learning. Biomed Signal Process Control 2021. [DOI: 10.1016/j.bspc.2021.102539] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022]
|
30
|
Kaden M, Bohnsack KS, Weber M, Kudła M, Gutowska K, Blazewicz J, Villmann T. Learning vector quantization as an interpretable classifier for the detection of SARS-CoV-2 types based on their RNA sequences. Neural Comput Appl 2021; 34:67-78. [PMID: 33935376 PMCID: PMC8076884 DOI: 10.1007/s00521-021-06018-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 07/13/2020] [Accepted: 04/07/2021] [Indexed: 02/06/2023]
Abstract
We present an approach to discriminate SARS-CoV-2 virus types based on their RNA sequence descriptions, avoiding a sequence alignment. For that purpose, sequences are preprocessed by feature extraction, and the resulting feature vectors are analyzed by prototype-based classification to remain interpretable. In particular, we propose to use variants of learning vector quantization (LVQ) based on dissimilarity measures for RNA sequence data. The respective matrix LVQ provides additional knowledge about the classification decisions, such as discriminant feature correlations, and can additionally be equipped with easy-to-realize reject options for uncertain data. Those options provide self-controlled evidence, i.e., the model refuses to make a classification decision if the model evidence for the presented data is not sufficient. This model is first trained using a GISAID dataset with given virus types detected according to the molecular differences in coronavirus populations by phylogenetic tree clustering. In a second step, we apply the trained model to another, unlabeled SARS-CoV-2 virus dataset. For these data, we can either assign a virus type to the sequences or reject atypical samples. The rejected sequences allow speculation about new virus types with respect to nucleotide base mutations in the viral sequences. Moreover, this rejection analysis improves model robustness. Last but not least, the presented approach has lower computational complexity than methods based on (multiple) sequence alignment. SUPPLEMENTARY INFORMATION The online version contains supplementary material available at 10.1007/s00521-021-06018-2.
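The reject option described in this abstract can be illustrated with a minimal nearest-prototype sketch. It is not the authors' matrix LVQ: the Euclidean distance, the relative-distance certainty score, the `threshold` value, and the function names are all assumptions chosen for illustration, and prototype training is omitted. The sketch expects prototypes from at least two classes.

```python
import math

def euclidean(a, b):
    """Plain Euclidean distance (a stand-in for the paper's dissimilarity measures)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def classify_with_reject(x, prototypes, threshold=0.2):
    """Return the label of the nearest prototype, or None to reject.

    prototypes: list of (vector, label) pairs covering at least two classes.
    The certainty score (d_other - d_win) / (d_other + d_win) lies in [0, 1];
    samples scoring below `threshold` (a hypothetical value) are rejected.
    """
    # Distance to the closest prototype of each class.
    best = {}
    for vec, label in prototypes:
        d = euclidean(x, vec)
        if label not in best or d < best[label]:
            best[label] = d
    ranked = sorted(best.items(), key=lambda kv: kv[1])
    (label, d_win), (_, d_other) = ranked[0], ranked[1]
    total = d_win + d_other
    certainty = (d_other - d_win) / total if total > 0 else 0.0
    return label if certainty >= threshold else None

prototypes = [([0.0, 0.0], "A"), ([1.0, 1.0], "B")]
print(classify_with_reject([0.1, 0.1], prototypes))  # close to the "A" prototype
print(classify_with_reject([0.5, 0.5], prototypes))  # equidistant, so rejected
```

A sample lying midway between two class prototypes gets a certainty near zero and is rejected, which mirrors the idea of flagging atypical sequences rather than forcing a label.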
Affiliation(s)
- Marika Kaden
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Katrin Sophie Bohnsack
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mirko Weber
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
- Mateusz Kudła
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Kaja Gutowska
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Jacek Blazewicz
- Institute of Computing Science, Poznan University of Technology, Piotrowo 2, 60-965 Poznan, Poland
- Institute of Bioorganic Chemistry, Polish Academy of Sciences, Noskowskiego 12/14, 61-704 Poznan, Poland
- European Centre for Bioinformatics and Genomics, Piotrowo 2, 60-965 Poznan, Poland
- Thomas Villmann
- University of Applied Sciences Mittweida, Technikumplatz 17, 09648 Mittweida, Germany
- Saxon Institute for Computational Intelligence and Machine Learning, Technikumplatz 17, 09648 Mittweida, Germany
|
31
|
Du Z, Xiao X, Uversky VN. Classification of Chromosomal DNA Sequences Using Hybrid Deep Learning Architectures. Curr Bioinform 2021. [DOI: 10.2174/1574893615666200224095531] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Indexed: 11/22/2022]
Abstract
Background: Chromosomal DNA contains most of the genetic information of eukaryotes and plays an important role in the growth, development and reproduction of living organisms. Most chromosomal DNA sequences are known to wrap around histones, and distinguishing these DNA sequences from ordinary DNA sequences is important for understanding the genetic code of life. The main difficulty behind this problem is the feature selection process. DNA sequences have no explicit features, and the common representation methods, such as one-hot coding, introduced the major drawback of high dimensionality. Recently, deep learning models have been proved to be able to automatically extract useful features from input patterns.
Objective: We aim to investigate which deep learning networks could achieve notable improvements in the field of DNA sequence classification using only sequence information.
Methods: In this paper, we present four different deep learning architectures using convolutional neural networks and long short-term memory networks for the purpose of chromosomal DNA sequence classification. Natural language model Word2vec was used to generate word embedding of sequence and learn features from it by deep learning.
Results: The comparison of these four architectures is carried out on 10 chromosomal DNA datasets. The results show that the architecture of convolutional neural networks combined with long short-term memory networks is superior to other methods with regards to the accuracy of chromosomal DNA prediction.
Conclusion: In this study, four deep learning models were compared for an automatic classification of chromosomal DNA sequences with no steps of sequence preprocessing. In particular, we have regarded DNA sequences as natural language and extracted word embedding with Word2Vec to represent DNA sequences. Results show a superiority of the CNN+LSTM model in the ten classification tasks. The reason for this success is that the CNN module captures the regulatory motifs, while the following LSTM layer captures the long-term dependencies between them.
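Treating DNA as natural language requires splitting a sequence into "words" before an embedding such as Word2Vec can be learned. The abstract does not say how the words were formed; overlapping k-mers are one common choice and are assumed in the sketch below, where `kmer_tokens`, `build_vocab`, and the defaults `k=3`, `stride=1` are hypothetical.

```python
def kmer_tokens(seq, k=3, stride=1):
    """Split a DNA string into k-mer 'words' (overlapping when stride < k)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]

def build_vocab(token_lists):
    """Map each distinct k-mer to an integer id, in order of first appearance."""
    vocab = {}
    for tokens in token_lists:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab

sentences = [kmer_tokens(s) for s in ["ACGTAC", "GTACGT"]]
vocab = build_vocab(sentences)
# Integer-encoded sequences, ready for an embedding layer or an embedding trainer.
encoded = [[vocab[t] for t in toks] for toks in sentences]
print(sentences[0], encoded[0])
```

The token lists play the role of sentences for an embedding model, and the integer ids are what a downstream CNN or LSTM embedding layer would consume.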
Affiliation(s)
- Zhihua Du
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, China
- Xiangdong Xiao
- Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ), Shenzhen University, China
- Vladimir N. Uversky
- Department of Molecular Medicine, Morsani College of Medicine, University of South Florida, 12901 Bruce B. Downs Blvd. MDC07, Tampa, Florida, United States
|
32
|
Ghannam RB, Techtmann SM. Machine learning applications in microbial ecology, human microbiome studies, and environmental monitoring. Comput Struct Biotechnol J 2021; 19:1092-1107. [PMID: 33680353 PMCID: PMC7892807 DOI: 10.1016/j.csbj.2021.01.028] [Citation(s) in RCA: 74] [Impact Index Per Article: 24.7] [Received: 10/02/2020] [Revised: 01/16/2021] [Accepted: 01/18/2021] [Indexed: 01/04/2023] Open
Abstract
Advances in nucleic acid sequencing technology have enabled expansion of our ability to profile microbial diversity. These large datasets of taxonomic and functional diversity are key to better understanding microbial ecology. Machine learning has proven to be a useful approach for analyzing microbial community data and making predictions about outcomes including human and environmental health. Machine learning applied to microbial community profiles has been used to predict disease states in human health, environmental quality and presence of contamination in the environment, and as trace evidence in forensics. Machine learning has appeal as a powerful tool that can provide deep insights into microbial communities and identify patterns in microbial community data. However, often machine learning models can be used as black boxes to predict a specific outcome, with little understanding of how the models arrived at predictions. Complex machine learning algorithms often may value higher accuracy and performance at the sacrifice of interpretability. In order to leverage machine learning into more translational research related to the microbiome and strengthen our ability to extract meaningful biological information, it is important for models to be interpretable. Here we review current trends in machine learning applications in microbial ecology as well as some of the important challenges and opportunities for more broad application of machine learning to understanding microbial communities.
Key Words
- 16S rRNA
- ANN, Artificial Neural Networks
- ASV, Amplicon Sequence Variant
- AUC, Area Under the Curve
- Forensics
- GB, Gradient Boosting
- ML, Machine Learning
- Machine learning
- Marker genes
- Metagenomics
- PCoA, Principal Coordinate Analysis
- RF, Random Forests
- ROC, Receiver Operating Characteristic
- SML, Supervised Machine Learning
- SVM, Support Vector Machines
- USML, Unsupervised Machine Learning
- tSNE, t-distributed Stochastic Neighbor Embedding
Affiliation(s)
- Ryan B. Ghannam
- Department of Biological Sciences, Michigan Technological University, Houghton MI, United States
- Stephen M. Techtmann
- Department of Biological Sciences, Michigan Technological University, Houghton MI, United States
|
33
|
Imchen M, Kumavath R. Metagenomic insights into the antibiotic resistome of mangrove sediments and their association to socioeconomic status. Environmental Pollution (Barking, Essex: 1987) 2021; 268:115795. [PMID: 33068846 DOI: 10.1016/j.envpol.2020.115795] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 06/25/2020] [Revised: 09/03/2020] [Accepted: 10/06/2020] [Indexed: 06/11/2023]
Abstract
Mangrove sediments are prone to anthropogenic activities that could enrich antibiotic resistance genes (ARGs). The emergence and dissemination of ARGs are of serious concern to public health worldwide. Therefore, a comprehensive resistome analysis of global mangrove sediment is of paramount importance. In this study, we have implemented a deep machine learning approach to analyze the resistome of mangrove sediments from Brazil, China, Saudi Arabia, India, and Malaysia. Geography (R_ANOSIM = 39.26%; p < 0.005) as well as human intervention (R_ANOSIM = 16.92%; p < 0.005) influenced the ARG diversity. ARG diversity was also inversely correlated with the human development index (HDI) of the host country (R = -0.53; p < 0.05) rather than with antibiotic consumption (p > 0.05). Several genes, including multidrug efflux pumps, were significantly (p < 0.05) enriched in the sites with human intervention. The resistome was consistently dominated by rpoB2 (19.26 ± 0.01%), multidrug ABC transporter (10.40 ± 0.23%), macB (8.84 ± 0.36%), tetA (4.13 ± 0.35%), mexF (3.26 ± 0.19%), CpxR (2.93 ± 0.2%), bcrA (2.38 ± 0.24%), acrB (2.37 ± 0.18%), mexW (2.19 ± 0.17%), and vanR (1.99 ± 0.11%). In addition, mobile ARGs such as vanA, tet(48), mcr, and tetX were also detected in the mangrove sediments. Comparative analysis against terrestrial and ocean resistomes showed that the ocean ecosystem harbored the lowest ARG diversity (Chao1 = 71.12), followed by mangroves (Chao1 = 258.07) and the terrestrial ecosystem (Chao1 = 294.07). ARG subtypes such as abeS and qacG were detected exclusively in ocean datasets. Likewise, rpoB2, multidrug ABC transporter, and macB, detected in mangrove and terrestrial datasets, were not detected in the ocean datasets. This study shows that socioeconomic factors strongly determine the antibiotic resistome in mangroves. Direct anthropogenic intervention in the mangrove environment also enriches the antibiotic resistome.
Affiliation(s)
- Madangchanok Imchen
- Department of Genomic Science, School of Biological Sciences, Central University of Kerala, Tejaswini Hills, Periya (P.O) Kasaragod, Kerala, 671320, India
- Ranjith Kumavath
- Department of Genomic Science, School of Biological Sciences, Central University of Kerala, Tejaswini Hills, Periya (P.O) Kasaragod, Kerala, 671320, India
|
34
|
Zheng D, Pang G, Liu B, Chen L, Yang J. Learning transferable deep convolutional neural networks for the classification of bacterial virulence factors. Bioinformatics 2020; 36:3693-3702. [PMID: 32251507 DOI: 10.1093/bioinformatics/btaa230] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/02/2019] [Revised: 03/25/2020] [Accepted: 04/01/2020] [Indexed: 12/23/2022] Open
Abstract
MOTIVATION Identification of virulence factors (VFs) is critical to the elucidation of bacterial pathogenesis and prevention of related infectious diseases. Current computational methods for VF prediction focus on binary classification or involve only several class(es) of VFs with sufficient samples. However, thousands of VF classes are present in real-world scenarios, and many of them only have a very limited number of samples available. RESULTS We first construct a large VF dataset, covering 3446 VF classes with 160 495 sequences, and then propose deep convolutional neural network models for VF classification. We show that (i) for common VF classes with sufficient samples, our models can achieve state-of-the-art performance with an overall accuracy of 0.9831 and an F1-score of 0.9803; (ii) for uncommon VF classes with limited samples, our models can learn transferable features from auxiliary data and achieve good performance with accuracy ranging from 0.9277 to 0.9512 and F1-score ranging from 0.9168 to 0.9446 when combined with different predefined features, outperforming traditional classifiers by 1-13% in accuracy and by 1-16% in F1-score. AVAILABILITY AND IMPLEMENTATION All of our datasets are made publicly available at http://www.mgc.ac.cn/VFNet/, and the source code of our models is publicly available at https://github.com/zhengdd0422/VFNet. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Dandan Zheng
- NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
- Guansong Pang
- Australian Institute for Machine Learning, The University of Adelaide, Adelaide, SA 5005, Australia
- Bo Liu
- NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
- Lihong Chen
- NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
- Jian Yang
- NHC Key Laboratory of Systems Biology of Pathogens, Institute of Pathogen Biology, Chinese Academy of Medical Sciences and Peking Union Medical College, Beijing 100176, China
|
35
|
Power spectrum and dynamic time warping for DNA sequences classification. Evolving Systems 2020. [DOI: 10.1007/s12530-019-09306-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
|
36
|
Zhao Z, Cristian A, Rosen G. Keeping up with the genomes: efficient learning of our increasing knowledge of the tree of life. BMC Bioinformatics 2020; 21:412. [PMID: 32957925 PMCID: PMC7507296 DOI: 10.1186/s12859-020-03744-7] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 02/29/2020] [Accepted: 09/08/2020] [Indexed: 11/26/2022] Open
Abstract
BACKGROUND It is a computational challenge for current metagenomic classifiers to keep up with the pace of training data generated from genome sequencing projects, such as the exponentially-growing NCBI RefSeq bacterial genome database. When new reference sequences are added to training data, statically trained classifiers must be rerun on all data, resulting in a highly inefficient process. The rich literature of "incremental learning" addresses the need to update an existing classifier to accommodate new data without sacrificing much accuracy compared to retraining the classifier with all data. RESULTS We demonstrate how classification improves over time by incrementally training a classifier on progressive RefSeq snapshots and testing it on: (a) all known current genomes (as a ground truth set) and (b) a real experimental metagenomic gut sample. We demonstrate that as a classifier model's knowledge of genomes grows, classification accuracy increases. The proof-of-concept naïve Bayes implementation, when updated yearly, now runs in 1/4th of the non-incremental time with no accuracy loss. CONCLUSIONS It is evident that classification improves by having the most current knowledge at its disposal. Therefore, it is of utmost importance to make classifiers computationally tractable to keep up with the data deluge. The incremental learning classifier can be efficiently updated without the cost of reprocessing or the need to access the existing database, and therefore saves storage as well as computational resources.
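Why a naïve Bayes classifier can be updated without reprocessing old data is easy to see in a small sketch: the model is just per-class k-mer counts, so new genomes only add to those counts. The code below is an illustration under stated assumptions, not the authors' implementation; the class name `IncrementalKmerNB`, the defaults `k=3` and `alpha=1.0` (Laplace smoothing), and the uniform class prior are all hypothetical choices.

```python
import math
from collections import Counter, defaultdict

class IncrementalKmerNB:
    """Multinomial naive Bayes over k-mer counts that supports incremental updates."""

    def __init__(self, k=3, alpha=1.0):
        self.k = k
        self.alpha = alpha                   # Laplace smoothing constant (assumed)
        self.counts = defaultdict(Counter)   # label -> k-mer -> count
        self.totals = Counter()              # label -> total k-mers seen

    def _kmers(self, seq):
        return [seq[i:i + self.k] for i in range(len(seq) - self.k + 1)]

    def update(self, label, seq):
        """Add one new reference sequence; earlier sequences are never revisited."""
        kmers = self._kmers(seq)
        self.counts[label].update(kmers)
        self.totals[label] += len(kmers)

    def predict(self, seq):
        """Return the label with the highest log-likelihood (uniform prior assumed)."""
        vocab_size = 4 ** self.k
        kmers = self._kmers(seq)
        best_label, best_score = None, -math.inf
        for label, counter in self.counts.items():
            denom = self.totals[label] + self.alpha * vocab_size
            score = sum(math.log((counter[km] + self.alpha) / denom) for km in kmers)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

nb = IncrementalKmerNB()
nb.update("taxon_A", "AAAAAAAA")
nb.update("taxon_C", "CCCCCCCC")
nb.update("taxon_G", "GGGGGGGG")   # a later "snapshot": only the new genome is processed
print(nb.predict("GGGG"), nb.predict("AAAA"))
```

Because the stored counts are sufficient statistics, the original sequences do not need to be kept once they have been counted, which is the storage saving the abstract refers to.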
Affiliation(s)
- Zhengqiao Zhao
- Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University, Market Street, Philadelphia, US
- Alexandru Cristian
- Department of Computer Science, Drexel University, Market Street, Philadelphia, US
- Gail Rosen
- Ecological and Evolutionary Signal-process and Informatics (EESI) Lab, Department of Electrical and Computer Engineering, Drexel University, Market Street, Philadelphia, US
|
37
|
Han R, Wang S, Gao X. Novel algorithms for efficient subsequence searching and mapping in nanopore raw signals towards targeted sequencing. Bioinformatics 2020; 36:1333-1343. [PMID: 31593235 DOI: 10.1093/bioinformatics/btz742] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/03/2019] [Revised: 07/24/2019] [Accepted: 10/01/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Genome diagnostics have gradually become a prevailing routine for human healthcare. With the advances in understanding the causal genes for many human diseases, targeted sequencing provides a rapid, cost-efficient and focused option for clinical applications, such as single nucleotide polymorphism (SNP) detection and haplotype classification, in a specific genomic region. Although nanopore sequencing offers a perfect tool for targeted sequencing because of its mobility, PCR-freeness and long read properties, it poses a challenging computational problem of how to efficiently and accurately search and map genomic subsequences of interest in a pool of nanopore reads (or raw signals). Due to its relatively low sequencing accuracy, there is no reliable solution to this problem, especially at low sequencing coverage. RESULTS Here, we propose a brand new signal-based subsequence inquiry pipeline as well as two novel algorithms to tackle this problem. The proposed algorithms follow the principle of subsequence dynamic time warping and directly operate on the electrical current signals, without loss of information in base-calling. Therefore, the proposed algorithms can serve as a tool for sequence inquiry in targeted sequencing. Two novel criteria are offered for the consequent signal quality analysis and data classification. Comprehensive experiments on real-world nanopore datasets show the efficiency and effectiveness of the proposed algorithms. We further demonstrate the potential applications of the proposed algorithms in two typical tasks in nanopore-based targeted sequencing: SNP detection under low sequencing coverage, and haplotype classification under low sequencing accuracy. AVAILABILITY AND IMPLEMENTATION The project is accessible at https://github.com/icthrm/cwSDTWnano.git, and the presented bench data is available upon request. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
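The principle of subsequence dynamic time warping that this abstract builds on can be sketched in a few lines. This is the textbook quadratic-time formulation, not either of the authors' two algorithms: the query must be matched in full, but the match may start and end anywhere in the longer reference signal. The function name and the absolute-difference local cost are assumptions.

```python
def subsequence_dtw(query, reference):
    """Find the best-matching stretch of `reference` for the whole `query`.

    Returns (cost, end_index), where end_index is the 0-based position in
    `reference` at which the best alignment ends.
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    # D[i][j]: cost of the best alignment of query[:i] that ends at reference[j-1].
    D = [[inf] * (m + 1) for _ in range(n + 1)]
    D[0] = [0.0] * (m + 1)  # free start: the match may begin at any reference position
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(query[i - 1] - reference[j - 1])
            D[i][j] = cost + min(D[i - 1][j - 1], D[i - 1][j], D[i][j - 1])
    # Free end: pick the cheapest cell in the last row.
    end = min(range(1, m + 1), key=lambda j: D[n][j])
    return D[n][end], end - 1

cost, end = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0, 0])
print(cost, end)
```

The zero-initialised first row is the only difference from ordinary DTW; it is what lets the query align to a subsequence instead of the entire reference.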
Affiliation(s)
- Renmin Han
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
- Sheng Wang
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
- Xin Gao
- King Abdullah University of Science and Technology (KAUST), Computational Bioscience Research Center (CBRC), Computer, Electrical and Mathematical Sciences and Engineering (CEMSE) Division, Thuwal 23955-6900, Saudi Arabia
|
38
|
Amato D, Bosco GL, Rizzo R. CORENup: a combination of convolutional and recurrent deep neural networks for nucleosome positioning identification. BMC Bioinformatics 2020; 21:326. [PMID: 32938377 PMCID: PMC7493859 DOI: 10.1186/s12859-020-03627-x] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 06/16/2020] [Accepted: 06/22/2020] [Indexed: 12/14/2022] Open
Abstract
BACKGROUND Nucleosomes wrap the DNA into the nucleus of the Eukaryote cell and regulate its transcription phase. Several studies indicate that nucleosomes are determined by the combined effects of several factors, including DNA sequence organization. Interestingly, the identification of nucleosomes on a genomic scale has been successfully performed by computational methods using DNA sequence as input data. RESULTS In this work, we propose CORENup, a deep learning model for nucleosome identification. CORENup processes a DNA sequence as input using one-hot representation and combines in a parallel fashion a fully convolutional neural network and a recurrent layer. These two parallel levels are devoted to catching both non periodic and periodic DNA string features. A dense layer is devoted to their combination to give a final classification. CONCLUSIONS Results computed on public data sets of different organisms show that CORENup is a state of the art methodology for nucleosome positioning identification based on a Deep Neural Network architecture. The comparisons have been carried out using two groups of datasets, currently adopted by the best performing methods, and CORENup has shown top performance both in terms of classification metrics and elapsed computation time.
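The one-hot representation mentioned as CORENup's input can be sketched as follows. The column order `ACGT` and the all-zero row for ambiguous bases such as `N` are assumptions made for illustration; the abstract does not specify either.

```python
ALPHABET = "ACGT"  # assumed column order

def one_hot(seq):
    """Encode a DNA string as a list of 4-element rows, one row per base.

    Bases outside ALPHABET (e.g. 'N') become all-zero rows, which is an
    assumption of this sketch rather than something stated in the abstract.
    """
    return [[1 if base == b else 0 for b in ALPHABET] for base in seq.upper()]

for base, row in zip("ACGTN", one_hot("ACGTN")):
    print(base, row)
```

The resulting length-by-4 matrix is the kind of input that both a convolutional branch and a recurrent branch can consume in parallel.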
Affiliation(s)
- Domenico Amato
- Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi, 34, Palermo, 90123, Italy
| | - Giosuè Lo Bosco
- Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi, 34, Palermo, 90123, Italy; Dipartimento di Scienze per l'Innovazione tecnologica, Istituto Euro-Mediterraneo di Scienza e Tecnologia, Via Michele Miraglia, 20, Palermo, 90139, Italy
| | - Riccardo Rizzo
- CNR-ICAR, National Research Council of Italy, Via Ugo La Malfa, 153, Palermo, 90146, Italy
| |
|
39
|
Urso A, Fiannaca A, La Rosa M, La Paglia L, Lo Bosco G, Rizzo R. BITS2019: the sixteenth annual meeting of the Italian society of bioinformatics. BMC Bioinformatics 2020; 21:363. [PMID: 32938383 PMCID: PMC7493178 DOI: 10.1186/s12859-020-03708-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/21/2022]
Abstract
The 16th Annual Meeting of the Bioinformatics Italian Society was held in Palermo, Italy, on June 26-28, 2019. More than 80 scientific contributions were presented, including 4 keynote lectures, 31 oral communications and 49 posters. In addition, three workshops were organised before and during the meeting. Full papers from some of the works presented in Palermo were submitted for this Supplement of BMC Bioinformatics. Here, we provide an overview of the meeting's aims and scope. We also briefly introduce the selected papers that have been accepted for publication in this Supplement, for a complete presentation of the outcomes of the meeting.
Affiliation(s)
- Alfonso Urso
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy.
| | - Antonino Fiannaca
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| | - Massimo La Rosa
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| | - Laura La Paglia
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| | - Giosuè Lo Bosco
- Department of Mathematics and Computer Science, University of Palermo, Palermo, 90128, Italy
| | - Riccardo Rizzo
- ICAR-CNR, Institute for high performance computing and networking, National Research Council of Italy, Palermo, 90146, Italy
| |
|
40
|
Su X, Jing G, Zhang Y, Wu S. Method development for cross-study microbiome data mining: Challenges and opportunities. Comput Struct Biotechnol J 2020; 18:2075-2080. [PMID: 32802279 PMCID: PMC7419250 DOI: 10.1016/j.csbj.2020.07.020] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 06/04/2020] [Revised: 07/22/2020] [Accepted: 07/24/2020] [Indexed: 01/26/2023]
Abstract
During the past decade, a tremendous amount of microbiome sequencing data has been generated to study the dynamic associations between microbial profiles and environments. How to precisely and efficiently decipher large-scale microbiome data and further take advantage of it has become one of the most essential bottlenecks for microbiome research at present. In this mini-review, we focus on the three key steps of analyzing cross-study microbiome datasets: microbiome profiling, data integration, and data mining. By introducing the current bioinformatics approaches and discussing their limitations, we outline the opportunities in the development of computational methods for the three steps, and propose promising solutions for multi-omics data analysis for comprehensive understanding and rapid investigation of the microbiome from different angles, which could potentially promote data-driven research by providing a broader view of the "microbiome data space".
Affiliation(s)
- Xiaoquan Su
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071 China
- Single-Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101 China
| | - Gongchao Jing
- Single-Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101 China
| | - Yufeng Zhang
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071 China
- Single-Cell Center, Qingdao Institute of BioEnergy and Bioprocess Technology, Chinese Academy of Sciences, Qingdao, Shandong 266101 China
| | - Shunyao Wu
- College of Computer Science and Technology, Qingdao University, Qingdao, Shandong 266071 China
| |
|
41
|
Deep learning model for metagenome fragment classification using spaced k-mers feature extraction. JURNAL TEKNOLOGI DAN SISTEM KOMPUTER 2020. [DOI: 10.14710/jtsiskom.2020.13407] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/21/2022]
Abstract
An open challenge in bioinformatics is the analysis of sequenced metagenomes from various environments. Several studies have demonstrated bacterial classification at the genus level using k-mers for feature extraction, where higher values of k give better accuracy but are costly in terms of computational resources and time. The spaced k-mers method was used to extract sequence features using the pattern 111 1111 10001, where 1 marks a position that must match and 0 marks a position that may or may not match. Currently, deep learning provides the best solutions to many problems in image recognition, speech recognition, and natural language processing. In this research, two different deep learning architectures, namely a Deep Neural Network (DNN) and a Convolutional Neural Network (CNN), were trained for the taxonomic classification of metagenome data, with the spaced k-mers method used for feature extraction. The results showed that the DNN classifier reached 90.89% and the CNN classifier reached 88.89% accuracy at the genus level.
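The spaced k-mer idea described in this abstract can be sketched as follows. This is a minimal illustration and not the authors' implementation: the function names `spaced_kmers` and `spaced_kmer_counts` are hypothetical, and the short mask `1101` is chosen here only for readability rather than taken from the abstract. Positions marked `1` are kept, and positions marked `0` are treated as "don't care" and dropped.

```python
from collections import Counter


def spaced_kmers(seq, mask):
    """Yield one spaced k-mer per window: keep bases where mask is '1'."""
    keep = [i for i, m in enumerate(mask) if m == "1"]
    span = len(mask)
    for start in range(len(seq) - span + 1):
        window = seq[start:start + span]
        yield "".join(window[i] for i in keep)


def spaced_kmer_counts(seq, mask):
    """Count spaced k-mers over all windows of an upper-cased sequence."""
    return Counter(spaced_kmers(seq.upper(), mask))


# Hypothetical toy input: 8 bases and a 4-wide mask give 5 windows.
counts = spaced_kmer_counts("ACGTACGT", "1101")
```

The resulting counts can be arranged into a fixed-length feature vector over all possible spaced k-mers, which is the kind of input a DNN or CNN classifier would then be trained on.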
|
42
|
Vu D, Groenewald M, Verkley G. Convolutional neural networks improve fungal classification. Sci Rep 2020; 10:12628. [PMID: 32724224 PMCID: PMC7387343 DOI: 10.1038/s41598-020-69245-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Received: 02/28/2020] [Accepted: 07/06/2020] [Indexed: 01/30/2023]
Abstract
Sequence classification plays an important role in metagenomics studies. We assess the deep neural network approach for fungal sequence classification, as it has emerged as a successful paradigm for big data classification and clustering. Two deep learning-based classifiers, a convolutional neural network (CNN) and a deep belief network (DBN), were trained using our recently released barcode datasets. Experimental results show that the CNN outperformed the traditional BLAST classification and the most accurate machine learning-based classifier, the Ribosomal Database Project (RDP) classifier, on datasets that had many of the labels present in the training datasets. When classifying an independent dataset, namely the "Top 50 Most Wanted Fungi", the CNN and DBN assigned fewer sequences than BLAST. However, they could assign many more sequences than the RDP classifier. In terms of efficiency, it took the machine learning classifiers up to two seconds to classify a test dataset, whereas BLAST took 53 s. The results of the current study will enable us to speed up the taxonomic assignments for the fungal barcode sequences generated at our institute, as ~70% of them still need to be validated for public release. In addition, they will help to quickly provide a taxonomic profile for metagenomics samples.
Affiliation(s)
- Duong Vu
- Westerdijk Fungal Biodiversity Institute, Uppsalalaan 8, 3584CT, Utrecht, The Netherlands.
| | - Marizeth Groenewald
- Westerdijk Fungal Biodiversity Institute, Uppsalalaan 8, 3584CT, Utrecht, The Netherlands
| | - Gerard Verkley
- Westerdijk Fungal Biodiversity Institute, Uppsalalaan 8, 3584CT, Utrecht, The Netherlands
| |
|
43
|
Shang J, Sun Y. CHEER: HierarCHical taxonomic classification for viral mEtagEnomic data via deep leaRning. Methods 2020; 189:95-103. [PMID: 32454212 PMCID: PMC7255349 DOI: 10.1016/j.ymeth.2020.05.018] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/03/2020] [Revised: 05/05/2020] [Accepted: 05/17/2020] [Indexed: 02/07/2023]
Abstract
The fast accumulation of viral metagenomic data has contributed significantly to new RNA virus discovery. However, the short read size, complex composition, and large data size can all make taxonomic analysis difficult. In particular, commonly used alignment-based methods are not ideal choices for detecting new viral species. In this work, we present a novel hierarchical classification model named CHEER, which can conduct read-level taxonomic classification from order to genus for new species. By combining k-mer embedding-based encoding, hierarchically organized CNNs, and a carefully trained rejection layer, CHEER is able to assign correct taxonomic labels to reads from new species. We tested CHEER on both simulated and real sequencing data. The results show that CHEER can achieve higher accuracy than popular alignment-based and alignment-free taxonomic assignment tools. The source code, scripts, and pre-trained parameters for CHEER are available via GitHub: https://github.com/KennthShang/CHEER.
Affiliation(s)
- Jiayu Shang
- Electrical Engineering Dept., City University of Hong Kong, Kowloon, Hong Kong Special Administrative Region
| | - Yanni Sun
- Electrical Engineering Dept., City University of Hong Kong, Kowloon, Hong Kong Special Administrative Region.
| |
|
44
|
Yan H, Bombarely A, Li S. DeepTE: a computational method for de novo classification of transposons with convolutional neural network. Bioinformatics 2020; 36:4269-4275. [DOI: 10.1093/bioinformatics/btaa519] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Received: 01/09/2020] [Revised: 04/12/2020] [Accepted: 05/12/2020] [Indexed: 01/23/2023]
Abstract
Motivation
Transposable elements (TEs) classification is an essential step to decode their roles in genome evolution. With a large number of genomes from non-model species becoming available, accurate and efficient TE classification has emerged as a new challenge in genomic sequence analysis.
Results
We developed a novel tool, DeepTE, which classifies unknown TEs using convolutional neural networks (CNNs). DeepTE transferred sequences into input vectors based on k-mer counts. A tree structured classification process was used where eight models were trained to classify TEs into super families and orders. DeepTE also detected domains inside TEs to correct false classification. An additional model was trained to distinguish between non-TEs and TEs in plants. Given unclassified TEs of different species, DeepTE can classify TEs into seven orders, which include 15, 24 and 16 super families in plants, metazoans and fungi, respectively. In several benchmarking tests, DeepTE outperformed other existing tools for TE classification. In conclusion, DeepTE successfully leverages CNN for TE classification, and can be used to precisely classify TEs in newly sequenced eukaryotic genomes.
Availability and implementation
DeepTE is accessible at https://github.com/LiLabAtVT/DeepTE.
Supplementary information
Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Haidong Yan
- School of Plant and Environmental Sciences (SPES), Virginia Tech, Blacksburg, VA 24061, USA
| | - Aureliano Bombarely
- School of Plant and Environmental Sciences (SPES), Virginia Tech, Blacksburg, VA 24061, USA
- Department of Life Sciences, University of Milan, Milan 20122, Italy
| | - Song Li
- School of Plant and Environmental Sciences (SPES), Virginia Tech, Blacksburg, VA 24061, USA
- Graduate Program in Genetics, Bioinformatics and Computational Biology (GBCB), Virginia Tech, Blacksburg, VA 24061, USA
| |
|
45
|
Sperlea T, Muth L, Martin R, Weigel C, Waldminghaus T, Heider D. gammaBOriS: Identification and Taxonomic Classification of Origins of Replication in Gammaproteobacteria using Motif-based Machine Learning. Sci Rep 2020; 10:6727. [PMID: 32317695 PMCID: PMC7174414 DOI: 10.1038/s41598-020-63424-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 11/18/2019] [Accepted: 03/31/2020] [Indexed: 01/23/2023]
Abstract
The biology of bacterial cells is, in general, based on information encoded on circular chromosomes. Regulation of chromosome replication is an essential process that mostly takes place at the origin of replication (oriC), a locus unique per chromosome. Identification of high numbers of oriC is a prerequisite for systematic studies that could lead to insights into oriC functioning as well as the identification of novel drug targets for antibiotic development. Current methods for identifying oriC sequences rely on chromosome-wide nucleotide disparities and are therefore limited to fully sequenced genomes, leaving a large number of genomic fragments unstudied. Here, we present gammaBOriS (Gammaproteobacterial oriC Searcher), which identifies oriC sequences on gammaproteobacterial chromosomal fragments. It does so by employing motif-based machine learning methods. Using gammaBOriS, we created BOriS DB, which currently contains 25,827 gammaproteobacterial oriC sequences from 1,217 species, thus making it the largest available database for oriC sequences to date. Furthermore, we present gammaBOriTax, a machine-learning based approach for taxonomic classification of oriC sequences, which was trained on the sequences in BOriS DB. Finally, we extracted the motifs relevant for identification and classification decisions of the models. Our results suggest that machine learning sequence classification approaches can offer great support in functional motif identification.
Affiliation(s)
- Theodor Sperlea
- Faculty of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032, Marburg, Lahn, Germany
| | - Lea Muth
- Faculty of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032, Marburg, Lahn, Germany
| | - Roman Martin
- Faculty of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032, Marburg, Lahn, Germany
| | - Christoph Weigel
- Institute of Biotechnology, Faculty III, Technische Universität Berlin (TUB), Straße des 17. Juni 135, D-10623, Berlin, Germany
| | - Torsten Waldminghaus
- Chromosome Biology Group, LOEWE Center for Synthetic Microbiology (SYNMIKRO), Philipps-Universität Marburg, D-35043, Marburg, Lahn, Germany
| | - Dominik Heider
- Faculty of Mathematics and Computer Science, University of Marburg, Hans-Meerwein-Str. 6, D-35032, Marburg, Lahn, Germany.
| |
|
46
|
Desai HP, Parameshwaran AP, Sunderraman R, Weeks M. Comparative Study Using Neural Networks for 16S Ribosomal Gene Classification. J Comput Biol 2020. [DOI: 10.1089/cmb.2019.0436] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/13/2022]
Affiliation(s)
- Heta P. Desai
- Department of Computer Science, Georgia State University, Atlanta, Georgia
| | | | | | - Michael Weeks
- Department of Computer Science, Georgia State University, Atlanta, Georgia
| |
|
47
|
Kumar H, Park W, Srikanth K, Choi BH, Cho ES, Lee KT, Kim JM, Kim K, Park J, Lim D, Park JE. Comparison of Bacterial Populations in the Ceca of Swine at Two Different Stages and their Functional Annotations. Genes (Basel) 2019; 10:E382. [PMID: 31137556 PMCID: PMC6562920 DOI: 10.3390/genes10050382] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Received: 03/18/2019] [Revised: 05/16/2019] [Accepted: 05/16/2019] [Indexed: 12/18/2022]
Abstract
The microbial composition in the cecum of pigs significantly influences host health, immunity, nutrient digestion, and feeding requirements. Advancements in metagenome sequencing technologies such as 16S rRNA sequencing have made it possible to explore the cecum microbial population. In this study, we performed a comparative analysis of the cecum microbiota of crossbred Korean native pigs at two different growth stages (stage L = 10 weeks, and stage LD = 26 weeks) using 16S rRNA sequencing technology. Our results revealed remarkable differences in microbial composition, α and β diversity, and differential abundance between the two stages. Phylum composition analysis with respect to the SILVA132 database showed Firmicutes to be present at 51.87% and 48.76% in stages L and LD, respectively. Similarly, Bacteroidetes were present at 37.28% and 45.98% in L and LD, respectively. The genera Prevotella, Anaerovibrio, Succinivibrio, and Megasphaera were differentially enriched in stage L, whereas Clostridium, Terrisporobacter, and Rikenellaceae were enriched in stage LD. Functional annotation of the microbiome by level-three KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis revealed that glycine, serine, threonine, valine, leucine, isoleucine, arginine, proline, and tryptophan metabolism were differentially enriched in stage L, whereas alanine, aspartate, glutamate, cysteine, methionine, phenylalanine, tyrosine, and tryptophan biosynthesis metabolism were differentially enriched in stage LD. Through machine-learning approaches such as LEfSe (linear discriminant analysis effect size), random forest, and Pearson's correlation, we found that pathways such as amino acid metabolism, transport systems, and genetic regulation of metabolism are commonly enriched in both stages. Our findings suggest that the bacterial compositions in the cecum content of pigs are heavily involved in their nutrient digestion process. This study may help to meet the demand for human food and can play a significant role in medicinal applications.
Affiliation(s)
- Himansu Kumar
- Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, RDA, Wanju 55365, Korea.
| | - Woncheol Park
- Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, RDA, Wanju 55365, Korea.
| | - Krishnamoorthy Srikanth
- Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, RDA, Wanju 55365, Korea.
| | - Bong-Hwan Choi
- Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, RDA, Wanju 55365, Korea.
| | - Eun-Seok Cho
- Swine Science Division, National Institute of Animal Science, RDA, Cheonan 31000, Korea.
| | - Kyung-Tai Lee
- Animal Genetics and Breeding Division, National Institute of Animal Science, RDA, Cheonan 31000, Korea.
| | - Jun-Mo Kim
- Department of Animal Science and Technology, Chung-Ang University, Anseong 17546, Korea.
| | | | | | - Dajeong Lim
- Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, RDA, Wanju 55365, Korea.
| | - Jong-Eun Park
- Division of Animal Genomics and Bioinformatics, National Institute of Animal Science, RDA, Wanju 55365, Korea.
| |
|
48
|
Qu K, Guo F, Liu X, Lin Y, Zou Q. Application of Machine Learning in Microbiology. Front Microbiol 2019; 10:827. [PMID: 31057526 PMCID: PMC6482238 DOI: 10.3389/fmicb.2019.00827] [Citation(s) in RCA: 89] [Impact Index Per Article: 17.8] [Received: 01/31/2019] [Accepted: 04/01/2019] [Indexed: 02/01/2023]
Abstract
Microorganisms are ubiquitous and closely related to people's daily lives. Since they were first discovered in the 19th century, researchers have shown great interest in microorganisms. People studied microorganisms through cultivation, but this method is expensive and time consuming, and it cannot keep pace with the development of high-throughput sequencing technology. To deal with this problem, machine learning (ML) methods have been widely applied to the field of microbiology. Literature reviews have shown that ML can be used in many aspects of microbiology research, especially classification problems, and for exploring the interaction between microorganisms and the surrounding environment. In this study, we summarize the application of ML in microbiology.
Affiliation(s)
- Kaiyang Qu
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Fei Guo
- College of Intelligence and Computing, Tianjin University, Tianjin, China
| | - Xiangrong Liu
- School of Information Science and Technology, Xiamen University, Xiamen, China
| | - Yuan Lin
- School of Information Science and Technology, Xiamen University, Xiamen, China
- Department of System Integration, Sparebanken Vest, Bergen, Norway
| | - Quan Zou
- Institute of Fundamental and Frontier Sciences, University of Electronic Science and Technology of China, Chengdu, China
- Center for Informational Biology, University of Electronic Science and Technology of China, Chengdu, China
| |
|
49
|
Di Gangi M, Lo Bosco G, Rizzo R. Deep learning architectures for prediction of nucleosome positioning from sequences data. BMC Bioinformatics 2018; 19:418. [PMID: 30453896 PMCID: PMC6245688 DOI: 10.1186/s12859-018-2386-9] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Indexed: 01/17/2023]
Abstract
Background Nucleosomes are DNA-histone complexes, each wrapping about 150 base pairs of double-stranded DNA. They are fundamental to one of the primary functions of chromatin, i.e., packing the DNA into the nucleus of eukaryotic cells. Several biological studies have shown that nucleosome positioning influences the regulation of cell type-specific gene activities. Moreover, computational studies have shown evidence of sequence specificity concerning the DNA fragment wrapped into nucleosomes, clearly underlined by the organization of particular DNA substrings. As the main consequence, the identification of nucleosomes on a genomic scale has been successfully performed by computational methods using a sequence feature representation. Results In this work, we propose a deep learning model for nucleosome identification. Our model stacks convolutional layers and Long Short-Term Memory layers to automatically extract features from short- and long-range dependencies in a sequence. Using this model, we are able to avoid the feature extraction and selection steps while improving classification performance. Conclusions Results computed on eleven data sets of five different organisms, from yeast to human, show the superiority of the proposed method with respect to the state of the art recently presented in the literature.
Affiliation(s)
- Mattia Di Gangi
- Fondazione Bruno Kessler, Via Sommarive, 18, Trento, 38123, Italy; ICT International Doctoral School, Via Sommarive, 9, Trento, 38123, Italy
| | - Giosuè Lo Bosco
- Dipartimento di Matematica e Informatica, Università degli studi di Palermo, Via Archirafi, 34, Palermo, 90123, Italy; Dipartimento di Scienze per l'Innovazione tecnologica, Istituto Euro-Mediterraneo di Scienza e Tecnologia, Via Michele Miraglia, 20, Palermo, 90139, Italy
| | - Riccardo Rizzo
- CNR-ICAR, National Research Council of Italy, Via Ugo La Malfa, 153, Palermo, 90146, Italy
| |
|