1
Peres da Silva R, Suphavilai C, Nagarajan N. MetageNN: a memory-efficient neural network taxonomic classifier robust to sequencing errors and missing genomes. BMC Bioinformatics 2024; 25:153. [PMID: 38627615 PMCID: PMC11022314 DOI: 10.1186/s12859-024-05760-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2022] [Accepted: 03/22/2024] [Indexed: 04/19/2024] Open
Abstract
BACKGROUND With the rapid increase in throughput of long-read sequencing technologies, recent studies have explored their potential for taxonomic classification by using alignment-based approaches to reduce the impact of higher sequencing error rates. While alignment-based methods are generally slower, k-mer-based taxonomic classifiers can overcome this limitation, potentially at the expense of lower sensitivity for strains and species that are not in the database. RESULTS We present MetageNN, a memory-efficient long-read taxonomic classifier that is robust to sequencing errors and missing genomes. MetageNN is a neural network model that uses short k-mer profiles of sequences to reduce the impact of distribution shifts on error-prone long reads. Benchmarking MetageNN against other machine learning approaches for taxonomic classification (GeNet) showed substantial improvements with long-read data (20% improvement in F1 score). By utilizing nanopore sequencing data, MetageNN exhibits improved sensitivity in situations where the reference database is incomplete. It surpasses the alignment-based MetaMaps and MEGAN-LR, as well as the k-mer-based Kraken2 tools, with improvements of 100%, 36%, and 23% respectively at the read-level analysis. Notably, at the community level, MetageNN consistently demonstrated higher sensitivities than the previously mentioned tools. Furthermore, MetageNN requires < 1/4th of the database storage used by Kraken2, MEGAN-LR and MMseqs2 and is > 7× faster than MetaMaps and GeNet and > 2× faster than MEGAN-LR and MMseqs2. CONCLUSION This proof of concept work demonstrates the utility of machine-learning-based methods for taxonomic classification using long reads. MetageNN can be used on sequences not classified by conventional methods and offers an alternative approach for memory-efficient classifiers that can be optimized further.
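The abstract above says that MetageNN feeds short k-mer profiles of sequences to a neural network. A minimal sketch of how such a profile could be computed is given here; the function name `kmer_profile`, the choice `k=3`, and the handling of ambiguous bases are illustrative assumptions, not details taken from the paper, and the neural network itself is not shown.

```python
from collections import Counter
from itertools import product


def kmer_profile(seq, k=3):
    """Return normalized k-mer frequencies in fixed lexicographic order.

    Hypothetical helper: k=3 is an illustrative value, not the one used
    by MetageNN. k-mers containing characters other than A, C, G, T
    (for example N) are ignored.
    """
    seq = seq.upper()
    # All 4**k possible k-mers, so every read maps to a vector of equal length.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts[m] for m in kmers)
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[m] / total for m in kmers]


if __name__ == "__main__":
    profile = kmer_profile("ACGTACGTTTGA", k=2)
    print(len(profile), round(sum(profile), 6))
```

Because the vector length depends only on k, reads of very different lengths can be passed to the same fixed-size classifier input.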
Affiliation(s)
- Rafael Peres da Silva
  - School of Computing, National University of Singapore, Singapore, 117417, Republic of Singapore
  - Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore
- Chayaporn Suphavilai
  - Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore
- Niranjan Nagarajan
  - School of Computing, National University of Singapore, Singapore, 117417, Republic of Singapore
  - Agency for Science, Technology and Research (A*STAR), Genome Institute of Singapore (GIS), Singapore, 138672, Republic of Singapore
  - Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 119228, Republic of Singapore
2
Rana S, Singh P, Bhardwaj T, Somvanshi P. A Comprehensive Metagenome Study Identifies Distinct Biological Pathways in Asthma Patients: An In-Silico Approach. Biochem Genet 2024:10.1007/s10528-023-10635-y. [PMID: 38285123 DOI: 10.1007/s10528-023-10635-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2023] [Accepted: 12/12/2023] [Indexed: 01/30/2024]
Abstract
Asthma is a multifactorial disease with several phenotypes and varied clinical and pathophysiological characteristics. Besides innate and adaptive immune responses, the gut microbiome generates Treg cells, mediating the allergic response to environmental factors and exposure to allergens. Because of the complexity of asthma, microbiome analysis and other precision medicine methods are now widely regarded as essential elements of efficient disease therapy. An in-silico pipeline enables the comparative taxonomic profiling of 16S rRNA metagenomic profiles of 20 asthmatic patients and 15 healthy controls utilizing QIIME2. Further, PICRUSt supports downstream gene enrichment and pathway analysis, inferring the enriched pathways in a diseased state. A significant abundance of the phylum Proteobacteria and the genera Sutterella and Megamonas is identified in asthma patients, along with a diminished abundance of the genus Akkermansia. Nasal samples reveal a high relative abundance of Mycoplasma. Further, differential functional profiling identifies the metabolic pathways related to cofactors and amino acids, secondary metabolism, and signaling pathways. These findings support the view that a combination of bacterial communities mediates the responses involved in chronic respiratory conditions like asthma by exerting their influence on various metabolic pathways.
Affiliation(s)
- Samiksha Rana
  - School of Computational & Integrative Sciences (SC&IS), Jawaharlal Nehru University, JNU Campus, New Delhi, 110067, India
- Pooja Singh
  - School of Computational & Integrative Sciences (SC&IS), Jawaharlal Nehru University, JNU Campus, New Delhi, 110067, India
- Tulika Bhardwaj
  - Department of Agricultural, Food and Nutritional Sciences, University of Alberta, Edmonton, AB, T6G 2P5, Canada
- Pallavi Somvanshi
  - School of Computational & Integrative Sciences (SC&IS), Jawaharlal Nehru University, JNU Campus, New Delhi, 110067, India
  - Special Centre of Systems Medicine (SCSM), Jawaharlal Nehru University, JNU Campus, New Delhi, 110067, India
3
Pérez-Rodríguez R, Domínguez-Domínguez O, Pedraza-Lara C, Rosas-Valdez R, Pérez-Ponce de León G, García-Andrade AB, Doadrio I. Multi-locus phylogeny of the catfish genus Ictalurus Rafinesque, 1820 (Actinopterygii, Siluriformes) and its systematic and evolutionary implications. BMC Ecol Evol 2023; 23:27. [PMID: 37370016 PMCID: PMC10304232 DOI: 10.1186/s12862-023-02134-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/29/2022] [Accepted: 06/12/2023] [Indexed: 06/29/2023] Open
Abstract
BACKGROUND Ictalurus is one of the most representative groups of North American freshwater fishes. Although this group has a well-studied fossil record and has been the subject of several morphological and molecular phylogenetic studies, incomplete taxonomic sampling and insufficient taxonomic studies have produced a rather complex classification, along with intricate patterns of evolutionary history in the genus that are considered unresolved and remain under debate. RESULTS Based on four loci and the most comprehensive taxonomic sampling analyzed to date, including currently recognized species, previously synonymized species, undescribed taxa, and poorly studied populations, this study produced a resolved phylogenetic framework that provided plausible species delimitation and an evolutionary time framework for the genus Ictalurus. CONCLUSIONS Our phylogenetic hypothesis revealed that Ictalurus comprises at least 13 evolutionary units, partially corroborating the current classification and identifying populations that emerge as putative undescribed taxa. The divergence times of the species indicate that the diversification of Ictalurus dates to the early Oligocene, confirming its status as one of the oldest genera within the family Ictaluridae.
Affiliation(s)
- Rodolfo Pérez-Rodríguez
  - Laboratorio de Biología Acuática, Facultad de Biología, Universidad Michoacana de San Nicolás de Hidalgo, Ciudad Universitaria, Morelia, 58000, Michoacán, México
- Omar Domínguez-Domínguez
  - Laboratorio de Biología Acuática, Facultad de Biología, Universidad Michoacana de San Nicolás de Hidalgo, Ciudad Universitaria, Morelia, 58000, Michoacán, México
- Carlos Pedraza-Lara
  - Forensic Science, Medicine School, National Autonomous University of Mexico, Circuito de la investigación científica s/n, Ciudad Universitaria, Coyoacan, 04510, CdMx, Mexico
- Rogelio Rosas-Valdez
  - Laboratorio de Colecciones Biológicas y Sistemática Molecular, Unidad Académica de Ciencias Biológicas, Universidad Autónoma de Zacatecas, Av. Preparatoria S/N, Campus Universitario II, Col. Agronómica, Zacatecas, C. P. 98066, México
- Gerardo Pérez-Ponce de León
  - Instituto de Biología, UNAM, Circuito exterior s/n, Ciudad Universitaria, Coyoacán, C.P. 04510, D.F, México
  - Escuela Nacional de Estudios Superiores Unidad Mérida, Universidad Nacional Autónoma de México, Km 4.5 Carretera Mérida-Tetiz, Ucú, Yucatán, México
- Ana Berenice García-Andrade
  - Laboratorio de Biología Acuática, Facultad de Biología, Universidad Michoacana de San Nicolás de Hidalgo, Ciudad Universitaria, Morelia, 58000, Michoacán, México
  - Laboratorio de Macroecología Evolutiva, Red de Biología Evolutiva, Instituto de Ecología, A.C. Carretera antigua a Coatepec 351, El Haya, Xalapa, 91070, Veracruz, México
- Ignacio Doadrio
  - Departamento de Biodiversidad y Biología Evolutiva, Museo Nacional de Ciencias Naturales, CSIC, c/José Gutiérrez Abascal 2, Madrid, E-28006, España
4
Burks DJ, Pusadkar V, Azad RK. POSMM: an efficient alignment-free metagenomic profiler that complements alignment-based profiling. Environ Microbiome 2023; 18:16. [PMID: 36890583 PMCID: PMC9993663 DOI: 10.1186/s40793-023-00476-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/14/2022] [Accepted: 02/25/2023] [Indexed: 06/18/2023]
Abstract
We present here POSMM (pronounced 'Possum'), the Python-Optimized Standard Markov Model classifier, which is a new incarnation of the Markov model approach to metagenomic sequence analysis. Built on top of SMM, a rapid Markov-model-based classification algorithm, POSMM reintroduces the high sensitivity associated with alignment-free taxonomic classifiers to probe whole genome or metagenome datasets of increasingly prohibitive sizes. Logistic regression models, generated and optimized using the Python sklearn library, transform Markov model probabilities into scores suitable for thresholding. Featuring a dynamic database-free approach, models are generated directly from genome fasta files per run, making POSMM a valuable accompaniment to many other programs. By combining POSMM with ultrafast classifiers such as Kraken2, their complementary strengths can be leveraged to produce higher overall accuracy in metagenomic sequence classification than either achieves as a standalone classifier. POSMM is a user-friendly and highly adaptable tool designed for broad use by the metagenome scientific community.
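The Markov model approach mentioned in this abstract can be illustrated with a small sketch: a fixed-order Markov model is trained on a genome, and a read is scored by its log-likelihood under that model. The function names, the order of 2, and the pseudocount are hypothetical choices for illustration and do not reflect POSMM's actual implementation; the logistic regression step that turns such scores into thresholdable values is not shown.

```python
import math
from collections import Counter, defaultdict


def train_markov(genome, order=2, pseudocount=1.0):
    """Count (context -> next base) transitions for a fixed-order model."""
    counts = defaultdict(Counter)
    for i in range(len(genome) - order):
        counts[genome[i:i + order]][genome[i + order]] += 1
    return {"order": order, "pseudocount": pseudocount, "counts": counts}


def log_likelihood(model, read):
    """Sum of log transition probabilities of the read under the model.

    Uses additive smoothing over a 4-letter DNA alphabet, so unseen
    contexts fall back to a uniform probability of 1/4.
    """
    order = model["order"]
    pc = model["pseudocount"]
    total = 0.0
    for i in range(len(read) - order):
        ctx_counts = model["counts"].get(read[i:i + order], {})
        num = ctx_counts.get(read[i + order], 0) + pc
        den = sum(ctx_counts.values()) + 4 * pc
        total += math.log(num / den)
    return total


if __name__ == "__main__":
    models = {"genome_A": train_markov("AT" * 100),
              "genome_B": train_markov("GC" * 100)}
    read = "ATATATAT"
    # The read is assigned to the genome whose model scores it highest.
    print(max(models, key=lambda name: log_likelihood(models[name], read)))
```

Because the models are built directly from the input sequences at run time, no precomputed database is required, which matches the database-free design described in the abstract.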
Affiliation(s)
- David J Burks
  - Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
- Vaidehi Pusadkar
  - Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
- Rajeev K Azad
  - Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
  - Department of Mathematics, University of North Texas, Denton, TX, 76203, USA
5
Haftom Baraki Abraha, Jae-Won Lee, Gayeong Kim, Mokhammad Khoiron Ferdiansyah, Rathnayaka Mudiyanselage Ramesha, Kwang-Pyo Kim. Genomic diversity and comprehensive taxonomical classification of 61 Bacillus subtilis group member infecting bacteriophages, and the identification of ortholog taxonomic signature genes. BMC Genomics 2022; 23:835. [PMID: 36526963 DOI: 10.1186/s12864-022-09055-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/08/2022] [Accepted: 11/28/2022] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Despite the applications of Bacillus subtilis group species in various sectors, limited information is available regarding their phages. Here, 61 B. subtilis group species-infecting phages (BSPs) were studied for their taxonomic classification considering the genome size, genomic diversity, and the host, followed by the identification of ortholog taxonomic signature genes. RESULTS BSPs have widely ranging genome sizes that can be grouped to demonstrate correlations with family and subfamily classifications. Comparative analysis re-confirmed the existing 14 genera and 21 species that contain BSPs and displayed inter-genera similarities within existing subfamilies. Importantly, it also revealed the need for the creation of new taxonomic classifications, including 28 species, nine genera, and two subfamilies (New subfamily1 and New subfamily2) to accommodate inter-genera relatedness. Following pangenome analysis, no ortholog shared by all BSPs was identified, while orthologs, namely, the tail fibers/spike proteins and poly-gamma-glutamate hydrolase, that are shared by more than two-thirds of the BSPs were identified. More importantly, major capsid protein (MCP) type I, MCP type II, MCP type III and peptidoglycan binding proteins, which are distinctive orthologs for Herelleviridae, Salasmaviridae, New subfamily1, and New subfamily2, respectively, were identified and analyzed; these could serve as signatures to distinguish BSP members of the respective taxon. CONCLUSIONS In this study, we show the genomic diversity and propose a comprehensive classification of 61 BSPs, including the proposition for the creation of two new subfamilies, followed by the identification of ortholog taxonomic signature genes, potentially contributing to phage taxonomy.
6
Furstenau TN, Schneider T, Shaffer I, Vazquez AJ, Sahl J, Fofanov V. MTSv: rapid alignment-based taxonomic classification and high-confidence metagenomic analysis. PeerJ 2022; 10:e14292. [PMID: 36389404 PMCID: PMC9651046 DOI: 10.7717/peerj.14292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/29/2022] [Accepted: 10/03/2022] [Indexed: 11/11/2022] Open
Abstract
As the sizes of reference sequence databases and high-throughput sequencing datasets continue to grow, it is becoming computationally infeasible to use traditional alignment to large genome databases for taxonomic classification of metagenomic reads. Exact matching approaches can rapidly assign taxonomy and summarize the composition of microbial communities, but they sacrifice accuracy and can lead to false positives. Full alignment tools provide higher confidence assignments and can assign sequences from genomes that diverge from reference sequences; however, full alignment tools are computationally intensive. To address this, we designed MTSv specifically for alignment-based taxonomic assignment in metagenomic analysis. This tool implements an FM-index assisted q-gram filter and SIMD accelerated Smith-Waterman algorithm to find alignments. However, unlike traditional aligners, MTSv will not attempt to make additional alignments to a TaxID once an alignment of sufficient quality has been found. This improves efficiency when many reference sequences are available per taxon. MTSv was designed to be flexible and can be modified to run on either memory- or processor-constrained systems. Although MTSv cannot compete with the speeds of exact k-mer matching approaches, it is reasonably fast and has higher precision than popular exact matching approaches. Because MTSv performs a full alignment, it can classify reads even when the genomes share low similarity with reference sequences, and it provides a tool for high-confidence pathogen detection with low off-target assignments to near-neighbor species.
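The early-termination idea described in this abstract (no further alignments to a TaxID once one alignment of sufficient quality is found) can be sketched as follows. This is a plain-Python illustration under stated assumptions: the scoring parameters, the linear gap penalty, and the names `smith_waterman_score` and `assign_taxids` are hypothetical, and the FM-index assisted q-gram filter and SIMD acceleration used by MTSv are not reproduced.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best local alignment score between a and b (linear gap penalty)."""
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        cur = [0]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            score = max(0, diag, prev[j] + gap, cur[j - 1] + gap)
            cur.append(score)
            best = max(best, score)
        prev = cur
    return best


def assign_taxids(read, references, min_score):
    """Return the TaxIDs with at least one sufficiently good alignment.

    `references` maps a TaxID to a list of reference sequences
    (an assumed layout chosen for this example).
    """
    hits = set()
    for taxid, seqs in references.items():
        for seq in seqs:
            if smith_waterman_score(read, seq) >= min_score:
                hits.add(taxid)
                break  # skip the remaining references for this TaxID
    return hits


if __name__ == "__main__":
    refs = {1: ["TTTTACGTACGTTTTT", "ACGTACGT"], 2: ["GGGGGGGG"]}
    print(assign_taxids("ACGTACGT", refs, min_score=16))
```

In the usage example, the second reference for TaxID 1 is never aligned because the first already passes the threshold, which is where the savings come from when a taxon has many reference sequences.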
Affiliation(s)
- Tara N. Furstenau
  - School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, Arizona, United States
- Tsosie Schneider
  - School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, Arizona, United States
- Isaac Shaffer
  - School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, Arizona, United States
- Adam J. Vazquez
  - Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, Arizona, United States
- Jason Sahl
  - Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, Arizona, United States
- Viacheslav Fofanov
  - School of Informatics, Computing, and Cyber Systems, Northern Arizona University, Flagstaff, Arizona, United States
  - Pathogen and Microbiome Institute, Northern Arizona University, Flagstaff, Arizona, United States
7
Yang C, Chowdhury D, Zhang Z, Cheung WK, Lu A, Bian Z, Zhang L. A review of computational tools for generating metagenome-assembled genomes from metagenomic sequencing data. Comput Struct Biotechnol J 2021; 19:6301-6314. [PMID: 34900140 PMCID: PMC8640167 DOI: 10.1016/j.csbj.2021.11.028] [Citation(s) in RCA: 60] [Impact Index Per Article: 20.0] [Received: 08/30/2021] [Revised: 11/17/2021] [Accepted: 11/17/2021] [Indexed: 12/16/2022] Open
Abstract
Metagenomic sequencing provides a culture-independent avenue to investigate complex microbial communities by constructing metagenome-assembled genomes (MAGs). A MAG represents a microbial genome by a group of sequences from genome assembly with similar characteristics. It enables us to identify novel species and understand their potential functions in a dynamic ecosystem. Many computational tools have been developed to construct and annotate MAGs from metagenomic sequencing; however, there is a prominent gap in comprehensively introducing their background and practical performance. In this paper, we have thoroughly investigated the computational tools designed for both upstream and downstream analyses, including metagenome assembly, metagenome binning, gene prediction, functional annotation, taxonomic classification, and profiling. We have categorized the commonly used tools into unique groups based on their functional background and introduced the underlying core algorithms and associated information to demonstrate a comparative outlook. Furthermore, we have emphasized the computational requirements and offered guidance to users for selecting the most efficient tools. Finally, we have indicated current limitations, potential solutions, and future perspectives for further improving the tools of MAG construction and annotation. We believe that our work provides a consolidated resource for the current stage of MAG studies and sheds light on the future development of more effective MAG analysis tools for metagenomic sequencing.
Key Words
- CNN, convolutional neural network
- DBG, De Bruijn graph
- GTDB, Genome Taxonomy Database
- Gene functional annotation
- Gene prediction
- Genome assembly
- HMM, Hidden Markov Model
- KEGG, Kyoto Encyclopedia of Genes and Genomes
- LCA, lowest common ancestor
- LPA, label propagation algorithm
- MAGs, metagenome-assembled genomes
- Metagenome binning
- Metagenome-assembled genomes
- Metagenomic sequencing
- Microbial abundance profiling
- OLC, overlap-layout consensus
- ONT, Oxford Nanopore Technologies
- ORFs, open reading frames
- PacBio, Pacific Biosciences
- QC, quality control
- SLR, synthetic long reads
- TNFs, tetranucleotide frequencies
- Taxonomic classification
Affiliation(s)
- Chao Yang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Debajyoti Chowdhury
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhenmiao Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- William K. Cheung
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Aiping Lu
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Institute of Integrated Bioinformedicine and Translational Sciences, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Zhaoxiang Bian
  - Institute of Brain and Gut Research, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Chinese Medicine Clinical Study Center, School of Chinese Medicine, Hong Kong Baptist University, Hong Kong Special Administrative Region
- Lu Zhang
  - Department of Computer Science, Hong Kong Baptist University, Hong Kong Special Administrative Region
  - Computational Medicine Lab, Hong Kong Baptist University, Hong Kong Special Administrative Region
8
Abstract
Recovering and annotating bacterial genomes from metagenomes involves a series of complex computational tools that are often difficult to use for researchers without a specialist bioinformatics background. In this chapter we review all the steps that lead from raw reads to a collection of quality-controlled, functionally annotated bacterial genomes and propose a working protocol using state-of-the-art, open source software tools.
Affiliation(s)
- Davide Albanese
  - Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
- Claudio Donati
  - Unit of Computational Biology, Research and Innovation Centre, Fondazione Edmund Mach, San Michele all'Adige, Italy
9
Eun Kang J, Ciampi A, Hijri M. SeSaMe: Metagenome Sequence Classification of Arbuscular Mycorrhizal Fungi-associated Microorganisms. Genomics Proteomics Bioinformatics 2020; 18:601-12. [PMID: 33346086 DOI: 10.1016/j.gpb.2018.07.010] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 10/03/2017] [Revised: 06/21/2018] [Accepted: 07/24/2018] [Indexed: 01/22/2023]
Abstract
Arbuscular mycorrhizal fungi (AMF) are plant root symbionts that play key roles in plant growth and soil fertility. They are obligate biotrophic fungi that form coenocytic multinucleated hyphae and spores. Numerous studies have shown that diverse microorganisms live on the surface of and inside their mycelia, resulting in a metagenome when whole-genome sequencing (WGS) data are obtained from sequencing AMF cultivated in vivo. The metagenome contains not only the AMF sequences, but also those from associated microorganisms. In this study, we introduce a novel bioinformatics program, Spore-associated Symbiotic Microbes (SeSaMe), designed for taxonomic classification of short sequences obtained by next-generation DNA sequencing. A genus-specific usage bias database was created based on amino acid usage and codon usage of a three consecutive codon DNA 9-mer encoding an amino acid trimer in a protein secondary structure. The program distinguishes between coding sequence (CDS) and non-CDS, and classifies a query sequence into a genus group out of 54 genera used as reference. The mean percentages of correct predictions of the CDS and the non-CDS test sets at the genus level were 71% and 50% for bacteria, 68% and 73% for fungi (excluding AMF), and 49% and 72% for AMF (Rhizophagus irregularis), respectively. SeSaMe provides not only a means for estimating taxonomic diversity and abundance but also the gene reservoir of the reference taxonomic groups associated with AMF. Therefore, it enables users to study the symbiotic roles of associated microorganisms. It can also be applicable to other microorganisms as well as soil metagenomes. SeSaMe is freely available at www.fungalsesame.org.
10
Bui VK, Wei C. CDKAM: a taxonomic classification tool using discriminative k-mers and approximate matching strategies. BMC Bioinformatics 2020; 21:468. [PMID: 33081690 PMCID: PMC7576720 DOI: 10.1186/s12859-020-03777-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 06/19/2020] [Accepted: 09/23/2020] [Indexed: 12/25/2022] Open
Abstract
Background Current taxonomic classification tools use exact string matching algorithms that are effective to tackle the data from the next generation sequencing technology. However, the unique error patterns in the third generation sequencing (TGS) technologies could reduce the accuracy of these programs. Results We developed a Classification tool using Discriminative K-mers and Approximate Matching algorithm (CDKAM). This approximate matching method was used for searching k-mers, which included two phases, a quick mapping phase and a dynamic programming phase. Simulated datasets as well as real TGS datasets have been tested to compare the performance of CDKAM with existing methods. We showed that CDKAM performed better in many aspects, especially when classifying TGS data with average length 1000–1500 bases.
Conclusions CDKAM is an effective program with higher accuracy and lower memory requirement for TGS metagenome sequence classification. It produces a high species-level accuracy.
Affiliation(s)
- Van-Kien Bui
  - Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
- Chaochun Wei
  - Department of Bioinformatics and Biostatistics, School of Life Sciences and Biotechnology, Shanghai Jiao Tong University, Shanghai, 200240, China
  - Shanghai Center for Systems Biomedicine, Shanghai Jiao Tong University, Shanghai, 200240, China
11
Shang J, Sun Y. CHEER: HierarCHical taxonomic classification for viral mEtagEnomic data via deep leaRning. Methods 2020; 189:95-103. [PMID: 32454212 PMCID: PMC7255349 DOI: 10.1016/j.ymeth.2020.05.018] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 02/03/2020] [Revised: 05/05/2020] [Accepted: 05/17/2020] [Indexed: 02/07/2023] Open
Abstract
The fast accumulation of viral metagenomic data has contributed significantly to new RNA virus discovery. However, the short read size, complex composition, and large data size can all make taxonomic analysis difficult. In particular, commonly used alignment-based methods are not ideal choices for detecting new viral species. In this work, we present a novel hierarchical classification model named CHEER, which can conduct read-level taxonomic classification from order to genus for new species. By combining k-mer embedding-based encoding, hierarchically organized CNNs, and a carefully trained rejection layer, CHEER is able to assign correct taxonomic labels for reads from new species. We tested CHEER on both simulated and real sequencing data. The results show that CHEER can achieve higher accuracy than popular alignment-based and alignment-free taxonomic assignment tools. The source code, scripts, and pre-trained parameters for CHEER are available via GitHub: https://github.com/KennthShang/CHEER.
Affiliation(s)
- Jiayu Shang
  - Electrical Engineering Dept., City University of Hong Kong, Kowloon, Hong Kong Special Administrative Region
- Yanni Sun
  - Electrical Engineering Dept., City University of Hong Kong, Kowloon, Hong Kong Special Administrative Region
12
Jamnikar-Ciglenecki U, Civnik V, Kirbis A, Kuhar U. A molecular survey, whole genome sequencing and phylogenetic analysis of astroviruses from roe deer. BMC Vet Res 2020; 16:68. [PMID: 32085761 PMCID: PMC7035776 DOI: 10.1186/s12917-020-02289-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 11/13/2019] [Accepted: 02/17/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although astroviruses (AstV) have been detected in a variety of host species, there are only limited records of their occurrence in deer. Roe deer is one of the most important game species in Europe because of its meat and antlers. Infected game animals can pose a threat to the health of other animals and of humans, so more attention needs to be focused on understanding the diversity of viruses in wildlife. The complete genome of roe deer AstV and its organization have not yet been described. RESULTS In our study, 111 game animals were screened for the presence of AstV. While no AstVs were detected in red deer, wild boar, chamois and mouflon, AstV RNA was present in three samples of roe deer. These samples were further subjected to whole genome sequencing with next generation sequencing. Two AstV genomes were assembled, one in sample D5-14 and one in sample D12-14, while no AstV sequences were identified in sample D45-14. The complete coding sequences of the AstV SLO/D5-14 strain genome and the almost complete genome of the AstV SLO/D12-14 strain were determined. They showed a typical Mamastrovirus organization. Phylogenetic analyses and amino acid pairwise distance analysis revealed that Slovenian roe deer AstV strains are closely related to each other and also related to other deer, bovine, water buffalo, yak, Sichuan takin, dromedary, porcine and porcupine AstV strains, thus forming a highly supported group of currently unassigned sequences. CONCLUSIONS Our findings suggest that a new Mamastrovirus genogroup might be constituted, as the aforementioned group is only distantly related to Mamastrovirus genogroups I and II. In this study, additional data supporting a novel taxonomic classification are presented.
Affiliation(s)
- Urska Jamnikar-Ciglenecki
  - Institute of Food safety, Feed and Environment, University of Ljubljana, Veterinary faculty, Gerbičeva 60, 1000, Ljubljana, Slovenia
- Vita Civnik
  - Institute of Food safety, Feed and Environment, University of Ljubljana, Veterinary faculty, Gerbičeva 60, 1000, Ljubljana, Slovenia
- Andrej Kirbis
  - Institute of Food safety, Feed and Environment, University of Ljubljana, Veterinary faculty, Gerbičeva 60, 1000, Ljubljana, Slovenia
- Urska Kuhar
  - Institute of Microbiology and Parasitology, University of Ljubljana, Veterinary faculty, Gerbičeva 60, 1000, Ljubljana, Slovenia
13
Abstract
Gut microbial composition has shown to be associated with obesity, diabetes mellitus, inflammatory bowel disease, colitis, autoimmune disorders, and cancer, among other diseases. Microbiome research has significantly evolved through the years and continues to advance as we develop new and better strategies to more accurately measure its composition and function. Careful selection of study design, inclusion and exclusion criteria of participants, and methodology are paramount to accurately analyze microbial structure. Here we present the most up-to-date available information on methods for gut microbial collection and analysis.
Affiliation(s)
- Elisa Morales
  - Robbins College of Health and Human Sciences, Baylor University, Waco, TX, USA
- Jun Chen
  - Division of Biomedical Statistics and Informatics, Department of Health Sciences Research, Mayo Clinic, Rochester, Minnesota, USA
- K Leigh Greathouse
  - Robbins College of Health and Human Sciences, Baylor University, Waco, TX, USA
14
Randhawa GS, Hill KA, Kari L. ML-DSP: Machine Learning with Digital Signal Processing for ultrafast, accurate, and scalable genome classification at all taxonomic levels. BMC Genomics 2019; 20:267. [PMID: 30943897 PMCID: PMC6448311 DOI: 10.1186/s12864-019-5571-y] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 09/04/2018] [Accepted: 02/27/2019] [Indexed: 11/11/2022] Open
Abstract
Background Although software tools abound for the comparison, analysis, identification, and classification of genomic sequences, taxonomic classification remains challenging due to the magnitude of the datasets and the intrinsic problems associated with classification. The need exists for an approach and software tool that addresses the limitations of existing alignment-based methods, as well as the challenges of recently proposed alignment-free methods. Results We propose a novel combination of supervised Machine Learning with Digital Signal Processing, resulting in ML-DSP: an alignment-free software tool for ultrafast, accurate, and scalable genome classification at all taxonomic levels. We test ML-DSP by classifying 7396 full mitochondrial genomes at various taxonomic levels, from kingdom to genus, with an average classification accuracy of >97%. A quantitative comparison with state-of-the-art classification software tools is performed, on two small benchmark datasets and one large 4322 vertebrate mtDNA genomes dataset. Our results show that ML-DSP overwhelmingly outperforms the alignment-based software MEGA7 (alignment with MUSCLE or CLUSTALW) in terms of processing time, while having comparable classification accuracies for small datasets and superior accuracies for the large dataset. Compared with the alignment-free software FFP (Feature Frequency Profile), ML-DSP has significantly better classification accuracy, and is overall faster. We also provide preliminary experiments indicating the potential of ML-DSP to be used for other datasets, by classifying 4271 complete dengue virus genomes into subtypes with 100% accuracy, and 4,710 bacterial genomes into phyla with 95.5% accuracy. Lastly, our analysis shows that the “Purine/Pyrimidine”, “Just-A” and “Real” numerical representations of DNA sequences outperform ten other such numerical representations used in the Digital Signal Processing literature for DNA classification purposes. 
Conclusions Due to its superior classification accuracy, speed, and scalability to large datasets, ML-DSP is highly relevant in the classification of newly discovered organisms, in distinguishing genomic signatures and identifying their mechanistic determinants, and in evaluating genome integrity.
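The signal-processing idea in this abstract can be illustrated with a minimal sketch: encode a DNA string as numbers, take the magnitude spectrum of its discrete Fourier transform, and compare two spectra with a correlation-based distance. The numeric values in the two encodings (purines -1 and pyrimidines +1; adenine 1 and everything else 0), the Pearson-based distance, and all function names are assumptions made for illustration and are not confirmed by the abstract. The supervised learning step is omitted.

```python
import cmath
import math

# Hypothetical encodings; the exact numeric values are assumptions.
PP = {"A": -1.0, "G": -1.0, "C": 1.0, "T": 1.0}     # purine vs. pyrimidine
JUST_A = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}   # marks adenine only


def encode(seq, table):
    """Map a DNA string to a list of numbers using the given table."""
    return [table[base] for base in seq.upper()]


def magnitude_spectrum(signal):
    """Magnitudes of the naive O(n^2) discrete Fourier transform."""
    n = len(signal)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(signal)))
        for k in range(n)
    ]


def pearson_distance(u, v):
    """(1 - r) / 2, where r is the Pearson correlation of u and v."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return (1 - cov / (su * sv)) / 2


# Two toy sequences of equal length that differ in the last base.
spec_a = magnitude_spectrum(encode("ACGTACGTAAGG", PP))
spec_b = magnitude_spectrum(encode("ACGTACGTAAGC", PP))
print(pearson_distance(spec_a, spec_a))
print(pearson_distance(spec_a, spec_b))
```

A pairwise distance matrix built this way could then serve as the feature input to any standard classifier; real sequences would also need length normalization before their spectra can be compared.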
Affiliation(s)
- Gurjit S Randhawa
  - Department of Computer Science, University of Western Ontario, London, ON, Canada
- Kathleen A Hill
  - Department of Biology, University of Western Ontario, London, ON, Canada
- Lila Kari
  - School of Computer Science, University of Waterloo, Waterloo, ON, Canada
15
Abstract
The field of palaeomicrobiology (the study of ancient microorganisms) is rapidly growing due to recent methodological and technological advancements. It is now possible to obtain vast quantities of DNA data from ancient specimens in a high-throughput manner and use this information to investigate the dynamics and evolution of past microbial communities. However, we still know very little about how the characteristics of ancient DNA influence our ability to accurately assign microbial taxonomies (i.e. identify species) within ancient metagenomic samples. Here, we use both simulated and published metagenomic data sets to investigate how ancient DNA characteristics affect alignment-based taxonomic classification. We find that nucleotide-to-nucleotide, rather than nucleotide-to-protein, alignments are preferable when assigning taxonomies to the short DNA fragments routinely identified within ancient specimens (<60 bp). We determine that deamination (a form of ancient DNA damage) and random sequence substitutions corresponding to ∼100,000 years of genomic divergence minimally impact alignment-based classification. We also test four different reference databases and find that database choice can significantly bias the results of alignment-based taxonomic classification in ancient metagenomic studies. Finally, we perform a reanalysis of previously published ancient dental calculus data, increasing the number of microbial DNA sequences assigned taxonomically by an average of 64.2-fold and identifying microbial species not identified in the original study. Overall, this study enhances our understanding of how ancient DNA characteristics influence alignment-based taxonomic classification of ancient microorganisms and provides recommendations for future palaeomicrobiological studies.
Affiliation(s)
- Raphael Eisenhofer
  - Australian Centre for Ancient DNA, University of Adelaide, Adelaide, SA, Australia; Centre of Excellence for Australia Biodiversity and Heritage, University of Adelaide, Adelaide, SA, Australia
- Laura Susan Weyrich
  - Australian Centre for Ancient DNA, University of Adelaide, Adelaide, SA, Australia; Centre of Excellence for Australia Biodiversity and Heritage, University of Adelaide, Adelaide, SA, Australia
16
Abstract
In recent decades, the accumulation of data on 16S ribosomal RNA genes has yielded free and public databases such as SILVA, GreenGenes, The Ribosomal Database Project, and IMG, handling massive amounts of raw data and meta information. The 16S rRNA gene contains hypervariable regions with great classification power. As a result, numerous classification tools have emerged, including state-of-the-art tools such as Mothur, Qiime, and the 16S classifier. However, there is a gap between the sequence databases, the taxonomy profiling tools and available meta information such as geo/body-location information. Here, we present BioAtlas, an interactive web tool for searching, exploring, and analyzing prokaryotic distributions by integrating various metagenomics database resources. In the following section we show how to use BioAtlas to (1) search and explore prokaryote occurrences across the geospatial map of the world, (2) investigate and hunt for occurrences across generic user-generated surface-specific maps, with an example map of a human female, with data from Bouslimani et al., and (3) classify a user-supplied sequence dataset through our online platform for visual exploration of the spatial abundances of the identified microbes.
17
Banos S, Lentendu G, Kopf A, Wubet T, Glöckner FO, Reich M. A comprehensive fungi-specific 18S rRNA gene sequence primer toolkit suited for diverse research issues and sequencing platforms. BMC Microbiol 2018; 18:190. [PMID: 30458701 PMCID: PMC6247509 DOI: 10.1186/s12866-018-1331-4] [Citation(s) in RCA: 43] [Impact Index Per Article: 7.2] [Received: 07/17/2018] [Accepted: 10/30/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Several fungi-specific primers target the 18S rRNA gene sequence, one of the prominent markers for fungal classification. The design of most primers goes back to the last decades. Since then, the number of sequences in public databases increased leading to the discovery of new fungal groups and changes in fungal taxonomy. However, no reevaluation of primers was carried out and relevant information on most primers is missing. With this study, we aimed to develop an 18S rRNA gene sequence primer toolkit allowing an easy selection of the best primer pair appropriate for different sequencing platforms, research aims (biodiversity assessment versus isolate classification) and target groups. RESULTS We performed an intensive literature research, reshuffled existing primers into new pairs, designed new Illumina-primers, and annealing blocking oligonucleotides. A final number of 439 primer pairs were subjected to in silico PCRs. Best primer pairs were selected and experimentally tested. The most promising primer pair with a small amplicon size, nu-SSU-1333-5'/nu-SSU-1647-3' (FF390/FR-1), was successful in describing fungal communities by Illumina sequencing. Results were confirmed by a simultaneous metagenomics and eukaryote-specific primer approach. Co-amplification occurred in all sample types but was effectively reduced by blocking oligonucleotides. CONCLUSIONS The compiled data revealed the presence of an enormous diversity of fungal 18S rRNA gene primer pairs in terms of fungal coverage, phylum spectrum and co-amplification. Therefore, the primer pair has to be carefully selected to fulfill the requirements of the individual research projects. The presented primer toolkit offers comprehensive lists of 164 primers, 439 primer combinations, 4 blocking oligonucleotides, and top primer pairs holding all relevant information including primer's characteristics and performance to facilitate primer pair selection.
Affiliation(s)
- Stefanos Banos
  - Molecular Ecology, Institute of Ecology, FB02, University of Bremen, Leobener Str. 2, 28359, Bremen, Germany
- Guillaume Lentendu
  - Department of Soil Ecology, Helmholtz Centre for Environmental Research GmbH - UFZ, Halle-Saale, Germany; Department of Ecology, University of Kaiserslautern, Kaiserslautern, Germany
- Anna Kopf
  - Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Tesfaye Wubet
  - Department of Soil Ecology, Helmholtz Centre for Environmental Research GmbH - UFZ, Halle-Saale, Germany; Present address: Department of Community Ecology, Helmholtz Centre for Environmental Research GmbH - UFZ, Halle-Saale, Germany; German Centre for Integrative Biodiversity Research (iDiv) Halle-Jena-Leipzig, Leipzig, Germany
- Frank Oliver Glöckner
  - Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany; Department of Life Sciences and Chemistry, Jacobs University Bremen gGmbH, Bremen, Germany
- Marlis Reich
  - Molecular Ecology, Institute of Ecology, FB02, University of Bremen, Leobener Str. 2, 28359, Bremen, Germany
18
Nasko DJ, Koren S, Phillippy AM, Treangen TJ. RefSeq database growth influences the accuracy of k-mer-based lowest common ancestor species identification. Genome Biol 2018; 19:165. [PMID: 30373669 PMCID: PMC6206640 DOI: 10.1186/s13059-018-1554-6] [Citation(s) in RCA: 68] [Impact Index Per Article: 11.3] [Received: 05/02/2018] [Accepted: 10/01/2018] [Indexed: 12/05/2022] Open
Abstract
In order to determine the role of the database in taxonomic sequence classification, we examine the influence of the database over time on k-mer-based lowest common ancestor taxonomic classification. We present three major findings: the number of new species added to the NCBI RefSeq database greatly outpaces the number of new genera; as a result, more reads are classified with newer database versions, but fewer are classified at the species level; and Bayesian-based re-estimation mitigates this effect but struggles with novel genomes. These results suggest a need for new classification approaches specially adapted for large databases.
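The second finding (more reads classified, but fewer at the species level) follows from how k-mer-based lowest common ancestor (LCA) classification works, and a toy sketch makes the mechanism visible. Everything here is hypothetical: the taxon names, the sequences, and the simplification of assigning a read to the LCA of all its k-mer hits. Real tools use more elaborate scoring.

```python
# Toy taxonomy: child -> parent (hypothetical names).
PARENT = {
    "species_A": "genus_X",
    "species_B": "genus_X",
    "genus_X": "root",
}


def lineage(taxon):
    """Path from a taxon up to the root, inclusive."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lca(taxa):
    """Lowest common ancestor of a non-empty collection of taxa."""
    taxa = list(taxa)
    common = set(lineage(taxa[0]))
    for t in taxa[1:]:
        common &= set(lineage(t))
    # Walking up one lineage, the first shared node is the lowest one.
    return next(node for node in lineage(taxa[0]) if node in common)


def build_index(genomes, k):
    """Map each k-mer to the LCA of all genomes that contain it."""
    owners = {}
    for taxon, seq in genomes.items():
        for i in range(len(seq) - k + 1):
            owners.setdefault(seq[i:i + k], set()).add(taxon)
    return {kmer: lca(taxa) for kmer, taxa in owners.items()}


def classify(read, index, k):
    """LCA of the labels of all indexed k-mers in the read, or None."""
    hits = [index[read[i:i + k]] for i in range(len(read) - k + 1)
            if read[i:i + k] in index]
    return lca(hits) if hits else None


old_db = {"species_A": "ACGTACGGTC"}
new_db = {"species_A": "ACGTACGGTC", "species_B": "ACGTACGGTA"}
read = "GTACGG"
print(classify(read, build_index(old_db, 4), 4))  # species-level call
print(classify(read, build_index(new_db, 4), 4))  # pushed up to the genus
```

With only `species_A` in the database every k-mer in the read is unique to that species. Once the closely related `species_B` is added, the same k-mers are shared, so their label (and the read's) moves up to the genus.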
Affiliation(s)
- Daniel J Nasko
  - Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
- Sergey Koren
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, Bethesda, MD, USA
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, USA
19
Yu J, Blom J, Glaeser SP, Jaenicke S, Juhre T, Rupp O, Schwengers O, Spänig S, Goesmann A. A review of bioinformatics platforms for comparative genomics. Recent developments of the EDGAR 2.0 platform and its utility for taxonomic and phylogenetic studies. J Biotechnol 2017; 261:2-9. [PMID: 28705636 DOI: 10.1016/j.jbiotec.2017.07.010] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Received: 03/16/2017] [Revised: 07/06/2017] [Accepted: 07/07/2017] [Indexed: 12/12/2022]
Abstract
The rapid development of next generation sequencing technology has greatly increased the amount of available microbial genomes. As a result of this development, there is a rising demand for fast and automated approaches in analyzing these genomes in a comparative way. Whole genome sequencing also bears a huge potential for obtaining a higher resolution in phylogenetic and taxonomic classification. During the last decade, several software tools and platforms have been developed in the field of comparative genomics. In this manuscript, we review the most commonly used platforms and approaches for ortholog group analyses with a focus on their potential for phylogenetic and taxonomic research. Furthermore, we describe the latest improvements of the EDGAR platform for comparative genome analyses and present recent examples of its application for the phylogenomic analysis of different taxa. Finally, we illustrate the role of the EDGAR platform as part of the BiGi Center for Microbial Bioinformatics within the German network on Bioinformatics Infrastructure (de.NBI).
Affiliation(s)
- J Yu
  - Int. Research Training Group 1906 (DiDy), Bielefeld University, Bielefeld, 33501, Germany; Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- J Blom
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- S P Glaeser
  - Institute of Applied Microbiology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- S Jaenicke
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- T Juhre
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- O Rupp
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- O Schwengers
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- S Spänig
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
- A Goesmann
  - Bioinformatics and Systems Biology, Justus-Liebig-University Giessen, Giessen, 35392, Germany
20
Gao X, Lin H, Revanna K, Dong Q. A Bayesian taxonomic classification method for 16S rRNA gene sequences with improved species-level accuracy. BMC Bioinformatics 2017; 18:247. [PMID: 28486927 PMCID: PMC5424349 DOI: 10.1186/s12859-017-1670-4] [Citation(s) in RCA: 93] [Impact Index Per Article: 13.3] [Received: 01/18/2017] [Accepted: 05/03/2017] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND Species-level classification for 16S rRNA gene sequences remains a serious challenge for microbiome researchers, because existing taxonomic classification tools for 16S rRNA gene sequences either do not provide species-level classification, or their classification results are unreliable. The unreliable results are due to the limitations in the existing methods which either lack solid probabilistic-based criteria to evaluate the confidence of their taxonomic assignments, or use nucleotide k-mer frequency as the proxy for sequence similarity measurement. RESULTS We have developed a method that shows significantly improved species-level classification results over existing methods. Our method calculates true sequence similarity between query sequences and database hits using pairwise sequence alignment. Taxonomic classifications are assigned from the species to the phylum levels based on the lowest common ancestors of multiple database hits for each query sequence, and further classification reliabilities are evaluated by bootstrap confidence scores. The novelty of our method is that the contribution of each database hit to the taxonomic assignment of the query sequence is weighted by a Bayesian posterior probability based upon the degree of sequence similarity of the database hit to the query sequence. Our method does not need any training datasets specific for different taxonomic groups. Instead only a reference database is required for aligning to the query sequences, making our method easily applicable for different regions of the 16S rRNA gene or other phylogenetic marker genes. CONCLUSIONS Reliable species-level classification for 16S rRNA or other phylogenetic marker genes is critical for microbiome research. 
Our software shows significantly higher classification accuracy than the existing tools and we provide probabilistic-based confidence scores to evaluate the reliability of our taxonomic classification assignments based on multiple database matches to query sequences. Despite its higher computational costs, our method is still suitable for analyzing large-scale microbiome datasets for practical purposes. Furthermore, our method can be applied for taxonomic classification of any phylogenetic marker gene sequences. Our software, called BLCA, is freely available at https://github.com/qunfengdong/BLCA.
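The weighting idea described in this abstract can be sketched as follows: each database hit contributes to the assignment in proportion to a posterior weight derived from its similarity to the query, and the support for a taxon at each rank is the sum of the weights of the hits that carry it. This is only an illustration under stated assumptions. The uniform prior, the placeholder likelihood `exp(beta * identity)`, the value of `beta`, the three ranks, and all names are hypothetical, and the bootstrap confidence step mentioned in the abstract is omitted.

```python
import math
from collections import defaultdict

RANKS = ["phylum", "genus", "species"]


def posterior_weights(identities, beta=50.0):
    """Normalised hit weights under a uniform prior.

    The likelihood exp(beta * identity) is a placeholder assumption,
    not the likelihood used by any particular tool.
    """
    top = max(identities)
    raw = [math.exp(beta * (x - top)) for x in identities]  # stable softmax
    total = sum(raw)
    return [r / total for r in raw]


def weighted_assignment(hits):
    """For each rank, return the best-supported taxon and its weight share."""
    weights = posterior_weights([h["identity"] for h in hits])
    result = {}
    for rank in RANKS:
        support = defaultdict(float)
        for hit, w in zip(hits, weights):
            support[hit["lineage"][rank]] += w
        taxon = max(support, key=support.get)
        result[rank] = (taxon, support[taxon])
    return result


# Hypothetical alignment hits for one query sequence.
hits = [
    {"identity": 0.99, "lineage": {"phylum": "P1", "genus": "G1", "species": "S1"}},
    {"identity": 0.98, "lineage": {"phylum": "P1", "genus": "G1", "species": "S2"}},
    {"identity": 0.90, "lineage": {"phylum": "P1", "genus": "G2", "species": "S3"}},
]
for rank, (taxon, score) in weighted_assignment(hits).items():
    print(rank, taxon, round(score, 3))
```

In this toy case the two near-identical hits agree at genus level but disagree at species level, so support is high for the genus and noticeably lower for the species, which is the kind of rank-dependent confidence the abstract describes.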
Affiliation(s)
- Xiang Gao
  - Department of Public Health Sciences, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA
- Huaiying Lin
  - Department of Public Health Sciences, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA; Center for Biomedical Informatics, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA
- Kashi Revanna
  - Department of Public Health Sciences, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA; Center for Biomedical Informatics, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA
- Qunfeng Dong
  - Department of Public Health Sciences, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA; Center for Biomedical Informatics, Loyola University Chicago Health Sciences Division, Maywood, IL, 60153, USA; Bioinformatics Program, Loyola University Chicago Lake Shore Campus, Chicago, IL, 60660, USA; Department of Computer Science, Loyola University Chicago Water Tower Campus, Chicago, IL, 60611, USA
21
Abstract
Background A key step in microbiome sequencing analysis is read assignment to taxonomic units. This is often performed using one of four taxonomic classifications, namely SILVA, RDP, Greengenes or NCBI. It is unclear how similar these are and how to compare analysis results that are based on different taxonomies. Results We provide a method and software for mapping taxonomic entities from one taxonomy onto another. We use it to compare the four taxonomies and the Open Tree of Life Taxonomy (OTT). Conclusions While we find that SILVA, RDP and Greengenes map well into NCBI, and all four map well into the OTT, mapping the two larger taxonomies onto the smaller ones is problematic.
Affiliation(s)
- Monika Balvočiūtė
  - University of Tübingen, Department of Computer Science, Sand 14, Tübingen, 72076, Germany
- Daniel H Huson
  - University of Tübingen, Department of Computer Science, Sand 14, Tübingen, 72076, Germany
22
Gregor I, Dröge J, Schirmer M, Quince C, McHardy AC. PhyloPythiaS+: a self-training method for the rapid reconstruction of low-ranking taxonomic bins from metagenomes. PeerJ 2016; 4:e1603. [PMID: 26870609 PMCID: PMC4748697 DOI: 10.7717/peerj.1603] [Citation(s) in RCA: 67] [Impact Index Per Article: 8.4] [Received: 10/14/2015] [Accepted: 12/24/2015] [Indexed: 12/21/2022] Open
Abstract
Background. Metagenomics is an approach for characterizing environmental microbial communities in situ; it allows their functional and taxonomic characterization and the recovery of sequences from uncultured taxa. This is often achieved by a combination of sequence assembly and binning, where sequences are grouped into ‘bins’ representing taxa of the underlying microbial community. Assignment to low-ranking taxonomic bins is an important challenge for binning methods, as is scalability to Gb-sized datasets generated with deep sequencing techniques. One of the best available methods for species bins recovery from deep-branching phyla is the expert-trained PhyloPythiaS package, where a human expert decides on the taxa to incorporate in the model and identifies ‘training’ sequences based on marker genes directly from the sample. Due to the manual effort involved, this approach does not scale to multiple metagenome samples and requires substantial expertise, which researchers who are new to the area do not have. Results. We have developed PhyloPythiaS+, a successor to our PhyloPythia(S) software. The new (+) component performs the work previously done by the human expert. PhyloPythiaS+ also includes a new k-mer counting algorithm, which accelerated the simultaneous counting of 4–6-mers used for taxonomic binning 100-fold and reduced the overall execution time of the software by a factor of three. Our software makes it possible to analyze Gb-sized metagenomes with inexpensive hardware, and to recover species or genera-level bins with low error rates in a fully automated fashion. PhyloPythiaS+ was compared to MEGAN, taxator-tk, Kraken and the generic PhyloPythiaS model. The results showed that PhyloPythiaS+ performs especially well for samples originating from novel environments in comparison to the other methods. Availability. PhyloPythiaS+ in a virtual machine is available for installation under Windows, Unix systems or OS X on: https://github.com/algbioi/ppsp/wiki.
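To make "simultaneous counting of 4–6-mers" concrete, here is a minimal sketch that collects counts for several k-mer lengths in a single pass over a sequence. It is a plain illustration and not the accelerated algorithm the abstract refers to. The function name, the default lengths, and the choice to skip windows containing ambiguous bases are assumptions.

```python
from collections import Counter


def count_kmers(seq, k_min=4, k_max=6):
    """Count all k-mers for k_min <= k <= k_max in one pass over seq.

    Windows containing characters other than A, C, G, T are skipped.
    """
    seq = seq.upper()
    counts = {k: Counter() for k in range(k_min, k_max + 1)}
    valid_run = 0  # length of the current run of unambiguous bases
    for end, base in enumerate(seq, start=1):
        valid_run = valid_run + 1 if base in "ACGT" else 0
        for k in range(k_min, k_max + 1):
            # A k-mer ending here is valid only if the run covers it fully.
            if valid_run >= k:
                counts[k][seq[end - k:end]] += 1
    return counts


profile = count_kmers("ACGTACGTNACGTAC")
print(profile[4].most_common(2))
print(sum(profile[6].values()))
```

The resulting per-length counters could be normalized into frequency vectors and concatenated into one composition profile per sequence, which is the usual input for composition-based binning.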
Affiliation(s)
- Ivan Gregor
  - Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Johannes Dröge
  - Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
- Melanie Schirmer
  - The Broad Institute of MIT and Harvard, Cambridge, MA, United States
- Alice C McHardy
  - Max-Planck Research Group for Computational Genomics and Epidemiology, Max-Planck Institute for Informatics, Saarbrücken, Germany; Department of Algorithmic Bioinformatics, Heinrich-Heine-University Düsseldorf, Düsseldorf, Germany; Computational Biology of Infection Research, Helmholtz Center for Infection Research, Braunschweig, Germany
23
Valenzuela-González F, Martínez-Porchas M, Villalpando-Canchola E, Vargas-Albores F. Studying long 16S rDNA sequences with ultrafast-metagenomic sequence classification using exact alignments (Kraken). J Microbiol Methods 2016; 122:38-42. [PMID: 26812576 DOI: 10.1016/j.mimet.2016.01.011] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 11/16/2015] [Revised: 01/22/2016] [Accepted: 01/22/2016] [Indexed: 10/22/2022]
Abstract
Ultrafast-metagenomic sequence classification using exact alignments (Kraken) is a novel approach to classify 16S rDNA sequences. The classifier is based on mapping short sequences to the lowest ancestor and performing alignments to form subtrees with specific weights in each taxon node. This study aimed to evaluate the classification performance of Kraken with long 16S rDNA random environmental sequences produced by cloning and then Sanger sequenced. A total of 480 clones were isolated and expanded, and 264 of these clones formed contigs (1352 ± 153 bp). The same sequences were analyzed using the Ribosomal Database Project (RDP) classifier. Deeper classification performance was achieved by Kraken than by the RDP: 73% of the contigs were classified up to the species or variety levels, whereas 67% of these contigs were classified no further than the genus level by the RDP. The results also demonstrated that unassembled sequences analyzed by Kraken provide similar or even deeper information. Moreover, sequences that did not form contigs, which are usually discarded by other programs, provided meaningful information when analyzed by Kraken. Finally, it appears that the assembly step for Sanger sequences can be eliminated when using Kraken. Kraken accumulates the information of both sequence senses, providing additional elements for the classification. In conclusion, the results demonstrate that Kraken is an excellent choice for the taxonomic assignment of sequences obtained by Sanger sequencing or by third generation sequencing, the main goal of which is to generate longer sequences.
Affiliation(s)
- Marcel Martínez-Porchas
  - Centro de Investigación en Alimentación y Desarrollo, A. C. Km 0.6 Carretera a La Victoria, Hermosillo, Sonora, Mexico
- Enrique Villalpando-Canchola
  - Centro de Investigación en Alimentación y Desarrollo, A. C. Km 0.6 Carretera a La Victoria, Hermosillo, Sonora, Mexico
- Francisco Vargas-Albores
  - Centro de Investigación en Alimentación y Desarrollo, A. C. Km 0.6 Carretera a La Victoria, Hermosillo, Sonora, Mexico
24
Assunção A, Costa MC, Carlier JD. Application of urea-agarose gel electrophoresis to select non-redundant 16S rRNAs for taxonomic studies: palladium(II) removal bacteria. Appl Microbiol Biotechnol 2015; 100:2721-35. [PMID: 26590590 DOI: 10.1007/s00253-015-7163-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/04/2015] [Revised: 10/30/2015] [Accepted: 11/07/2015] [Indexed: 11/26/2022]
Abstract
The 16S ribosomal RNA (rRNA) gene has been the most commonly used sequence to characterize bacterial communities. The classical approach to obtaining gene sequences to study bacterial diversity involves cloning amplicons, selecting clones, and Sanger sequencing the cloned fragments. A more recent approach is direct sequencing of millions of genes using massively parallel technologies, allowing a large-scale biodiversity analysis of many samples simultaneously. However, this technique is currently still expensive when applied to few samples; therefore, the classical approach is still used. Recently, we found a community able to remove 50 mg/L Pd(II). In this work, aiming to identify the bacteria potentially involved in Pd(II) removal, the separation of urea/heat-denatured DNA fragments by urea-agarose gel electrophoresis was applied for the first time to select 16S rRNA-cloned amplicons for taxonomic studies. The major rise in the percentage of bacteria belonging to genus Clostridium sensu stricto from undetected to 21 and 41 %, respectively, for cultures without, with 5 and 50 mg/L Pd(II) accompanying Pd(II) removal points to this taxon as a potential key agent for the bio-recovery of this metal. Although sulfate-reducing bacteria were not detected, the hypothesis of Pd(II) removal by activity of these bacteria cannot be ruled out because a slight decrease in the sulfate concentration of the medium was verified and the formation of PbS precipitates seems to occur. This work also contributes knowledge about suitable partial 16S rRNA gene regions for taxonomic studies and shows that unidirectional sequencing is enough when Sanger sequencing cloned 16S rRNA genes for taxonomic studies to genus level.
Affiliation(s)
- Ana Assunção
  - Centro de Ciências do Mar (CCMAR), Universidade do Algarve, Campus de Gambelas, 8005-139, Faro, Portugal
- Maria Clara Costa
  - Centro de Ciências do Mar (CCMAR), Universidade do Algarve, Campus de Gambelas, 8005-139, Faro, Portugal
  - Faculdade de Ciências e Tecnologia, Universidade do Algarve, Campus de Gambelas, 8005-139, Faro, Portugal
- Jorge Dias Carlier
  - Centro de Ciências do Mar (CCMAR), Universidade do Algarve, Campus de Gambelas, 8005-139, Faro, Portugal
25
Derakhshani H, Tun HM, Khafipour E. An extended single-index multiplexed 16S rRNA sequencing for microbial community analysis on MiSeq illumina platforms. J Basic Microbiol 2015; 56:321-6. [PMID: 26426811 DOI: 10.1002/jobm.201500420] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.9] [Received: 07/04/2015] [Accepted: 09/19/2015] [Indexed: 12/26/2022]
Abstract
The primary 16S rRNA sequencing protocol for microbial community analysis using Illumina platforms includes a single-indexing approach that allows pooling of hundreds of samples in each sequencing run. The protocol targets the V4 hypervariable region (HVR) of 16S rRNA using 150 bp paired-end (PE) sequencing. However, the latest improvement in Illumina chemistry has increased the read length up to 600 bp using 300 bp PE sequencing. To take advantage of the longer read length, a dual-indexing approach was previously developed for targeting different HVRs. However, due to its simple working protocols, the single-index 150 bp PE approach continues to be attractive to many researchers. Here, we describe an extended single-indexing protocol for 300 bp PE Illumina sequencing that targets the V3-V4 HVRs of 16S rRNA. The new primer set led to increased read length and alignment resolution, as well as increased richness and diversity of the resulting microbial profile compared to that obtained from the 150 bp PE protocol for V4 sequencing. The β-diversity profile also differed qualitatively and quantitatively between the two approaches. Both primer sets had high coverage rates and specificity to detect dominant phyla; however, their coverage rate with regard to the rare biosphere varied. Our data further confirm that the choice of primer is the most deterministic factor in sequencing coverage and specificity.
Affiliation(s)
- Hooman Derakhshani
- Department of Animal Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Hein Min Tun
- Department of Animal Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Ehsan Khafipour
- Department of Animal Science, University of Manitoba, Winnipeg, Manitoba, Canada
- Department of Medical Microbiology, University of Manitoba, Winnipeg, Manitoba, Canada
|
26
|
Zakrzewski M, Bekel T, Ander C, Pühler A, Rupp O, Stoye J, Schlüter A, Goesmann A. MetaSAMS--a novel software platform for taxonomic classification, functional annotation and comparative analysis of metagenome datasets. J Biotechnol 2012; 167:156-65. [PMID: 23026555 DOI: 10.1016/j.jbiotec.2012.09.013] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.3] [Received: 04/19/2012] [Revised: 07/21/2012] [Accepted: 09/21/2012] [Indexed: 12/01/2022]
Abstract
Metagenomics aims at exploring microbial communities concerning their composition and functioning. Application of high-throughput sequencing technologies for the analysis of environmental DNA-preparations can generate large sets of metagenome sequence data which have to be analyzed by means of bioinformatics tools to unveil the taxonomic composition of the analyzed community as well as the repertoire of genes and gene functions. A bioinformatics software platform is required that allows the automated taxonomic and functional analysis and interpretation of metagenome datasets without manual effort. To address current demands in metagenome data analyses, the novel platform MetaSAMS was developed. MetaSAMS automatically accomplishes the tasks necessary for analyzing the composition and functional repertoire of a given microbial community from metagenome sequence data by implementing two software pipelines: (i) the first pipeline consists of three different classifiers performing the taxonomic profiling of metagenome sequences and (ii) the second functional pipeline accomplishes region predictions on assembled contigs and assigns functional information to predicted coding sequences. Moreover, MetaSAMS provides tools for statistical and comparative analyses based on the taxonomic and functional annotations. The capabilities of MetaSAMS are demonstrated for two metagenome datasets obtained from a biogas-producing microbial community of a production-scale biogas plant. The MetaSAMS web interface is available at https://metasams.cebitec.uni-bielefeld.de.
Affiliation(s)
- Martha Zakrzewski
- Institute for Bioinformatics-IfB, Center for Biotechnology-CeBiTec, Bielefeld University, Bielefeld, Germany
|