1
Meglécz E. mkLTG: a command-line tool for taxonomic assignment of metabarcoding sequences using variable identity thresholds. Biol Futur 2023; 74:369-375. [PMID: 38300415 DOI: 10.1007/s42977-024-00201-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Accepted: 01/04/2024] [Indexed: 02/02/2024]
Abstract
Metabarcoding is now widely used in biodiversity studies, and taxonomic assignment of environmental sequences is one of its key steps. Assignments based on the lowest common ancestor (LCA) method generally rely on fixed, arbitrary thresholds, which are poorly suited to taxonomically diverse groups with variable coverage in reference databases. mkLTG is an LCA-based method that uses a series of percentage-of-identity thresholds, starting with stringent parameters and relaxing them if necessary. All parameters can be set separately for each percentage-of-identity threshold, which makes the tool adaptable to different databases, genetic markers and taxonomic groups. The optimization was performed using the COI marker and a comprehensive, non-redundant database. mkLTG is a command-line application with few dependencies that runs on all operating systems, so it is easy to integrate into complex pipelines. All scripts, including the benchmarking, are freely available at https://github.com/meglecz/mkLTG .
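The idea of an LCA assignment with a series of decreasing identity thresholds can be illustrated with a minimal Python sketch. This is not the mkLTG code or its interface: the function names, the hit format (percent identity plus a root-to-species lineage), and the threshold and minimum-hit values are all hypothetical, chosen only to show the control flow.

```python
def lowest_common_ancestor(lineages):
    """Return the longest prefix shared by all lineages (root first)."""
    shared = []
    for names_at_rank in zip(*lineages):
        if len(set(names_at_rank)) != 1:
            break
        shared.append(names_at_rank[0])
    return tuple(shared)


def assign_with_variable_thresholds(hits, steps):
    """Try each (min_identity, min_hits) step in order, most stringent first.

    hits  : list of (percent_identity, lineage) pairs (hypothetical format)
    steps : list of (min_identity, min_hits) pairs (illustrative parameters)
    Returns (min_identity_used, lca_lineage), or None if no step succeeds.
    """
    for min_identity, min_hits in steps:
        kept = [lineage for identity, lineage in hits if identity >= min_identity]
        if len(kept) >= min_hits:
            return min_identity, lowest_common_ancestor(kept)
    return None


# Illustrative data only: three reference hits for one query sequence.
example_hits = [
    (98.5, ("Animalia", "Arthropoda", "Insecta", "Diptera")),
    (97.2, ("Animalia", "Arthropoda", "Insecta", "Coleoptera")),
    (91.0, ("Animalia", "Mollusca", "Gastropoda", "Stylommatophora")),
]
example_steps = [(100.0, 1), (97.0, 1), (90.0, 2)]
result = assign_with_variable_thresholds(example_hits, example_steps)
```

In this example no hit reaches 100% identity, so the sketch falls back to the 97% step, where the two insect hits agree down to class level. Because each step carries its own minimum number of hits, the other parameters can also vary per threshold.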
Affiliation(s)
- Emese Meglécz
- IMBE, CNRS, IRD, Aix Marseille University, Avignon University, Marseille, France.
2
Mugnai F, Costantini F, Chenuil A, Leduc M, Gutiérrez Ortega JM, Meglécz E. Be positive: customized reference databases and new, local barcodes balance false taxonomic assignments in metabarcoding studies. PeerJ 2023; 11:e14616. [PMID: 36643652 PMCID: PMC9835706 DOI: 10.7717/peerj.14616] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 07/08/2022] [Accepted: 12/01/2022] [Indexed: 01/11/2023]
Abstract
Background: In metabarcoding analyses, the taxonomic assignment is crucial to place sequencing data in biological and ecological contexts. This fundamental step depends on a reference database, which should have a good taxonomic coverage to avoid unassigned sequences. However, this goal is rarely achieved in many geographic regions and for several taxonomic groups. On the other hand, more is not necessarily better, as sequences in reference databases belonging to taxonomic groups out of the studied region/environment context might lead to false assignments.
Methods: We investigated the effect of using several subsets of a cytochrome c oxidase subunit I (COI) reference database on taxonomic assignment. Published metabarcoding sequences from the Mediterranean Sea were assigned to taxa using COInr, which is a comprehensive, non-redundant and recent database of COI sequences obtained both from BOLD and NCBI, and two of its subsets: (i) all sequences except insects (COInr-WO-Insecta), which represent the overwhelming majority of COInr database, but are irrelevant for marine samples, and (ii) all sequences from taxonomic families present in the Mediterranean Sea (COInr-Med). Four different algorithms for taxonomic assignment were employed in parallel to evaluate differences in their output and data consistency.
Results: The reduction of the database to more specific custom subsets increased the number of unassigned sequences. Nevertheless, since most of them were incorrectly assigned by the less specific databases, this is a positive outcome. Moreover, the taxonomic resolution (the lowest taxonomic level to which a sequence is attributed) of several sequences tended to increase when using customized databases. These findings clearly indicated the need for customized databases adapted to each study. However, the very high proportion of unassigned sequences points to the need to enrich the local database with new barcodes specifically obtained from the studied region and/or taxonomic group. Including novel local barcodes to the COI database proved to be very profitable: by adding only 116 new barcodes sequenced in our laboratory, thus increasing the reference database by only 0.04%, we were able to improve the resolution for ca. 0.6-1% of the Amplicon Sequence Variants (ASVs).
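The two kinds of database subset described in the Methods (excluding one large irrelevant group, or keeping only families known from the study region) amount to simple filters on the lineage of each reference record. The Python sketch below is only an illustration under stated assumptions: the record layout, the function name `subset_reference`, and the example records are hypothetical and do not reflect the COInr file format or the authors' scripts.

```python
def subset_reference(records, exclude_classes=frozenset(), keep_families=None):
    """Return the reference records that pass both taxonomic filters.

    records         : list of dicts with an "id" and a "lineage" dict
                      (hypothetical layout, rank name -> taxon name)
    exclude_classes : classes to drop entirely (e.g. {"Insecta"})
    keep_families   : if given, keep only records whose family is in this set
    """
    kept = []
    for record in records:
        lineage = record["lineage"]
        if lineage.get("class") in exclude_classes:
            continue
        if keep_families is not None and lineage.get("family") not in keep_families:
            continue
        kept.append(record)
    return kept


# Illustrative records only; the identifiers are made up.
example_records = [
    {"id": "ref1", "lineage": {"class": "Insecta", "family": "Culicidae"}},
    {"id": "ref2", "lineage": {"class": "Anthozoa", "family": "Coralliidae"}},
    {"id": "ref3", "lineage": {"class": "Gastropoda", "family": "Helicidae"}},
]

# Subset analogous to "everything except insects".
without_insects = subset_reference(example_records, exclude_classes={"Insecta"})
# Subset analogous to "families present in the study region" (assumed list).
regional = subset_reference(example_records, keep_families={"Coralliidae"})
```

A smaller subset trades coverage for specificity, which is the balance the abstract describes: fewer sequences are assigned, but fewer are assigned to taxa that cannot occur in the sampled environment.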
Affiliation(s)
- Francesco Mugnai
- Department of Biological, Geological and Environmental Sciences (BiGeA), University of Bologna, Ravenna, Italy
- Federica Costantini
- Department of Biological, Geological and Environmental Sciences (BiGeA), University of Bologna, Ravenna, Italy
- Consorzio Nazionale Interuniversitario per le Scienze del Mare (CoNISMa), Roma, Italy
- Anne Chenuil
- Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France
- Emese Meglécz
- Aix Marseille Univ, Avignon Université, CNRS, IRD, IMBE, Marseille, France
3
Garfias-Gallegos D, Zirión-Martínez C, Bustos-Díaz ED, Arellano-Fernández TV, Lovaco-Flores JA, Espinosa-Jaime A, Avelar-Rivas JA, Sélem-Mójica N. Metagenomics Bioinformatic Pipeline. Methods Mol Biol 2022; 2512:153-179. [PMID: 35818005 DOI: 10.1007/978-1-0716-2429-6_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
The taxonomic and functional diversity of microbial communities has been broadly studied since sequencing technologies made data acquisition faster and cheaper. Nevertheless, the programming skills required and the amount of software available may be overwhelming to someone trying to analyze these data. Here, we present a comprehensive and straightforward pipeline that takes shotgun metagenomics data through the steps needed to obtain valuable results. The raw data go through quality control, metagenomic assembly, binning (the recovery of single genomes from a metagenome), taxonomic assignment, and taxonomic diversity analysis and visualization.
Affiliation(s)
- Edder D Bustos-Díaz
- Laboratorio de Evolución de la Diversidad Metabólica, Langebio, Cinvestav, Mexico
- Tania Vanessa Arellano-Fernández
- Laboratorio de Sistemas Genéticos, Langebio, Cinvestav, Mexico
- Escuela Nacional de Estudios Superiores, Unidad León, UNAM, León, Mexico
- José Abel Lovaco-Flores
- Escuela Nacional de Estudios Superiores, Unidad León, UNAM, León, Mexico
- BetterLab-C3, Irapuato, Mexico
- Nelly Sélem-Mójica
- BetterLab-C3, Irapuato, Mexico.
- Centro de Ciencias Matemáticas, UNAM, Morelia, Mexico.
4
Dacey DP, Chain FJJ. Concatenation of paired-end reads improves taxonomic classification of amplicons for profiling microbial communities. BMC Bioinformatics 2021; 22:493. [PMID: 34641782 PMCID: PMC8507205 DOI: 10.1186/s12859-021-04410-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 07/14/2021] [Accepted: 09/29/2021] [Indexed: 01/04/2023]
Abstract
Background: Taxonomic classification of genetic markers for microbiome analysis is affected by the numerous choices made from sample preparation to bioinformatics analysis. Paired-end read merging is routinely used to capture the entire amplicon sequence when the read ends overlap. However, the exclusion of unmerged reads from further analysis can result in underestimating the diversity in the sequenced microbial community and is influenced by bioinformatic processes such as read trimming and the choice of reference database. A potential solution to overcome this is to concatenate (join) reads that do not overlap and keep them for taxonomic classification. The use of concatenated reads can outperform taxonomic recovery from single-end reads, but it remains unclear how their performance compares to merged reads. Using various sequenced mock communities with different amplicons, read length, read depth, taxonomic composition, and sequence quality, we tested how merging and concatenating reads performed for genus recall and precision in bioinformatic pipelines combining different parameters for read trimming and taxonomic classification using different reference databases.
Results: The addition of concatenated reads to merged reads always increased pipeline performance. The top two performing pipelines both included read concatenation, with variable strengths depending on the mock community. The pipeline that combined merged and concatenated reads that were quality-trimmed performed best for mock communities with larger amplicons and higher average quality sequences. The pipeline that used length-trimmed concatenated reads outperformed quality trimming in mock communities with lower quality sequences but lost a significant amount of input sequences for taxonomic classification during processing. Genus level classification was more accurate using the SILVA reference database compared to Greengenes.
Conclusions: Merged sequences with the addition of concatenated sequences that were unable to be merged increased performance of taxonomic classifications. This was especially beneficial in mock communities with larger amplicons. We have shown for the first time, using an in-depth comparison of pipelines containing merged vs concatenated reads combined with different trimming parameters and reference databases, the potential advantages of concatenating sequences in improving resolution in microbiome investigations.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12859-021-04410-2.
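Two operations named in this abstract can be sketched in a few lines of Python: joining a read pair that does not overlap, and scoring genus-level precision and recall against a mock community. Both functions are illustrative assumptions, not the authors' pipeline. In particular, reverse-complementing the reverse read and the optional spacer of N bases are choices made for this sketch, and the precision and recall formulas are the standard set-based definitions.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Return the reverse complement of a DNA sequence."""
    return sequence.translate(_COMPLEMENT)[::-1]


def concatenate_pair(forward, reverse, spacer=""):
    """Join a non-overlapping read pair into one sequence.

    The reverse read is reverse-complemented so both halves share the
    forward orientation; the spacer (e.g. a run of N) is an assumption.
    """
    return forward + spacer + reverse_complement(reverse)


def genus_precision_recall(observed_genera, expected_genera):
    """Set-based precision and recall of detected genera in a mock community."""
    observed, expected = set(observed_genera), set(expected_genera)
    true_positives = len(observed & expected)
    precision = true_positives / len(observed) if observed else 0.0
    recall = true_positives / len(expected) if expected else 0.0
    return precision, recall


joined = concatenate_pair("ACGT", "AACG", spacer="NN")
precision, recall = genus_precision_recall(
    ["Bacillus", "Listeria", "Vibrio"],
    ["Bacillus", "Listeria", "Escherichia", "Salmonella"],
)
```

Here two of the three observed genera are expected (precision 2/3) and two of the four expected genera are recovered (recall 0.5).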
Affiliation(s)
- Daniel P Dacey
- Department of Biological Sciences, University of Massachusetts Lowell, Lowell, MA, USA.
- Frédéric J J Chain
- Department of Biological Sciences, University of Massachusetts Lowell, Lowell, MA, USA
5
Mukherjee C, Leys EJ. Strain-Level Profiling of Oral Microbiota with Targeted Sequencing. Methods Mol Biol 2021; 2327:239-52. [PMID: 34410649 DOI: 10.1007/978-1-0716-1518-8_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0]
Abstract
Targeted sequencing of one or more regions of the bacterial 16S rRNA gene has emerged as a gold standard for investigating taxonomic diversity in complex microbial communities, such as those found in the oral cavity. While this approach is useful for identifying bacteria down to genus level, its ability to distinguish between many closely related oral species, or to explore strain-level variations within each species, is very limited. Here we present an approach based on targeted sequencing of the 16S-23S Intergenic Spacer Region (ISR) in the bacterial ribosomal operon for taxonomic characterization of microbial communities at a subspecies or strain level. This approach retains the advantages of 16S-based methods, such as easy library preparation, high throughput, short amplicon sizes, and low cost of sequencing, while providing subspecies-level resolution as a result of the naturally higher genetic diversity present in the ISR compared to the 16S hypervariable regions. These advantages make it an excellent tool for high-resolution oral microbiota characterization.
6
Catlett D, Son K, Liang C. ensembleTax: an R package for determinations of ensemble taxonomic assignments of phylogenetically-informative marker gene sequences. PeerJ 2021; 9:e11865. [PMID: 34395092 PMCID: PMC8320524 DOI: 10.7717/peerj.11865] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/31/2021] [Accepted: 07/05/2021] [Indexed: 11/20/2022]
Abstract
BACKGROUND: High-throughput sequencing of phylogenetically informative marker genes is a widely used method to assess the diversity and composition of microbial communities. Taxonomic assignment of sampled marker gene sequences (referred to as amplicon sequence variants, or ASVs) imparts ecological significance to these genetic data. To assign taxonomy to an ASV, a taxonomic assignment algorithm compares the ASV to a collection of reference sequences (a reference database) with known taxonomic affiliations. However, many taxonomic assignment algorithms and reference databases are available, and the optimal algorithm and database for a particular scientific question is often unclear. Here, we present the ensembleTax R package, which provides an efficient framework for integrating taxonomic assignments predicted with any number of taxonomic assignment algorithms and reference databases to determine ensemble taxonomic assignments for ASVs.
METHODS: The ensembleTax R package relies on two core algorithms: taxmapper and assign.ensembleTax. The taxmapper algorithm maps taxonomic assignments derived from one reference database onto the taxonomic nomenclature (a set of taxonomic naming and ranking conventions) of another reference database. The assign.ensembleTax algorithm computes ensemble taxonomic assignments for each ASV in a data set based on any number of taxonomic assignments determined with independent methods. Various parameters allow analysts to prioritize obtaining either more ASVs with more predicted clade names or more robust clade name predictions supported by multiple independent methods in ensemble taxonomic assignments.
RESULTS: The ensembleTax R package is used to compute two sets of ensemble taxonomic assignments for a collection of protistan ASVs sampled from the coastal ocean. Comparisons of taxonomic assignments predicted by individual methods with those predicted by ensemble methods show that conservative implementations of the ensembleTax package minimize disagreements between taxonomic assignments predicted by individual and ensemble methods, but result in ASVs with fewer ranks assigned taxonomy. Less conservative implementations of the ensembleTax package result in an increased fraction of ASVs classified at all taxonomic ranks, but increase the number of ASVs for which ensemble assignments disagree with those predicted by individual methods.
DISCUSSION: We discuss how implementation of the ensembleTax R package may be optimized to address specific scientific objectives based on the results of the application of the ensembleTax package to marine protist communities. While further work is required to evaluate the accuracy of ensemble taxonomic assignments relative to taxonomic assignments predicted by individual methods, we also discuss scenarios where ensemble methods are expected to improve the accuracy of taxonomy prediction for ASVs.
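The trade-off between conservative and less conservative ensemble assignments can be illustrated with a rank-by-rank voting sketch in Python. This is not the ensembleTax package (which is written in R) and does not reproduce its algorithms or parameters: the function name, the `min_agree` parameter and the example lineages are hypothetical, and the sketch assumes all assignments already use the same nomenclature and rank order.

```python
from collections import Counter


def ensemble_assignment(assignments, min_agree):
    """Combine per-method lineages rank by rank.

    assignments : list of lineages (tuples, root first); None marks a rank
                  that a method left unassigned
    min_agree   : number of methods that must support a name at a rank
                  (hypothetical parameter; higher values are more conservative)
    The consensus stops at the first rank without enough support or with a tie.
    """
    consensus = []
    n_ranks = min(len(lineage) for lineage in assignments)
    for rank in range(n_ranks):
        names = [lineage[rank] for lineage in assignments if lineage[rank] is not None]
        top = Counter(names).most_common(2)
        if not top or top[0][1] < min_agree:
            break
        if len(top) == 2 and top[1][1] == top[0][1]:
            break  # tie between two names: no consensus at this rank
        consensus.append(top[0][0])
    return tuple(consensus)


# Illustrative assignments of one ASV by three independent methods.
example_assignments = [
    ("Eukaryota", "Stramenopiles", "Bacillariophyta"),
    ("Eukaryota", "Stramenopiles", "Bacillariophyta"),
    ("Eukaryota", "Alveolata", "Ciliophora"),
]
conservative = ensemble_assignment(example_assignments, min_agree=3)
permissive = ensemble_assignment(example_assignments, min_agree=2)
```

Requiring all three methods to agree yields only the top rank, while a two-method majority assigns all three ranks but disagrees with the third method, mirroring the behaviour reported in the Results.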
Affiliation(s)
- Dylan Catlett
- Earth Research Institute, University of California, Santa Barbara, Santa Barbara, CA, United States of America
- Kevin Son
- Earth Research Institute, University of California, Santa Barbara, Santa Barbara, CA, United States of America
- Connie Liang
- Earth Research Institute, University of California, Santa Barbara, Santa Barbara, CA, United States of America
7
Abstract
BACKGROUND: Taxonomic assignment is a key step in the identification of human viral pathogens. Current tools for taxonomic assignment from sequencing reads based on alignment or alignment-free k-mer approaches may not perform optimally in cases where the sequences diverge significantly from the reference sequences. Furthermore, many tools may not incorporate the genomic coverage of assigned reads as part of overall likelihood of a correct taxonomic assignment for a sample.
RESULTS: In this paper, we describe the development of a pipeline that incorporates a multi-task learning model based on convolutional neural network (MT-CNN) and a Bayesian ranking approach to identify and rank the most likely human virus from sequence reads. For taxonomic assignment of reads, the MT-CNN model outperformed Kraken 2, Centrifuge, and Bowtie 2 on reads generated from simulated divergent HIV-1 genomes and was more sensitive in identifying SARS as the closest relation in four RNA sequencing datasets for SARS-CoV-2 virus. For genomic region assignment of assigned reads, the MT-CNN model performed competitively compared with Bowtie 2 and the region assignments were used for estimation of genomic coverage that was incorporated into a naïve Bayesian network together with the proportion of taxonomic assignments to rank the likelihood of candidate human viruses from sequence data.
CONCLUSIONS: We have developed a pipeline that combines a novel MT-CNN model that is able to identify viruses with divergent sequences together with assignment of the genomic region, with a Bayesian approach to ranking of taxonomic assignments by taking into account both the number of assigned reads and genomic coverage. The pipeline is available at GitHub via https://github.com/MaHaoran627/CNN_Virus .
Affiliation(s)
- Haoran Ma
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117592 Singapore, Singapore
- Tin Wee Tan
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117592 Singapore, Singapore
- National Supercomputing Centre (NSCC), 138632 Singapore, Singapore
- Kenneth Hon Kim Ban
- Department of Biochemistry, Yong Loo Lin School of Medicine, National University of Singapore, 117592 Singapore, Singapore
- National Supercomputing Centre (NSCC), 138632 Singapore, Singapore
8
Stefani F, Bencherif K, Sabourin S, Hadj-Sahraoui AL, Banchini C, Séguin S, Dalpé Y. Taxonomic assignment of arbuscular mycorrhizal fungi in an 18S metagenomic dataset: a case study with saltcedar (Tamarix aphylla). Mycorrhiza 2020; 30:243-255. [PMID: 32180012 DOI: 10.1007/s00572-020-00946-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/16/2019] [Accepted: 03/06/2020] [Indexed: 06/10/2023]
Affiliation(s)
- Franck Stefani
- Agriculture and Agri-Food Canada, Ottawa Research and Development Centre, 960 Carling Avenue, Ottawa, ON, K1A 0C6, Canada.
- Karima Bencherif
- Faculté des Sciences de la Nature et de la Vie, Université de Djelfa, Route de Moudjbara, BP 3117, 17000, Djelfa, Algeria
- Stéphanie Sabourin
- Agriculture and Agri-Food Canada, Ottawa Research and Development Centre, 960 Carling Avenue, Ottawa, ON, K1A 0C6, Canada
- Anissa Lounès Hadj-Sahraoui
- UR 4492 - UCEIV - Unité de Chimie Environnementale et Interactions sur le Vivant, SFR Condorcet FR CNRS 3417, Université Littoral Côte d'Opale, F-62228, Calais Cedex, France
- Claudia Banchini
- Agriculture and Agri-Food Canada, Ottawa Research and Development Centre, 960 Carling Avenue, Ottawa, ON, K1A 0C6, Canada
- Sylvie Séguin
- Agriculture and Agri-Food Canada, Ottawa Research and Development Centre, 960 Carling Avenue, Ottawa, ON, K1A 0C6, Canada
- Yolande Dalpé
- Agriculture and Agri-Food Canada, Ottawa Research and Development Centre, 960 Carling Avenue, Ottawa, ON, K1A 0C6, Canada
9
Wylezich C, Belka A, Hanke D, Beer M, Blome S, Höper D. Metagenomics for broad and improved parasite detection: a proof-of-concept study using swine faecal samples. Int J Parasitol 2019; 49:769-777. [PMID: 31361998 DOI: 10.1016/j.ijpara.2019.04.007] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Received: 02/26/2019] [Revised: 04/18/2019] [Accepted: 04/24/2019] [Indexed: 01/10/2023]
Abstract
Efficient and reliable identification of emerging pathogens is crucial for the design and implementation of timely and proportionate control strategies. This is difficult if the pathogen is so far unknown or only distantly related to known pathogens. Diagnostic metagenomics - an undirected, broad and sensitive method for the efficient identification of pathogens - has frequently been used for virus and bacteria detection, but seldom applied to parasite identification. Here, metagenomics datasets prepared from swine faeces using an unbiased sample processing approach with RNA serving as starting material were re-analysed with respect to parasite detection. The taxonomic identification tool RIEMS, used for initial detection, provided basic hints on potential pathogens contained in the datasets. The suspected parasites/intestinal protists (Blastocystis, Entamoeba, Iodamoeba, Neobalantidium, Tetratrichomonas) were verified by subsequent reference mapping analyses on the basis of rRNA sequences. Nearly full-length gene sequences could be extracted from the RNA-derived datasets. In the case of Blastocystis, subtyping was possible, with subtype (ST)15 discovered in swine faeces for the first known time. Some of the candidates suspected using RIEMS turned out to be false positives caused by the poor status of sequences in publicly available databases. Altogether, 11 different species/STs of parasites/intestinal protists were detected in 34 out of 41 datasets extracted from metagenomics data. The approach operates without the primer bias that typically hampers amplicon-based approaches, and allows the detection and taxonomic classification, including subtyping, of protist and metazoan endobionts (parasites, commensals or mutualists) based on an abundant biomarker, the 18S rRNA. The generic nature of the approach also allows evaluation of interdependencies that induce mutualistic or pathogenic effects that are often not clear for many intestinal protists and perhaps other parasites. Thus, metagenomics has the potential for generic pathogen identification beyond the characterisation of viruses and bacteria when starting from RNA instead of DNA.
Affiliation(s)
- Claudia Wylezich
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut (FLI), Südufer 10, 17493 Greifswald-Insel Riems, Germany.
- Ariane Belka
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut (FLI), Südufer 10, 17493 Greifswald-Insel Riems, Germany
- Dennis Hanke
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut (FLI), Südufer 10, 17493 Greifswald-Insel Riems, Germany
- Martin Beer
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut (FLI), Südufer 10, 17493 Greifswald-Insel Riems, Germany
- Sandra Blome
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut (FLI), Südufer 10, 17493 Greifswald-Insel Riems, Germany
- Dirk Höper
- Institute of Diagnostic Virology, Friedrich-Loeffler-Institut (FLI), Südufer 10, 17493 Greifswald-Insel Riems, Germany.
10
Henderson G, Yilmaz P, Kumar S, Forster RJ, Kelly WJ, Leahy SC, Guan LL, Janssen PH. Improved taxonomic assignment of rumen bacterial 16S rRNA sequences using a revised SILVA taxonomic framework. PeerJ 2019; 7:e6496. [PMID: 30863673 PMCID: PMC6407505 DOI: 10.7717/peerj.6496] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Received: 07/30/2018] [Accepted: 01/21/2019] [Indexed: 11/20/2022]
Abstract
The taxonomy and associated nomenclature of many taxa of rumen bacteria are poorly defined within databases of 16S rRNA genes. This lack of resolution results in inadequate definition of microbial community structures, with large parts of the community designated as incertae sedis, unclassified, or uncultured within families, orders, or even classes. We have begun resolving these poorly-defined groups of rumen bacteria so that they can be named for use in microbial community profiling. We used the previously-reported global rumen census (GRC) dataset consisting of >4.5 million partial bacterial 16S rRNA gene sequences amplified from 684 rumen samples and representing a wide range of animal hosts and diets. Representative sequences from the 8,985 largest operational units (groups of sequences sharing >97% sequence similarity, and covering 97.8% of all sequences in the GRC dataset) were used to identify 241 pre-defined clusters (mainly at genus or family level) of abundant rumen bacteria in the ARB SILVA 119 framework. A total of 99 of these clusters (containing 63.8% of all GRC sequences) had no unique taxonomic identifier or an inadequate one, and each was given a unique nomenclature. We assessed this improved framework by comparing taxonomic assignments of bacterial 16S rRNA gene sequence data in the GRC dataset with those made using the original SILVA 119 framework, and three other frameworks. The two SILVA frameworks performed best at assigning sequences to genus-level taxa. The SILVA 119 framework allowed 55.4% of the sequence data to be assigned to 751 uniquely identifiable genus-level groups. The improved framework increased this to 87.1% of all sequences being assigned to one of 871 uniquely identifiable genus-level groups. The new designations were included in the SILVA 123 release (https://www.arb-silva.de/documentation/release-123/) and will be perpetuated in future releases.
Affiliation(s)
- Gemma Henderson
- Grasslands Research Centre, AgResearch, Palmerston North, New Zealand
- Pelin Yilmaz
- Microbial Genomics and Bioinformatics Research Group, Max Planck Institute for Marine Microbiology, Bremen, Germany
- Sandeep Kumar
- Grasslands Research Centre, AgResearch, Palmerston North, New Zealand
- Robert J Forster
- Lethbridge Research and Development Centre, Agriculture and Agri-Food Canada, Lethbridge, AB, Canada
- William J Kelly
- Grasslands Research Centre, AgResearch, Palmerston North, New Zealand
- Sinead C Leahy
- Grasslands Research Centre, AgResearch, Palmerston North, New Zealand
- Le Luo Guan
- Department of Agricultural, Food and Nutritional Science, University of Alberta, Edmonton, AB, Canada
- Peter H Janssen
- Grasslands Research Centre, AgResearch, Palmerston North, New Zealand
11
Abstract
BACKGROUND: With the advances in the next-generation sequencing technologies, researchers can now rapidly examine the composition of samples from humans and their surroundings. To enhance the accuracy of taxonomy assignments in metagenomic samples, we developed a method that allows multiple mismatch probabilities from different genomes.
RESULTS: We extended the algorithm of taxonomic assignment of metagenomic sequence reads (TAMER) by developing an improved method that can set a different mismatch probability for each genome rather than imposing a single parameter for all genomes, thereby obtaining a greater degree of accuracy. This method, which we call TADIP (Taxonomic Assignment of metagenomics based on DIfferent Probabilities), was comprehensively tested in simulated and real datasets. The results support that TADIP improved the performance of TAMER especially in large sample size datasets with high complexity.
CONCLUSIONS: TADIP was developed as a statistical model to improve the estimate accuracy of taxonomy assignments. Based on its varying mismatch probability setting and correlated variance matrix setting, its performance was enhanced for high complexity samples when compared with TAMER.
Affiliation(s)
- Yujing Yao
- Department of Biostatistics, Columbia University, New York, NY, USA
- Zhezhen Jin
- Department of Biostatistics, Columbia University, New York, NY, USA
- Joseph H Lee
- Sergievsky Center, Taub Institute, and Departments of Epidemiology and Neurology, Columbia University, New York, NY, USA.
- Sergievsky Center, Columbia University, 630 West 168th Street, P&S Unit 16, New York, NY, 10032, USA.
12
Murali A, Bhargava A, Wright ES. IDTAXA: a novel approach for accurate taxonomic classification of microbiome sequences. Microbiome 2018; 6:140. [PMID: 30092815 PMCID: PMC6085705 DOI: 10.1186/s40168-018-0521-5] [Citation(s) in RCA: 239] [Impact Index Per Article: 39.8] [Received: 03/21/2018] [Accepted: 07/25/2018] [Indexed: 05/11/2023]
Abstract
BACKGROUND: Microbiome studies often involve sequencing a marker gene to identify the microorganisms in samples of interest. Sequence classification is a critical component of this process, whereby sequences are assigned to a reference taxonomy containing known sequence representatives of many microbial groups. Previous studies have shown that existing classification programs often assign sequences to reference groups even if they belong to novel taxonomic groups that are absent from the reference taxonomy. This high rate of "over classification" is particularly detrimental in microbiome studies because reference taxonomies are far from comprehensive.
RESULTS: Here, we introduce IDTAXA, a novel approach to taxonomic classification that employs principles from machine learning to reduce over classification errors. Using multiple reference taxonomies, we demonstrate that IDTAXA has higher accuracy than popular classifiers such as BLAST, MAPSeq, QIIME, SINTAX, SPINGO, and the RDP Classifier. Similarly, IDTAXA yields far fewer over classifications on Illumina mock microbial community data when the expected taxa are absent from the training set. Furthermore, IDTAXA offers many practical advantages over other classifiers, such as maintaining low error rates across varying input sequence lengths and withholding classifications from input sequences composed of random nucleotides or repeats.
CONCLUSIONS: IDTAXA's classifications may lead to different conclusions in microbiome studies because of the substantially reduced number of taxa that are incorrectly identified through over classification. Although misclassification error is relatively minor, we believe that many remaining misclassifications are likely caused by errors in the reference taxonomy. We describe how IDTAXA is able to identify many putative mislabeling errors in reference taxonomies, enabling training sets to be automatically corrected by eliminating spurious sequences. IDTAXA is part of the DECIPHER package for the R programming language, available through the Bioconductor repository or accessible online ( http://DECIPHER.codes ).
Affiliation(s)
- Adithya Murali
- Department of Computer Sciences, University of Wisconsin-Madison, Madison, WI 53715 USA
- Aniruddha Bhargava
- Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI 53715 USA
- Erik S. Wright
- Department of Biomedical Informatics, Pittsburgh Center for Evolutionary Biology and Medicine, School of Medicine, University of Pittsburgh, 426 Bridgeside Point II, 450 Technology Dr, Pittsburgh, PA 15219 USA
|
13
|
Zheng Q, Bartow-McKenney C, Meisel JS, Grice EA. HmmUFOtu: An HMM and phylogenetic placement based ultra-fast taxonomic assignment and OTU picking tool for microbiome amplicon sequencing studies. Genome Biol 2018; 19:82. [PMID: 29950165 PMCID: PMC6020470 DOI: 10.1186/s13059-018-1450-0] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 11/27/2017] [Accepted: 05/09/2018] [Indexed: 02/01/2023] Open
Abstract
Culture-independent analysis of microbial communities frequently relies on amplification and sequencing of the prokaryotic 16S ribosomal RNA gene. Typical analysis pipelines group sequences into operational taxonomic units (OTUs) to infer taxonomic and phylogenetic relationships. Here, we present HmmUFOtu, a novel tool for processing microbiome amplicon sequencing data, which performs rapid per-read phylogenetic placement, followed by phylogenetically informed clustering into OTUs and taxonomy assignment. Compared to standard pipelines, HmmUFOtu more accurately and reliably recapitulates microbial community diversity and composition in simulated and real datasets without relying on heuristics or sacrificing speed or accuracy.
Affiliation(s)
- Qi Zheng
- Department of Dermatology and Microbiology, Perelman School of Medicine, University of Pennsylvania, 421 Curie Blvd, BRB 1046/7, Philadelphia, PA 19104 USA
- Casey Bartow-McKenney
- Genomics and Computational Biology Program, Department of Dermatology, University of Pennsylvania, Philadelphia, USA
- Jacquelyn S. Meisel
- Genomics and Computational Biology Program, Department of Dermatology, University of Pennsylvania, Philadelphia, USA
- Elizabeth A. Grice
- Department of Dermatology and Microbiology, Perelman School of Medicine, University of Pennsylvania, 421 Curie Blvd, BRB 1046/7, Philadelphia, PA 19104 USA
- Genomics and Computational Biology Program, Department of Dermatology, University of Pennsylvania, Philadelphia, USA
|
14
|
Balech B, Sandionigi A, Manzari C, Trucchi E, Tullo A, Licciulli F, Grillo G, Sbisà E, De Felici S, Saccone C, D'Erchia AM, Cesaroni D, Casiraghi M, Vicario S. Tackling critical parameters in metazoan meta-barcoding experiments: a preliminary study based on coxI DNA barcode. PeerJ 2018; 6:e4845. [PMID: 29915686 PMCID: PMC6004112 DOI: 10.7717/peerj.4845] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 09/07/2017] [Accepted: 05/04/2018] [Indexed: 11/21/2022] Open
Abstract
DNA meta-barcoding is now a powerful instrument capable of quickly discovering the biodiversity of an environmental sample by integrating the DNA barcoding approach with High Throughput Sequencing technologies. It mainly consists of the parallel reading of one or more informative genomic fragments able to discriminate living entities. Although this approach has been widely studied, several of the steps essential to its advanced application still need optimization. A fundamental element concerns the standardization of bioinformatic analysis pipelines. The aim of the present study was to underline a number of critical parameters of laboratory material preparation and taxonomic assignment pipelines in DNA meta-barcoding experiments using the cytochrome oxidase subunit-I (coxI) barcode region, known as a suitable molecular marker for animal species identification. We compared nine taxonomic assignment pipelines, including a custom in-house method, based on Hidden Markov Models. Moreover, we evaluated the potential influence of universal primer amplification bias in qPCR, as well as the correlation between GC content and taxonomic assignment results. The pipelines were tested on a community of known terrestrial invertebrates collected by pitfall traps from a chestnut forest in Italy. Although the present analysis was not exhaustive and needs additional investigation, our results suggest some potential improvements in laboratory material preparation and the introduction of additional parameters in taxonomic assignment pipelines. These include the correct setup of the OTU clustering threshold, the calibration of GC content affecting sequencing quality and taxonomic classification, as well as the evaluation of PCR primer amplification bias on the final biodiversity pattern. Thus, careful attention and further validation/optimization of the above-mentioned variables would be required in a DNA meta-barcoding experimental routine.
Affiliation(s)
- Bachir Balech
- Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari-Consiglio Nazionale delle Ricerche, Bari, Italy; Dipartimento di Biologia, Università degli studi di Bari 'Aldo Moro', Bari, Italy
- Anna Sandionigi
- Dipartimento di Biotecnologie e Bioscienze-Zooplantlab, Università degli studi di Milano Bicocca, Milan, Italy
- Caterina Manzari
- Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari-Consiglio Nazionale delle Ricerche, Bari, Italy
- Emiliano Trucchi
- Dipartimento di Biologia, Università di Roma Tor Vergata, Rome, Italy
- Apollonia Tullo
- Istituto di Biomembrane, Bioenergetica e Biotecnologie Molecolari-Consiglio Nazionale delle Ricerche, Bari, Italy; Istituto di Tecnologie Biomediche-Consiglio Nazionale delle Ricerche, Bari, Italy
- Flavio Licciulli
- Istituto di Tecnologie Biomediche-Consiglio Nazionale delle Ricerche, Bari, Italy
- Giorgio Grillo
- Istituto di Tecnologie Biomediche-Consiglio Nazionale delle Ricerche, Bari, Italy
- Elisabetta Sbisà
- Istituto di Tecnologie Biomediche-Consiglio Nazionale delle Ricerche, Bari, Italy
- Stefano De Felici
- Dipartimento di Biologia, Università di Roma Tor Vergata, Rome, Italy; Istituto di Biologia Agroambientale e Forestale-Consiglio Nazionale delle Ricerche, Rome, Italy
- Cecilia Saccone
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari 'Aldo Moro', Bari, Italy
- Anna Maria D'Erchia
- Dipartimento di Bioscienze, Biotecnologie e Biofarmaceutica, Università degli Studi di Bari 'Aldo Moro', Bari, Italy
- Maurizio Casiraghi
- Dipartimento di Biotecnologie e Bioscienze-Zooplantlab, Università degli studi di Milano Bicocca, Milan, Italy
- Saverio Vicario
- Istituto sull'Inquinamento Atmosferico-Consiglio Nazionale delle Ricerche, Bari, Italy
|
15
|
Abstract
When performing bioforensic casework, it is important to be able to reliably detect the presence of a particular organism in a metagenomic sample, even if the organism is only present in a trace amount. For this task, it is common to use a sequence classification program that determines the taxonomic affiliation of individual sequence reads by comparing them to reference database sequences. As metagenomic data sets often consist of millions or billions of reads that need to be compared to reference databases containing millions of sequences, such sequence classification programs typically use search heuristics and databases with reduced sequence diversity to speed up the analysis, which can lead to incorrect assignments. Thus, in a bioforensic setting where correct assignments are paramount, assignments of interest made by "first-pass" classifiers should be confirmed using the most precise methods and comprehensive databases available. In this study we present a BLAST-based method for validating the assignments made by less precise sequence classification programs, with optimal parameters for filtering of BLAST results determined via simulation of sequence reads from genomes of interest, and we apply the method to the detection of four pathogenic organisms. The software implementing the method is open source and freely available.
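The confirmation step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' software: it assumes BLAST+ tabular output in the default `-outfmt 6` column order, and the function name `confirm_assignments`, the `first_pass` and `subject_taxon` lookups, and both thresholds are hypothetical placeholders rather than the simulation-tuned parameters mentioned above.

```python
import csv
import io

# Default column order of BLAST+ tabular output (-outfmt 6).
BLAST_COLUMNS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
                 "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def confirm_assignments(blast_tsv, first_pass, subject_taxon,
                        min_pident=97.0, min_length=100):
    """Report, per read, whether filtered BLAST hits support the first-pass taxon.

    blast_tsv     -- text of BLAST tabular output
    first_pass    -- {read_id: taxon} from the fast classifier (hypothetical input)
    subject_taxon -- {subject_id: taxon} lookup (hypothetical input)
    The thresholds are illustrative placeholders, not tuned values.
    """
    best = {}  # read_id -> (best bitscore, taxa of all hits tied at that score)
    reader = csv.DictReader(io.StringIO(blast_tsv),
                            fieldnames=BLAST_COLUMNS, delimiter="\t")
    for row in reader:
        # Discard weak hits before looking at taxonomy.
        if float(row["pident"]) < min_pident or int(row["length"]) < min_length:
            continue
        taxon = subject_taxon.get(row["sseqid"])
        if taxon is None:
            continue
        score = float(row["bitscore"])
        current = best.get(row["qseqid"])
        if current is None or score > current[0]:
            best[row["qseqid"]] = (score, {taxon})
        elif score == current[0]:
            current[1].add(taxon)

    # Confirmed only if every top-scoring hit agrees with the first-pass taxon.
    return {read_id: read_id in best and best[read_id][1] == {taxon}
            for read_id, taxon in first_pass.items()}
```

In this sketch a read whose best hits are tied between two taxa is treated as unconfirmed, which is a deliberately conservative choice for a setting where false positives are costly.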
Affiliation(s)
- Adam L. Bazinet
- National Biodefense Analysis and Countermeasures Center, Fort Detrick, MD, USA
- Brian D. Ondov
- National Biodefense Analysis and Countermeasures Center, Fort Detrick, MD, USA
- National Human Genome Research Institute, Bethesda, MD, USA
- Daniel D. Sommer
- National Biodefense Analysis and Countermeasures Center, Fort Detrick, MD, USA
|
16
|
Abstract
With the advent of low-cost, high-throughput sequencing, taxonomic profiling of complex microbial communities through 16S rRNA marker gene surveys has received widespread interest, uncovering a wealth of information concerning the bacterial composition of microbial communities, as well as their association with health and disease. On the other hand, little is known concerning the eukaryotic components of microbiomes. Such components include single-celled parasites and multicellular worms that are known to adversely impact the health of millions of people worldwide. Current molecular methods to detect eukaryotic microbes rely on the use of directed PCR analyses that are limited by their inability to inform beyond the taxon targeted. With increasing interest to develop equivalent marker-based surveys as used for bacteria, this chapter presents a stepwise protocol to characterize the diversity of eukaryotic microbes in a sample, using amplicon sequencing of hypervariable regions in the eukaryotic 18S rRNA gene.
|
17
|
Beisser D, Graupner N, Grossmann L, Timm H, Boenigk J, Rahmann S. TaxMapper: an analysis tool, reference database and workflow for metatranscriptome analysis of eukaryotic microorganisms. BMC Genomics 2017; 18:787. [PMID: 29037173 PMCID: PMC5644092 DOI: 10.1186/s12864-017-4168-6] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Received: 07/13/2017] [Accepted: 10/05/2017] [Indexed: 12/17/2022] Open
Abstract
Background High-throughput sequencing (HTS) technologies are increasingly applied to analyse complex microbial ecosystems by mRNA sequencing of whole communities, also known as metatranscriptome sequencing. This approach is at the moment largely limited to prokaryotic communities and communities of few eukaryotic species with sequenced genomes. For eukaryotes the analysis is hindered mainly by a low and fragmented coverage of the reference databases to infer the community composition, but also by lack of automated workflows for the task. Results From the databases of the National Center for Biotechnology Information and Marine Microbial Eukaryote Transcriptome Sequencing Project, 142 references were selected in such a way that the taxa represent the main lineages within each of the seven supergroups of eukaryotes and possess predominantly complete transcriptomes or genomes. From these references, we created an annotated microeukaryotic reference database. We developed a tool called TaxMapper for reliable mapping of sequencing reads against this database and filtering of unreliable assignments. For filtering, a classifier was trained and tested on each of the following: sequences of taxa in the database, sequences of taxa related to those in the database, and random sequences. Additionally, TaxMapper is part of a metatranscriptomic Snakemake workflow developed to perform quality assessment, functional and taxonomic annotation and (multivariate) statistical analysis including environmental data. The workflow is provided and described in detail to empower researchers to apply it for metatranscriptome analysis of any environmental sample. Conclusions TaxMapper shows superior performance compared to standard approaches, resulting in a higher number of true positive taxonomic assignments.
Both the TaxMapper tool and the workflow are available as open-source code at Bitbucket under the MIT license: https://bitbucket.org/dbeisser/taxmapper and as a Bioconda package: https://bioconda.github.io/recipes/taxmapper/README.html. Electronic supplementary material: The online version of this article (doi:10.1186/s12864-017-4168-6) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Daniela Beisser
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
- Nadine Graupner
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
- Lars Grossmann
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
- Henning Timm
- Genome Informatics, University of Duisburg-Essen, University Hospital Essen, Hufelandstr. 55, Essen, 45147, Germany
- Jens Boenigk
- Biodiversity, University of Duisburg-Essen, Universitätsstr. 5, Essen, 45141, Germany
- Sven Rahmann
- Genome Informatics, University of Duisburg-Essen, University Hospital Essen, Hufelandstr. 55, Essen, 45147, Germany
|
18
|
Abstract
BACKGROUND Metagenomic sequencing studies are becoming increasingly popular, with prominent examples including the sequencing of human microbiomes and diverse environments. A fundamental computational problem in this context is read classification, i.e. the assignment of each read to a taxonomic label. Due to the large number of reads produced by modern high-throughput sequencing technologies and the rapidly increasing number of available reference genomes, software tools for fast and accurate metagenomic read classification are urgently needed. RESULTS We present cuCLARK, a read-level classifier for CUDA-enabled GPUs, based on the fast and accurate classification of metagenomic sequences using reduced k-mers (CLARK) method. Using the processing power of a single Titan X GPU, cuCLARK can reach classification speeds of up to 50 million reads per minute. Corresponding speedups for species- (genus-)level classification range between 3.2 and 6.6 (3.7 and 6.4) compared to multi-threaded CLARK executed on a 16-core Xeon CPU workstation. CONCLUSION cuCLARK can perform metagenomic read classification at superior speeds on CUDA-enabled GPUs. It is free software licensed under GPL and can be downloaded at https://github.com/funatiq/cuclark free of charge.
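The general idea behind classification with target-specific k-mers can be sketched in a few lines of Python. This is a simplified CPU illustration under stated assumptions, not the cuCLARK or CLARK implementation: the function names are hypothetical, reverse complements and confidence scores are ignored, and a read is simply assigned to the target that owns the most of its k-mers.

```python
from collections import Counter


def kmers(seq, k):
    """Yield every overlapping k-mer of seq."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_discriminative_index(references, k):
    """Map each k-mer that occurs in exactly one target to that target."""
    owner = {}
    shared = set()  # k-mers seen in more than one target
    for target, seq in references.items():
        for kmer in set(kmers(seq, k)):
            if kmer in shared:
                continue
            if kmer in owner and owner[kmer] != target:
                # Seen in another target: no longer discriminative.
                del owner[kmer]
                shared.add(kmer)
            else:
                owner[kmer] = target
    return owner


def classify(read, index, k):
    """Return the target with the most k-mer hits, or None if none or tied."""
    hits = Counter(index[kmer] for kmer in kmers(read, k) if kmer in index)
    if not hits:
        return None
    ranked = hits.most_common(2)
    if len(ranked) == 2 and ranked[0][1] == ranked[1][1]:
        return None  # ambiguous: leave the read unclassified
    return ranked[0][0]
```

Because each read is looked up against the index independently of every other read, the per-read work parallelizes naturally, which is the property a GPU implementation can exploit.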
Affiliation(s)
- Robin Kobus
- Institute of Computer Science, Johannes Gutenberg University Mainz, Staudingerweg 9, Mainz, 55435 Germany
- Christian Hundt
- Institute of Computer Science, Johannes Gutenberg University Mainz, Staudingerweg 9, Mainz, 55435 Germany
- André Müller
- Institute of Computer Science, Johannes Gutenberg University Mainz, Staudingerweg 9, Mainz, 55435 Germany
- Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University Mainz, Staudingerweg 9, Mainz, 55435 Germany
|
19
|
Abstract
Metagenomics has revolutionized microbiological studies during the past decade and provided new insights into the diversity, dynamics, and metabolic potential of natural microbial communities. However, metagenomics still represents a field in development, and standardized tools and approaches to handle and compare metagenomes have not been established yet. An important reason accounting for the latter is the continuous changes in the type of sequencing data available, for example, long versus short sequencing reads. Here, we provide a guide to bioinformatic pipelines developed to accomplish the following tasks, focusing primarily on those developed by our team: (i) assemble a metagenomic dataset; (ii) determine the level of sequence coverage obtained and the amount of sequencing required to obtain complete coverage; (iii) identify the taxonomic affiliation of a metagenomic read or assembled contig; and (iv) determine differentially abundant genes, pathways, and species between different datasets. Most of these pipelines do not depend on the type of sequences available or can be easily adjusted to fit different types of sequences, and are freely available (for instance, through our lab Web site: http://www.enve-omics.gatech.edu/). The limitations of current approaches, as well as the computational aspects that can be further improved, will also be briefly discussed. The work presented here provides practical guidelines on how to perform metagenomic analysis of microbial communities characterized by varied levels of diversity and establishes approaches to handle the resulting data, independent of the sequencing platform employed.
Affiliation(s)
- Chengwei Luo
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, Georgia, USA; School of Biology, Georgia Institute of Technology, Atlanta, Georgia, USA; School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
|