1
Meyer F, Robertson G, Deng ZL, Koslicki D, Gurevich A, McHardy AC. CAMI Benchmarking Portal: online evaluation and ranking of metagenomic software. Nucleic Acids Res 2025:gkaf369. [PMID: 40331433 DOI: 10.1093/nar/gkaf369] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2025] [Revised: 04/17/2025] [Accepted: 04/23/2025] [Indexed: 05/08/2025] Open
Abstract
Finding appropriate software and parameter settings to process shotgun metagenome data is essential for meaningful metagenomic analyses. To enable objective and comprehensive benchmarking of metagenomic software, the community-led initiative for the Critical Assessment of Metagenome Interpretation (CAMI) promotes standards and best practices. Since 2015, CAMI has provided comprehensive datasets, benchmarking guidelines, and challenges. However, benchmarking had to be conducted offline, requiring substantial time and technical expertise and leading to gaps in results between challenges. We introduce the CAMI Benchmarking Portal, a central repository of CAMI resources and a web server for the evaluation and ranking of metagenome assembly, binning, and taxonomic profiling software. The portal simplifies evaluation, enabling users to easily compare their results with previous submissions and those of other users through a variety of metrics and visualizations. As a demonstration, we benchmark software performance on the marine dataset of the CAMI II challenge. The portal currently hosts 28 675 results and is freely available at https://cami-challenge.org/.
Affiliation(s)
- Fernando Meyer
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research (HZI), 38124 Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, 38106 Braunschweig, Germany
  - Initiative for the Critical Assessment of Metagenome Interpretation (CAMI)
- Gary Robertson
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research (HZI), 38124 Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, 38106 Braunschweig, Germany
  - Initiative for the Critical Assessment of Metagenome Interpretation (CAMI)
- Zhi-Luo Deng
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research (HZI), 38124 Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, 38106 Braunschweig, Germany
  - Initiative for the Critical Assessment of Metagenome Interpretation (CAMI)
- David Koslicki
  - Initiative for the Critical Assessment of Metagenome Interpretation (CAMI)
  - Computer Science and Engineering, Penn State University, University Park, PA 16802, United States
  - Biology, Penn State University, University Park, PA 16802, United States
- Alexey Gurevich
  - Initiative for the Critical Assessment of Metagenome Interpretation (CAMI)
  - Helmholtz Institute for Pharmaceutical Research Saarland (HIPS), Helmholtz Centre for Infection Research (HZI), 66123 Saarbrücken, Germany
  - Center for Bioinformatics Saar and Saarland University, Saarland Informatics Campus, 66123 Saarbrücken, Germany
- Alice C McHardy
  - Computational Biology of Infection Research, Helmholtz Centre for Infection Research (HZI), 38124 Braunschweig, Germany
  - Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, 38106 Braunschweig, Germany
  - Initiative for the Critical Assessment of Metagenome Interpretation (CAMI)
  - German Center for Infection Research (DZIF), partner site Hannover Braunschweig, 38124 Braunschweig, Germany
  - Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, 30625 Hannover, Germany
2
Alser M, Eudine J, Mutlu O. Taming large-scale genomic analyses via sparsified genomics. Nat Commun 2025; 16:876. [PMID: 39837860 PMCID: PMC11751491 DOI: 10.1038/s41467-024-55762-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/19/2023] [Accepted: 12/20/2024] [Indexed: 01/23/2025] Open
Abstract
Searching for similar genomic sequences is an essential and fundamental step in biomedical research. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable faster and memory-efficient processing of the sparsified, shorter genomic sequences, while providing comparable accuracy to processing non-sparsified sequences. Sparsified genomics provides benefits to many genomic analyses and has broad applicability. Sparsifying genomic sequences accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing comparable memory footprint, 2x smaller index size, and more correctly detected variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x (1.62-1.9x when indexing is preprocessed) faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x (1.58-1.71x when indexing is preprocessed) faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign).
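As a rough illustration of the sparsification idea described in this abstract, the sketch below drops bases from a sequence according to a repeating keep/skip pattern. The `sparsify` helper and the pattern strings are assumptions made for illustration only; the abstract does not state how the excluded bases are chosen, so this should not be read as the authors' method.

```python
from itertools import cycle


def sparsify(seq: str, pattern: str = "10") -> str:
    """Keep only the bases of `seq` that line up with a '1' in `pattern`.

    `pattern` is a hypothetical keep/skip mask that is repeated along the
    sequence: '1' keeps the base at that position, '0' excludes it.
    """
    if not pattern or set(pattern) - {"0", "1"}:
        raise ValueError("pattern must be a non-empty string of 0s and 1s")
    # cycle() repeats the mask indefinitely; zip() stops at the end of seq.
    return "".join(base for base, keep in zip(seq, cycle(pattern)) if keep == "1")


if __name__ == "__main__":
    print(sparsify("ACGTACGT"))         # mask "10" keeps every other base
    print(sparsify("ACGTAC", "110"))    # mask "110" keeps two of every three
```

For two sequences to remain comparable, the same mask would presumably need to be applied to both the reads and the reference, so that matching happens entirely in the shorter, sparsified space.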
Affiliation(s)
- Mohammed Alser
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
  - Department of Computer Science, Georgia State University, Atlanta, GA, USA
  - Department of Clinical Pharmacy, University of Southern California, LA, CA, USA
- Julien Eudine
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
3
Şapcı AOB, Mirarab S. Memory-bound k-mer selection for large and evolutionarily diverse reference libraries. Genome Res 2024; 34:1455-1467. [PMID: 39209553 PMCID: PMC11529837 DOI: 10.1101/gr.279339.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2024] [Accepted: 08/06/2024] [Indexed: 09/04/2024]
Abstract
Using k-mers to find sequence matches is increasingly used in many bioinformatic applications, including metagenomic sequence classification. The accuracy of these downstream applications relies on the density of the reference databases, which are rapidly growing. Although the increased density provides hope for improvements in accuracy, scalability is a concern. Reference k-mers are kept in the memory during the query time, and saving all k-mers of these ever-expanding databases is fast becoming impractical. Several strategies for subsampling have been proposed, including minimizers and finding taxon-specific k-mers. However, we contend that these strategies are inadequate, especially when reference sets are taxonomically imbalanced, as are most microbial libraries. In this paper, we explore approaches for selecting a fixed-size subset of k-mers present in an ultra-large data set to include in a library such that the classification of reads suffers the least. Our experiments demonstrate the limitations of existing approaches, especially for novel and poorly sampled groups. We propose a library construction algorithm called k-mer RANKer (KRANK) that combines several components, including a hierarchical selection strategy with adaptive size restrictions and an equitable coverage strategy. We implement KRANK in highly optimized code and combine it with the locality-sensitive hashing classifier CONSULT-II to build a taxonomic classification and profiling method. On several benchmarks, KRANK k-mer selection significantly reduces memory consumption with minimal loss in classification accuracy. We show in extensive analyses based on CAMI benchmarks that KRANK outperforms k-mer-based alternatives in terms of taxonomic profiling and comes close to the best marker-based methods in terms of accuracy.
Affiliation(s)
- Ali Osman Berk Şapcı
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA
- Siavash Mirarab
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, California 92093, USA
  - Department of Electrical and Computer Engineering, University of California, San Diego, California 92093, USA
4
Shaw J, Yu YW. Rapid species-level metagenome profiling and containment estimation with sylph. Nat Biotechnol 2024:10.1038/s41587-024-02412-y. [PMID: 39379646 DOI: 10.1038/s41587-024-02412-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2024] [Accepted: 08/28/2024] [Indexed: 10/10/2024]
Abstract
Profiling metagenomes against databases allows for the detection and quantification of microorganisms, even at low abundances where assembly is not possible. We introduce sylph, a species-level metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. On the Critical Assessment of Metagenome Interpretation II (CAMI2) Marine dataset, sylph was the most accurate profiling method of seven tested. For multisample profiling, sylph took >10-fold less central processing unit time compared to Kraken2 and used 30-fold less memory. Sylph's ANI estimates provided an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson disease (PD) against 289,232 genomes while confirming known butyrate-PD associations at the strain level. Sylph took <1 min and 16 GB of random-access memory to profile metagenomes against 85,205 prokaryotic and 2,917,516 viral genomes, detecting 30-fold more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate containment ANI estimation even for low-coverage genomes.
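To make the notion of containment ANI concrete, the sketch below computes the naive point estimate ANI ≈ C^(1/k), where C is the fraction of a genome's k-mers that also occur in the sample. This follows from a simple model in which a k-mer survives unmutated with probability ANI^k. It is only an assumption-laden illustration: the function names are hypothetical, reverse complements and sketching are ignored, and the zero-inflated Poisson coverage correction that the abstract attributes to sylph is not reproduced here.

```python
def kmers(seq: str, k: int) -> set:
    """Return the set of all length-k substrings of `seq`."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment_ani(genome: str, sample: str, k: int = 31) -> float:
    """Naive containment-based ANI estimate of `genome` within `sample`.

    containment = |kmers(genome) & kmers(sample)| / |kmers(genome)|
    ANI estimate = containment ** (1 / k)
    No correction for low sequencing coverage is applied.
    """
    genome_kmers = kmers(genome, k)
    if not genome_kmers:
        raise ValueError("genome is shorter than k")
    shared = len(genome_kmers & kmers(sample, k))
    containment = shared / len(genome_kmers)
    return containment ** (1.0 / k)


if __name__ == "__main__":
    # Toy example with a small k so the k-mer sets are easy to inspect.
    print(containment_ani("ACGTAC", "ACGTTT", k=3))
```

At low coverage many genome k-mers are simply never sequenced, so this naive estimate is biased downward; that bias is the problem a coverage-aware statistical model is meant to address.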
Affiliation(s)
- Jim Shaw
  - Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
5
Irankhah L, Khorsand B, Naghibzadeh M, Savadi A. Analyzing the performance of short-read classification tools on metagenomic samples toward proper diagnosis of diseases. J Bioinform Comput Biol 2024:2450012. [PMID: 39287058 DOI: 10.1142/s0219720024500124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/19/2024]
Abstract
Accurate knowledge of the genome, virus and bacteria that have invaded our bodies is crucial for diagnosing many human diseases. The field of bioinformatics encompasses the complex computational methods required for this purpose. Metagenomics employs next-generation sequencing (NGS) technology to study and identify microbial communities in environmental samples. This technique allows for the measurement of the relative abundance of different microbes. Various tools are available for detecting bacterial species in sequenced metagenomic samples. In this study, we focus on well-known taxonomic classification tools such as MetaPhlAn4, Centrifuge, Kraken2, and Bracken, and evaluate their performance at the species level using synthetic and real datasets. The results indicate that MetaPhlAn4 exhibited high precision in identifying species in the simulated dataset, while Kraken2 had the best area under the precision-recall curve (AUPR) performance. Centrifuge, Kraken2, and Bracken showed accurate estimation of species abundances, unlike MetaPhlAn4, which had a higher L2 distance. In the real dataset analysis with samples from an inflammatory bowel disease (IBD) study, MetaPhlAn4 and Kraken2 had faster execution times, with differences in performance at family and species levels among the tools. Enterobacteriaceae and Pasteurellaceae were highlighted as the most abundant families by Centrifuge, Kraken2, and MetaPhlAn4, with variations in abundance among ulcerative colitis (UC), Crohn's disease (CD), and control non-IBD (CN) groups. Escherichia coli (E. coli) has the highest abundance among Enterobacteriaceae species in the CD and UC groups in comparison with the CN group. Bracken overestimated E. coli abundance, which calls for caution when interpreting results. The findings of this research can assist in selecting the appropriate short-read classifier, thereby aiding in the diagnosis of target diseases.
Affiliation(s)
- Leili Irankhah
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
- Babak Khorsand
  - Department of Neurology, University of California Irvine, CA, USA
- Mahmoud Naghibzadeh
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
- Abdorreza Savadi
  - Computer Engineering Department, Ferdowsi University of Mashhad, Mashhad, Iran
6
Zachariasen T, Russel J, Petersen C, Vestergaard GA, Shah S, Atienza Lopez P, Passali M, Turvey SE, Sørensen SJ, Lund O, Stokholm J, Brejnrod A, Thorsen J. MAGinator enables accurate profiling of de novo MAGs with strain-level phylogenies. Nat Commun 2024; 15:5734. [PMID: 38977664 PMCID: PMC11231285 DOI: 10.1038/s41467-024-49958-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/18/2023] [Accepted: 06/21/2024] [Indexed: 07/10/2024] Open
Abstract
Metagenomic sequencing has provided great advantages in the characterisation of microbiomes, but currently available analysis tools lack the ability to combine subspecies-level taxonomic resolution and accurate abundance estimation with functional profiling of assembled genomes. To define the microbiome and its associations with human health, improved tools are needed to enable comprehensive understanding of the microbial composition and elucidation of the phylogenetic and functional relationships between the microbes. Here, we present MAGinator, a freely available tool, tailored for profiling of shotgun metagenomics datasets. MAGinator provides de novo identification of subspecies-level microbes and accurate abundance estimates of metagenome-assembled genomes (MAGs). MAGinator utilises the information from both gene- and contig-based methods yielding insight into both taxonomic profiles and the origin of genes and genetic content, used for inference of functional content of each sample by host organism. Additionally, MAGinator facilitates the reconstruction of phylogenetic relationships between the MAGs, providing a framework to identify clade-level differences.
Affiliation(s)
- Trine Zachariasen
  - Department of Health and Technology, Section of Bioinformatics, Technical University of Denmark, Lyngby, Denmark
- Jakob Russel
  - Department of Biology, Section of Microbiology, University of Copenhagen, Copenhagen, Denmark
- Charisse Petersen
  - Department of Pediatrics, BC Children's Hospital, University of British Columbia, 950 West 28th Avenue, Vancouver, BC, Canada
- Gisle A Vestergaard
  - Department of Health and Technology, Section of Bioinformatics, Technical University of Denmark, Lyngby, Denmark
- Shiraz Shah
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Pablo Atienza Lopez
  - Danish Multiple Sclerosis Center, Department of Neurology, Copenhagen University Hospital, Rigshospitalet-Glostrup, Glostrup, Denmark
  - Department of Food Science, University of Copenhagen, Copenhagen, Denmark
- Moschoula Passali
  - Danish Multiple Sclerosis Center, Department of Neurology, Copenhagen University Hospital, Rigshospitalet-Glostrup, Glostrup, Denmark
- Stuart E Turvey
  - Department of Pediatrics, BC Children's Hospital, University of British Columbia, 950 West 28th Avenue, Vancouver, BC, Canada
- Søren J Sørensen
  - Department of Biology, Section of Microbiology, University of Copenhagen, Copenhagen, Denmark
- Ole Lund
  - Department of Health and Technology, Section of Bioinformatics, Technical University of Denmark, Lyngby, Denmark
- Jakob Stokholm
  - Department of Biology, Section of Microbiology, University of Copenhagen, Copenhagen, Denmark
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
- Asker Brejnrod
  - Department of Health and Technology, Section of Bioinformatics, Technical University of Denmark, Lyngby, Denmark
- Jonathan Thorsen
  - COPSAC, Copenhagen Prospective Studies on Asthma in Childhood, Herlev and Gentofte Hospital, University of Copenhagen, Copenhagen, Denmark
7
Şapcı AOB, Rachtman E, Mirarab S. CONSULT-II: accurate taxonomic identification and profiling using locality-sensitive hashing. Bioinformatics 2024; 40:btae150. [PMID: 38492564 PMCID: PMC10985673 DOI: 10.1093/bioinformatics/btae150] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 02/17/2024] [Accepted: 03/14/2024] [Indexed: 03/18/2024] Open
Abstract
MOTIVATION: Taxonomic classification of short reads and taxonomic profiling of metagenomic samples are well-studied yet challenging problems. The presence of species belonging to groups without close representation in a reference dataset is particularly challenging. While k-mer-based methods have performed well in terms of running time and accuracy, they tend to have reduced accuracy for such novel species. Thus, there is a growing need for methods that combine the scalability of k-mers with increased sensitivity. RESULTS: Here, we show that using locality-sensitive hashing (LSH) can increase the sensitivity of the k-mer-based search. Our method, which combines LSH with several heuristic techniques including soft lowest common ancestor labeling and voting, is more accurate than alternatives in both taxonomic classification of individual reads and abundance profiling. AVAILABILITY AND IMPLEMENTATION: CONSULT-II is implemented in C++, and the software, together with reference libraries, is publicly available on GitHub at https://github.com/bo1929/CONSULT-II.
Affiliation(s)
- Ali Osman Berk Şapcı
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, CA 92093, United States
- Eleonora Rachtman
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, CA 92093, United States
- Siavash Mirarab
  - Bioinformatics and Systems Biology Graduate Program, University of California, San Diego, CA 92093, United States
  - Department of Electrical and Computer Engineering, University of California, San Diego, CA 92093, United States
8
Martiny HM, Pyrounakis N, Petersen TN, Lukjančenko O, Aarestrup FM, Clausen PTLC, Munk P. ARGprofiler-a pipeline for large-scale analysis of antimicrobial resistance genes and their flanking regions in metagenomic datasets. Bioinformatics 2024; 40:btae086. [PMID: 38377397 PMCID: PMC10918635 DOI: 10.1093/bioinformatics/btae086] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/10/2023] [Revised: 12/11/2023] [Accepted: 02/19/2024] [Indexed: 02/22/2024] Open
Abstract
MOTIVATION: Analyzing metagenomic data can be highly valuable for understanding the function and distribution of antimicrobial resistance genes (ARGs). However, there is a need for standardized and reproducible workflows to ensure the comparability of studies, as the current options involve various tools and reference databases, each designed with a specific purpose in mind. RESULTS: In this work, we have created the workflow ARGprofiler to process large amounts of raw sequencing reads for studying the composition, distribution, and function of ARGs. ARGprofiler tackles the challenge of deciding which reference database to use by providing the PanRes database of 14 078 unique ARGs that combines several existing collections into one. Our pipeline is designed to not only produce abundance tables of genes and microbes but also to reconstruct the flanking regions of ARGs with ARGextender. ARGextender is a bioinformatic approach combining KMA and SPAdes to recruit reads for a targeted de novo assembly. While our focus is on ARGs, the pipeline also creates Mash sketches for fast searching and comparisons of sequencing runs. AVAILABILITY AND IMPLEMENTATION: The ARGprofiler pipeline is a Snakemake workflow that supports the reuse of metagenomic sequencing data and is easily installable and maintained at https://github.com/genomicepidemiology/ARGprofiler.
Affiliation(s)
- Hannah-Marie Martiny
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
- Nikiforos Pyrounakis
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
- Thomas N Petersen
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
- Oksana Lukjančenko
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
- Frank M Aarestrup
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
- Philip T L C Clausen
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
- Patrick Munk
  - Research Group for Genomic Epidemiology, Technical University of Denmark, Henrik Danms Allé, Bygning 204, Kongens Lyngby 2800, Denmark
9
Koslicki D, White S, Ma C, Novikov A. YACHT: an ANI-based statistical test to detect microbial presence/absence in a metagenomic sample. Bioinformatics 2024; 40:btae047. [PMID: 38268451 PMCID: PMC10868342 DOI: 10.1093/bioinformatics/btae047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 01/05/2024] [Accepted: 01/22/2024] [Indexed: 01/26/2024] Open
Abstract
MOTIVATION: In metagenomics, the study of environmentally associated microbial communities from their sampled DNA, one of the most fundamental computational tasks is that of determining which genomes from a reference database are present or absent in a given sample metagenome. Existing tools generally return point estimates, with no associated confidence or uncertainty. This has led to practitioners experiencing difficulty when interpreting the results from these tools, particularly for low-abundance organisms as these often reside in the "noisy tail" of incorrect predictions. Furthermore, few tools account for the fact that reference databases are often incomplete and rarely, if ever, contain exact replicas of genomes present in an environmentally derived metagenome. RESULTS: We present solutions for these issues by introducing the algorithm YACHT: Yes/No Answers to Community membership via Hypothesis Testing. This approach introduces a statistical framework that accounts for sequence divergence between the reference and sample genomes, in terms of ANI, as well as incomplete sequencing depth, thus providing a hypothesis test for determining the presence or absence of a reference genome in a sample. After introducing our approach, we quantify its statistical power and how this changes with varying parameters. Subsequently, we perform extensive experiments using both simulated and real data to confirm the accuracy and scalability of this approach. AVAILABILITY AND IMPLEMENTATION: The source code implementing this approach is available via Conda and at https://github.com/KoslickiLab/YACHT. We also provide the code for reproducing experiments at https://github.com/KoslickiLab/YACHT-reproducibles.
Affiliation(s)
- David Koslicki
  - Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16802, United States
  - Department of Biology, Pennsylvania State University, State College, PA 16802, United States
  - Huck Institutes of the Life Sciences, Pennsylvania State University, State College, PA 16802, USA
  - One Health Microbiome Center, Pennsylvania State University, State College, PA 16802, United States
- Stephen White
  - Department of Mathematics, Pennsylvania State University, State College, PA 16802, United States
- Chunyu Ma
  - Huck Institutes of the Life Sciences, Pennsylvania State University, State College, PA 16802, USA
- Alexei Novikov
  - Department of Mathematics, Pennsylvania State University, State College, PA 16802, United States
10
Valencia EM, Maki KA, Dootz JN, Barb JJ. Mock community taxonomic classification performance of publicly available shotgun metagenomics pipelines. Sci Data 2024; 11:81. [PMID: 38233447 PMCID: PMC10794705 DOI: 10.1038/s41597-023-02877-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 12/22/2023] [Indexed: 01/19/2024] Open
Abstract
Shotgun metagenomic sequencing comprehensively samples the DNA of a microbial sample. Choosing the best bioinformatics processing package can be daunting due to the wide variety of tools available. Here, we assessed publicly available shotgun metagenomics processing packages/pipelines including bioBakery, Just a Microbiology System (JAMS), Whole metaGenome Sequence Assembly V2 (WGSA2), and Woltka using 19 publicly available mock community samples and a set of five constructed pathogenic gut microbiome samples. Also included is a workflow for labelling bacterial scientific names with NCBI taxonomy identifiers for better resolution in assessing results. The Aitchison distance, a sensitivity metric, and total False Positive Relative Abundance were used for accuracy assessments for all pipelines and mock samples. Overall, bioBakery4 performed the best on most of the accuracy metrics, while JAMS and WGSA2 had the highest sensitivities. Furthermore, bioBakery is commonly used and only requires a basic knowledge of command line usage. This work provides an unbiased assessment of shotgun metagenomics packages and presents results assessing the performance of the packages using mock community sequence data.
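For readers unfamiliar with the Aitchison distance named in this abstract, the sketch below shows its standard definition: the Euclidean distance between the centered log-ratio (CLR) transforms of two compositions. The function names and the pseudocount used to handle zero abundances are assumptions for illustration; the abstract does not say how zeros were treated in the study.

```python
import math


def clr(x, pseudocount=1e-6):
    """Centered log-ratio transform: log of each part minus the mean log."""
    if not x:
        raise ValueError("composition must not be empty")
    # The pseudocount (an assumption here) avoids log(0) for absent taxa.
    logs = [math.log(v + pseudocount) for v in x]
    mean_log = sum(logs) / len(logs)
    return [value - mean_log for value in logs]


def aitchison_distance(x, y, pseudocount=1e-6):
    """Euclidean distance between the CLR transforms of x and y."""
    if len(x) != len(y):
        raise ValueError("compositions must have the same number of parts")
    cx, cy = clr(x, pseudocount), clr(y, pseudocount)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(cx, cy)))


if __name__ == "__main__":
    expected = [0.5, 0.3, 0.2]   # hypothetical mock community profile
    observed = [0.6, 0.3, 0.1]   # hypothetical pipeline output
    print(aitchison_distance(expected, observed))
```

Because the CLR transform subtracts the mean log, the distance is unchanged when either profile is rescaled, which is why it suits relative abundance data that only carries information in the ratios between taxa.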
Affiliation(s)
- E Michael Valencia
  - Translational Biobehavioral and Health Disparities Branch, National Institutes of Health Clinical Center, Bethesda, MD, 20814, USA
- Katherine A Maki
  - Translational Biobehavioral and Health Disparities Branch, National Institutes of Health Clinical Center, Bethesda, MD, 20814, USA
- Jennifer N Dootz
  - Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
- Jennifer J Barb
  - Translational Biobehavioral and Health Disparities Branch, National Institutes of Health Clinical Center, Bethesda, MD, 20814, USA
11
Walsh LH, Coakley M, Walsh AM, O'Toole PW, Cotter PD. Bioinformatic approaches for studying the microbiome of fermented food. Crit Rev Microbiol 2023; 49:693-725. [PMID: 36287644 DOI: 10.1080/1040841x.2022.2132850] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/11/2022] [Revised: 08/11/2022] [Accepted: 09/28/2022] [Indexed: 11/03/2022]
Abstract
High-throughput DNA sequencing-based approaches continue to revolutionise our understanding of microbial ecosystems, including those associated with fermented foods. Metagenomic and metatranscriptomic approaches are state-of-the-art biological profiling methods and are employed to investigate a wide variety of characteristics of microbial communities, such as taxonomic membership, gene content and the range and level at which these genes are expressed. Individual groups and consortia of researchers are utilising these approaches to produce increasingly large and complex datasets, representing vast populations of microorganisms. There is a corresponding requirement for the development and application of appropriate bioinformatic tools and pipelines to interpret this data. This review critically analyses the tools and pipelines that have been used or that could be applied to the analysis of metagenomic and metatranscriptomic data from fermented foods. In addition, we critically analyse a number of studies of fermented foods in which these tools have previously been applied, to highlight the insights that these approaches can provide.
Affiliation(s)
- Liam H Walsh
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
  - School of Microbiology, University College Cork, Ireland
- Mairéad Coakley
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- Aaron M Walsh
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- Paul W O'Toole
  - School of Microbiology, University College Cork, Ireland
  - APC Microbiome Ireland, University College Cork, Ireland
- Paul D Cotter
  - Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
  - APC Microbiome Ireland, University College Cork, Ireland
  - VistaMilk SFI Research Centre, Teagasc, Moorepark, Fermoy, Cork, Ireland
12
Huttenhower C, Finn RD, McHardy AC. Challenges and opportunities in sharing microbiome data and analyses. Nat Microbiol 2023; 8:1960-1970. [PMID: 37783751 DOI: 10.1038/s41564-023-01484-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 10/16/2021] [Accepted: 08/28/2023] [Indexed: 10/04/2023]
Abstract
Microbiome data, metadata and analytical workflows have become 'big' in terms of volume and complexity. Although the infrastructure and technologies to share data have been established, the interdisciplinary and multi-omic nature of the field can make resources difficult to identify and use. Following best practices for data deposition requires substantial effort, with sometimes little obvious reward. Gaps remain where microbiome-specific resources for data sharing or reproducibility do not yet exist. We outline available best practices, challenges to their adoption and opportunities in data sharing in microbiome research. We showcase examples of best practices and advocate for their enforcement and incentivization for data sharing. This includes recognition of data curation and sharing endeavours by individuals, institutions, journals and funders. Opportunities for progress include enabling microbiome-specific databases to incorporate future methods for data analysis, integration and reuse.
Affiliation(s)
- Curtis Huttenhower
- Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Departments of Biostatistics and Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
- Alice Carolyn McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
13
Blanco-Míguez A, Beghini F, Cumbo F, McIver LJ, Thompson KN, Zolfo M, Manghi P, Dubois L, Huang KD, Thomas AM, Nickols WA, Piccinno G, Piperni E, Punčochář M, Valles-Colomer M, Tett A, Giordano F, Davies R, Wolf J, Berry SE, Spector TD, Franzosa EA, Pasolli E, Asnicar F, Huttenhower C, Segata N. Extending and improving metagenomic taxonomic profiling with uncharacterized species using MetaPhlAn 4. Nat Biotechnol 2023; 41:1633-1644. [PMID: 36823356 PMCID: PMC10635831 DOI: 10.1038/s41587-023-01688-w] [Citation(s) in RCA: 412] [Impact Index Per Article: 206.0] [Received: 06/07/2022] [Accepted: 01/20/2023] [Indexed: 02/25/2023]
Abstract
Metagenomic assembly enables new organism discovery from microbial communities, but it can only capture few abundant organisms from most metagenomes. Here we present MetaPhlAn 4, which integrates information from metagenome assemblies and microbial isolate genomes for more comprehensive metagenomic taxonomic profiling. From a curated collection of 1.01 M prokaryotic reference and metagenome-assembled genomes, we define unique marker genes for 26,970 species-level genome bins, 4,992 of them taxonomically unidentified at the species level. MetaPhlAn 4 explains ~20% more reads in most international human gut microbiomes and >40% in less-characterized environments such as the rumen microbiome and proves more accurate than available alternatives on synthetic evaluations while also reliably quantifying organisms with no cultured isolates. Application of the method to >24,500 metagenomes highlights previously undetected species to be strong biomarkers for host conditions and lifestyles in human and mouse microbiomes and shows that even previously uncharacterized species can be genetically profiled at the resolution of single microbial strains.
Affiliation(s)
- Fabio Cumbo
- Department CIBIO, University of Trento, Trento, Italy
- Lauren J McIver
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Kelsey N Thompson
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Moreno Zolfo
- Department CIBIO, University of Trento, Trento, Italy
- Paolo Manghi
- Department CIBIO, University of Trento, Trento, Italy
- Kun D Huang
- Department CIBIO, University of Trento, Trento, Italy
- William A Nickols
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Elisa Piperni
- Department CIBIO, University of Trento, Trento, Italy
- IEO, European Institute of Oncology IRCCS, Milan, Italy
- Adrian Tett
- Department CIBIO, University of Trento, Trento, Italy
- Centre for Microbiology and Environmental Systems Science, University of Vienna, Vienna, Austria
- Sarah E Berry
- Department of Nutritional Sciences, King's College London, London, UK
- Tim D Spector
- Department of Twin Research, King's College London, London, UK
- Eric A Franzosa
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Edoardo Pasolli
- Department of Agricultural Sciences, University of Naples, Naples, Italy
- Curtis Huttenhower
- Harvard T.H. Chan School of Public Health, Boston, MA, USA
- The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy
- IEO, European Institute of Oncology IRCCS, Milan, Italy
14
Yang Y, Deng Y, Liu L, Yin X, Xu X, Wang D, Zhang T. Establishing reference material for the quest towards standardization in environmental microbial metagenomic studies. Water Research 2023; 245:120641. [PMID: 37748344 DOI: 10.1016/j.watres.2023.120641] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/08/2023] [Revised: 09/02/2023] [Accepted: 09/15/2023] [Indexed: 09/27/2023]
Abstract
Breakthroughs in DNA-based technologies, especially in metagenomic sequencing, have drastically enhanced researchers' ability to explore environmental microbiomes and the associated interplays within. However, as new methodologies are being actively developed for improvements in different aspects, metagenomic workflows become diversified and heterogeneous. Through a single-variable control approach, we quantified the microbial profiling variations arising from 6 common technical variables associated with metagenomic workflows for both simple and complex samples. The incurred variations were consistently the lowest in replicates of DNA isolation and DNA sequencing library construction. Different DNA extraction kits often caused the highest variation among all the tested variables. Additionally, sequencing run batch was an important source of variability for targeted platforms. As such, the development of an environmental reference material for complex environmental samples could be beneficial in benchmarking accrued non-biological variability within and between protocols and ensuring reliable and reproducible sequencing outputs immediately upstream of bioinformatic analysis. To develop an environmental reference material, sequencing of a well-homogenized environmental sample composed of activated sludge was performed using different pre-analytical assays in replications. In parallel, a certified mock community was processed and sequenced. Assays were ranked based on the reconstruction of the theoretical mock community profile. The reproducibility of the best-performing assay and the microbial profile of the reference material were further ascertained. We propose the adoption of our complex environmental reference material, which could reflect the degree of diversity in environmental microbiome studies, to facilitate accurate, reproducible, and comparable environmental metagenomics-based studies.
Affiliation(s)
- Yu Yang
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- Yu Deng
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- Lei Liu
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- Xiaole Yin
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- Xiaoqing Xu
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- Dou Wang
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- Tong Zhang
- Department of Civil Engineering, Environmental Microbiome Engineering and Biotechnology Laboratory, Centre for Environmental Engineering Research, The University of Hong Kong, Hong Kong, China
- School of Public Health, Li Ka Shing Faculty of Medicine, The University of Hong Kong, Sassoon Road, Hong Kong SAR, China
- Macau Institute for Applied Research in Medicine and Health, Macau University of Science and Technology, Macau SAR, China
15
Pusadkar V, Azad RK. Benchmarking Metagenomic Classifiers on Simulated Ancient and Modern Metagenomic Data. Microorganisms 2023; 11:2478. [PMID: 37894136 PMCID: PMC10609333 DOI: 10.3390/microorganisms11102478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/18/2023] [Revised: 09/28/2023] [Accepted: 09/29/2023] [Indexed: 10/29/2023] Open
Abstract
Taxonomic profiling of ancient metagenomic samples is challenging due to the accumulation of specific damage patterns on DNA over time. Although a number of methods for metagenome profiling have been developed, most of them have been assessed on modern metagenomes or simulated metagenomes mimicking modern metagenomes. Further, a comparative assessment of metagenome profilers on simulated metagenomes representing a spectrum of degradation depth, from the extremity of ancient (most degraded) to current or modern (not degraded) metagenomes, has not yet been performed. To understand the strengths and weaknesses of different metagenome profilers, we performed their comprehensive evaluation on simulated metagenomes representing human dental calculus microbiome, with the level of DNA damage successively raised to mimic modern to ancient metagenomes. All classes of profilers, namely, DNA-to-DNA, DNA-to-protein, and DNA-to-marker comparison-based profilers were evaluated on metagenomes with varying levels of damage simulating deamination, fragmentation, and contamination. Our results revealed that, compared to deamination and fragmentation, human and environmental contamination of ancient DNA (with modern DNA) has the most pronounced effect on the performance of each profiler. Further, the DNA-to-DNA (e.g., Kraken2, Bracken) and DNA-to-marker (e.g., MetaPhlAn4) based profiling approaches showed complementary strengths, which can be leveraged to elevate the state-of-the-art of ancient metagenome profiling.
Affiliation(s)
- Vaidehi Pusadkar
- Department of Biological Sciences, University of North Texas, Denton, TX 76203, USA
- BioDiscovery Institute, University of North Texas, Denton, TX 76203, USA
- Rajeev K. Azad
- Department of Biological Sciences, University of North Texas, Denton, TX 76203, USA
- BioDiscovery Institute, University of North Texas, Denton, TX 76203, USA
- Department of Mathematics, University of North Texas, Denton, TX 76203, USA
16
Medina-Chávez NO, Viladomat-Jasso M, Zarza E, Islas-Robles A, Valdivia-Anistro J, Thalasso-Siret F, Eguiarte LE, Olmedo-Álvarez G, Souza V, De la Torre-Zavala S. A Transiently Hypersaline Microbial Mat Harbors a Diverse and Stable Archaeal Community in the Cuatro Cienegas Basin, Mexico. Astrobiology 2023; 23:796-811. [PMID: 37279013 DOI: 10.1089/ast.2021.0047] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/07/2023]
Abstract
Microbial mats are biologically diverse communities that are analogs to some of the earliest ecosystems on Earth. In this study, we describe a unique transiently hypersaline microbial mat uncovered in a shallow pond within the Cuatro Cienegas Basin (CCB) in northern México. The CCB is an endemism-rich site that harbors living stromatolites that have been studied to understand the conditions of the Precambrian Earth. These microbial mats form elastic domes filled with biogenic gas, and the mats have a relatively large and stable subpopulation of archaea. For this reason, this site has been termed archaean domes (AD). The AD microbial community was analyzed by metagenomics over three seasons. The mat exhibited a highly diverse prokaryotic community dominated by bacteria. Bacterial sequences are represented in 37 phyla, mainly Proteobacteria, Firmicutes, and Actinobacteria, that together comprised >50% of the sequences from the mat. Archaea represented up to 5% of the retrieved sequences, with up to 230 different archaeal species that belong to 5 phyla (Euryarchaeota, Crenarchaeota, Thaumarchaeota, Korarchaeota, and Nanoarchaeota). The archaeal taxa showed low variation despite fluctuations in water and nutrient availability. In addition, predicted functions highlight stress responses to extreme conditions present in the AD, including salinity, pH, and water/drought fluctuation. The observed complexity of the AD mat thriving in high pH and fluctuating water and salt conditions within the CCB provides an extant model of great value for evolutionary studies, as well as a suitable analog to the early Earth and Mars.
Affiliation(s)
- Nahui-Olin Medina-Chávez
- Ecology, Evolution and Behavior, University of Minnesota, St. Paul, Minnesota, USA
- Universidad Autónoma de Nuevo León, Facultad de Ciencias Biológicas, Instituto de Biotecnología, San Nicolás de los Garza, México
- Eugenia Zarza
- Departamento de Ciencias de la Sustentabilidad, El Colegio de la Frontera Sur, Tapachula, Mexico
- Consejo Nacional de Ciencia y Tecnología, Ciudad de México, México
- Africa Islas-Robles
- Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del I.P.N. Campus Irapuato, Irapuato, México
- Jorge Valdivia-Anistro
- Unidad Multidisciplinaria de Investigación Experimental Zaragoza, Facultad de Estudios Superiores Zaragoza, UNAM, Ciudad de México, México
- Frédéric Thalasso-Siret
- Departamento de Biotecnología y Bioingeniería, Centro de Investigación y de Estudios Avanzados del Instituto Politécnico Nacional, Ciudad de México, Mexico
- Luis E Eguiarte
- Departamento de Ecología Evolutiva, Instituto de Ecología, UNAM, Ciudad de México, México
- Centro de Estudios del Cuaternario de Fuego-Patagonia y Antártica (CEQUA), Punta Arenas, Chile
- Gabriela Olmedo-Álvarez
- Departamento de Ingeniería Genética, Centro de Investigación y de Estudios Avanzados del I.P.N. Campus Irapuato, Irapuato, México
- Valeria Souza
- Departamento de Ecología Evolutiva, Instituto de Ecología, UNAM, Ciudad de México, México
- Centro de Estudios del Cuaternario de Fuego-Patagonia y Antártica (CEQUA), Punta Arenas, Chile
- Susana De la Torre-Zavala
- Universidad Autónoma de Nuevo León, Facultad de Ciencias Biológicas, Instituto de Biotecnología, San Nicolás de los Garza, México
17
Bazant W, Blevins AS, Crouch K, Beiting DP. Improved eukaryotic detection compatible with large-scale automated analysis of metagenomes. Microbiome 2023; 11:72. [PMID: 37032329 PMCID: PMC10084625 DOI: 10.1186/s40168-023-01505-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/07/2022] [Accepted: 02/24/2023] [Indexed: 06/19/2023]
Abstract
BACKGROUND: Eukaryotes such as fungi and protists frequently accompany bacteria and archaea in microbial communities. Unfortunately, their presence is difficult to study with "shotgun" metagenomic sequencing since prokaryotic signals dominate in most environments. Recent methods for eukaryotic detection use eukaryote-specific marker genes, but they do not incorporate strategies to handle the presence of eukaryotes that are not represented in the reference marker gene set, and they are not compatible with web-based tools for downstream analysis. RESULTS: Here, we present CORRAL (for Clustering Of Related Reference ALignments), a tool for the identification of eukaryotes in shotgun metagenomic data based on alignments to eukaryote-specific marker genes and Markov clustering. Using a combination of simulated datasets, mock community standards, and large publicly available human microbiome studies, we demonstrate that our method is not only sensitive and accurate but is also capable of inferring the presence of eukaryotes not included in the marker gene reference, such as novel strains. Finally, we deploy CORRAL on our MicrobiomeDB.org resource, producing an atlas of eukaryotes present in various environments of the human body and linking their presence to study covariates. CONCLUSIONS: CORRAL allows eukaryotic detection to be automated and carried out at scale. Implementation of CORRAL in MicrobiomeDB.org creates a running atlas of microbial eukaryotes in metagenomic studies. Since our approach is independent of the reference used, it may be applicable to other contexts where shotgun metagenomic reads are matched against redundant but non-exhaustive databases, such as the identification of bacterial virulence genes or taxonomic classification of viral reads.
Affiliation(s)
- Wojtek Bazant
- Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK
- Ann S Blevins
- Department of Pathobiology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
- Kathryn Crouch
- Institute of Infection, Immunity and Inflammation, College of Medical, Veterinary and Life Sciences, University of Glasgow, Glasgow, UK
- Daniel P Beiting
- Department of Pathobiology, School of Veterinary Medicine, University of Pennsylvania, Philadelphia, PA, 19104, USA
18
Schackart KE, Graham JB, Ponsero AJ, Hurwitz BL. Evaluation of computational phage detection tools for metagenomic datasets. Front Microbiol 2023; 14:1078760. [PMID: 36760501 PMCID: PMC9902911 DOI: 10.3389/fmicb.2023.1078760] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 10/24/2022] [Accepted: 01/09/2023] [Indexed: 01/25/2023] Open
Abstract
Introduction: As new computational tools for detecting phage in metagenomes are being rapidly developed, a critical need has emerged to develop systematic benchmarks. Methods: In this study, we surveyed 19 metagenomic phage detection tools, 9 of which could be installed and run at scale. Those 9 tools were assessed on several benchmark challenges. Fragmented reference genomes are used to assess the effects of fragment length, low viral content, phage taxonomy, robustness to eukaryotic contamination, and computational resource usage. Simulated metagenomes are used to assess the effects of sequencing and assembly quality on the tool performances. Finally, real human gut metagenomes and viromes are used to assess the differences and similarities in the phage communities predicted by the tools. Results: We find that the various tools yield strikingly different results. Generally, tools that use a homology approach (VirSorter, MARVEL, viralVerify, VIBRANT, and VirSorter2) demonstrate low false positive rates and robustness to eukaryotic contamination. Conversely, tools that use a sequence composition approach (VirFinder, DeepVirFinder, Seeker), and MetaPhinder, have higher sensitivity, including to phages with less representation in reference databases. These differences led to widely differing predicted phage communities in human gut metagenomes, with nearly 80% of contigs being marked as phage by at least one tool and a maximum overlap of 38.8% between any two tools. While the results were more consistent among the tools on viromes, the differences in results were still significant, with a maximum overlap of 60.65%. Discussion: Importantly, the benchmark datasets developed in this study are publicly available and reusable to enable the future comparability of new tools developed.
Affiliation(s)
- Kenneth E. Schackart
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- Jessica B. Graham
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
- Alise J. Ponsero
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
- Human Microbiome Research Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
- Bonnie L. Hurwitz
- Department of Biosystems Engineering, The University of Arizona, Tucson, AZ, United States
- BIO5 Institute, The University of Arizona, Tucson, AZ, United States
19
Player RA, Aguinaldo AM, Merritt BB, Maszkiewicz LN, Adeyemo OE, Forsyth ER, Verratti KJ, Chee BW, Grady SL, Bradburne CE. The META tool optimizes metagenomic analyses across sequencing platforms and classifiers. Frontiers in Bioinformatics 2023; 2:969247. [PMID: 36685333 PMCID: PMC9852826 DOI: 10.3389/fbinf.2022.969247] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/14/2022] [Accepted: 12/14/2022] [Indexed: 01/09/2023] Open
Abstract
A major challenge in the field of metagenomics is the selection of the correct combination of sequencing platform and downstream metagenomic analysis algorithm, or "classifier". Here, we present the Metagenomic Evaluation Tool Analyzer (META), which produces simulated data and facilitates platform and algorithm selection for any given metagenomic use case. META-generated in silico read data are modular, scalable, and reflect user-defined community profiles, while the downstream analysis is done using a variety of metagenomic classifiers. Reported results include information on resource utilization, time-to-answer, and performance. Real-world data can also be analyzed using selected classifiers and results benchmarked against simulations. To test the utility of the META software, simulated data was compared to real-world viral and bacterial metagenomic samples run on four different sequencers and analyzed using 12 metagenomic classifiers. Lastly, we introduce "META Score": a unified, quantitative value which rates an analytic classifier's ability to both identify and count taxa in a representative sample.
Affiliation(s)
- Robert A. Player
- Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Brian B. Merritt
- Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Lisa N. Maszkiewicz
- Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Ellen R. Forsyth
- Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Brant W. Chee
- Division of General Internal Medicine, Johns Hopkins School of Medicine, Baltimore, MD, United States
- Armstrong Institute for Patient Safety and Quality, Johns Hopkins School of Medicine, Baltimore, MD, United States
- Sarah L. Grady
- Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- Christopher E. Bradburne (corresponding author)
- Applied Physics Laboratory, Johns Hopkins University, Laurel, MD, United States
- McKusick-Nathans Department of Genetic Medicine, Johns Hopkins School of Medicine, Baltimore, MD, United States
20
Shen W, Xiang H, Huang T, Tang H, Peng M, Cai D, Hu P, Ren H. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics 2023; 39:btac845. [PMID: 36579886 PMCID: PMC9828150 DOI: 10.1093/bioinformatics/btac845] [Citation(s) in RCA: 23] [Impact Index Per Article: 11.5] [Received: 04/11/2022] [Revised: 12/17/2022] [Accepted: 12/28/2022] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION: The growing number of microbial reference genomes enables the improvement of metagenomic profiling accuracy but also imposes greater requirements on the indexing efficiency, database size and runtime of taxonomic profilers. Additionally, most profilers focus mainly on bacterial, archaeal and fungal populations, while less attention is paid to viral communities. RESULTS: We present KMCP (K-mer-based Metagenomic Classification and Profiling), a novel k-mer-based metagenomic profiling tool that utilizes genome coverage information by splitting the reference genomes into chunks and stores k-mers in a modified and optimized Compact Bit-Sliced Signature Index for fast alignment-free sequence searching. KMCP combines k-mer similarity and genome coverage information to reduce the false positive rate of k-mer-based taxonomic classification and profiling methods. Benchmarking results based on simulated and real data demonstrate that KMCP, despite a longer running time than all other methods, not only allows the accurate taxonomic profiling of prokaryotic and viral populations but also provides more confident pathogen detection in clinical samples of low depth. AVAILABILITY AND IMPLEMENTATION: The software is open-source under the MIT license and available at https://github.com/shenwei356/kmcp. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
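The idea of pairing k-mer similarity with coverage over genome chunks can be illustrated with a toy sketch. This is not KMCP's implementation: the function names, the tiny sequences, the exact-set k-mer lookup (in place of a signature index), and both thresholds are assumptions chosen only to show why reads piling up on one region of a reference should count for less than reads spread across it.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of k-mers in a sequence (empty if the sequence is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def split_into_chunks(genome: str, n_chunks: int) -> list[str]:
    """Split a reference into roughly equal, non-overlapping chunks.

    k-mers spanning a chunk boundary are ignored in this toy version.
    """
    size = -(-len(genome) // n_chunks)  # ceiling division
    return [genome[i:i + size] for i in range(0, len(genome), size)]


def chunk_fraction(reads: list[str], genome: str, k: int = 4,
                   n_chunks: int = 4, min_similarity: float = 0.8) -> float:
    """Fraction of reference chunks matched by at least one read.

    A read matches a chunk when the share of its k-mers found in that chunk
    reaches min_similarity (a hypothetical threshold).
    """
    chunk_sets = [kmers(chunk, k) for chunk in split_into_chunks(genome, n_chunks)]
    hit = [False] * len(chunk_sets)
    for read in reads:
        read_kmers = kmers(read, k)
        if not read_kmers:
            continue  # read shorter than k
        for idx, chunk_kmers in enumerate(chunk_sets):
            similarity = len(read_kmers & chunk_kmers) / len(read_kmers)
            if similarity >= min_similarity:
                hit[idx] = True
    return sum(hit) / len(hit)


def call_present(fraction: float, min_chunk_fraction: float = 0.75) -> bool:
    """Report a reference as present only if enough of its chunks are matched."""
    return fraction >= min_chunk_fraction


# Hypothetical 40-base reference made of four distinct 10-base regions.
genome = "AAAAACCCCC" + "GGGGGTTTTT" + "ACGTACGTAC" + "TGCATGCATG"

# Reads spread over all four regions, as expected for a genome that is really there.
reads_spread = ["AAACCC", "GGGTTT", "CGTACG", "GCATGC"]
# Many reads, but all from one region (e.g. a sequence shared with another organism).
reads_piled = ["AAACCC"] * 10

coverage_true = chunk_fraction(reads_spread, genome)
coverage_spurious = chunk_fraction(reads_piled, genome)
print(coverage_true, call_present(coverage_true))
print(coverage_spurious, call_present(coverage_spurious))
```

In this sketch the piled-up reads have perfect k-mer similarity to the reference yet touch only one chunk, so a coverage requirement rejects them where a similarity threshold alone would not.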
Affiliation(s)
- Wei Shen
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Hongyan Xiang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Tianquan Huang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Hui Tang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Mingli Peng
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Dachuan Cai
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Peng Hu
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
- Hong Ren
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
21
Sarwal V, Brito J, Mangul S, Koslicki D. TAMPA: interpretable analysis and visualization of metagenomics-based taxon abundance profiles. Gigascience 2022; 12:giad008. [PMID: 36852763 PMCID: PMC9972184 DOI: 10.1093/gigascience/giad008] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/09/2022] [Revised: 11/12/2022] [Accepted: 02/02/2023] [Indexed: 03/01/2023] Open
Abstract
BACKGROUND: Metagenomic taxonomic profiling aims to predict the identity and relative abundance of taxa in a given whole-genome sequencing metagenomic sample. A recent surge in computational methods that aim to accurately estimate taxonomic profiles, called taxonomic profilers, has motivated community-driven efforts to create standardized benchmarking datasets and platforms, standardized taxonomic profile formats, and a benchmarking platform to assess tool performance. While this standardization is essential, there is currently a lack of tools to visualize the standardized output of the many existing taxonomic profilers. Thus, benchmarking studies rely on single-value metrics to compare the performance of tools and to compare against benchmarking datasets. This is one of the major problems in analyzing metagenomic profiling data, since single metrics, such as the F1 score, fail to capture the biological differences between the datasets. FINDINGS: Here we report the development of TAMPA (Taxonomic metagenome profiling evaluation), a robust and easy-to-use method that allows scientists to easily interpret and interact with taxonomic profiles produced by the many different taxonomic profiler methods beyond the standard metrics used by the scientific community. We demonstrate the unique ability of TAMPA to generate a novel biological hypothesis by highlighting the taxonomic differences between samples otherwise missed by commonly utilized metrics. CONCLUSION: In this study, we show that TAMPA can help visualize the output of taxonomic profilers, enabling biologists to effectively choose the most appropriate profiling method to use on their metagenomics data. TAMPA is available on GitHub, Bioconda, and Galaxy Toolshed at https://github.com/dkoslicki/TAMPA and is released under the MIT license.
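The point that a single metric such as the F1 score can hide biological differences is easy to see in a small presence/absence example. The sketch below is not taken from TAMPA; the taxon names and both predicted profiles are hypothetical, and the F1 score is the standard harmonic mean of precision and recall computed on taxon sets.

```python
def f1_score(truth: set[str], predicted: set[str]) -> float:
    """Presence/absence F1 between a ground-truth and a predicted taxon set."""
    true_positives = len(truth & predicted)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(predicted)
    recall = true_positives / len(truth)
    return 2 * precision * recall / (precision + recall)


# Hypothetical genus-level ground truth and two hypothetical profiler outputs.
truth = {"Escherichia", "Bacteroides", "Prevotella", "Faecalibacterium"}
profile_a = {"Escherichia", "Bacteroides", "Prevotella", "Clostridium"}
profile_b = {"Bacteroides", "Prevotella", "Faecalibacterium", "Streptococcus"}

f1_a = f1_score(truth, profile_a)
f1_b = f1_score(truth, profile_b)

# The scores are identical, but the errors concern different organisms.
missed_a = truth - profile_a
missed_b = truth - profile_b
spurious_a = profile_a - truth
spurious_b = profile_b - truth

print(f1_a, f1_b)
print("A missed", missed_a, "and added", spurious_a)
print("B missed", missed_b, "and added", spurious_b)
```

Both profiles get the same score, yet one misses Faecalibacterium and the other misses Escherichia, a distinction that only appears when the taxa themselves are inspected.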
Affiliation(s)
- Varuni Sarwal
- Department of Computer Science, University of California–Los Angeles, Los Angeles, CA 90095, USA
- Jaqueline Brito
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Serghei Mangul
- Titus Family Department of Clinical Pharmacy, USC Alfred E. Mann School of Pharmacy and Pharmaceutical Sciences, University of Southern California, Los Angeles, CA 90089, USA
- Department of Quantitative and Computational Biology, USC Dornsife College of Letters, Arts and Sciences, University of Southern California, Los Angeles, CA 90089, USA
- David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA
- Department of Biology, The Pennsylvania State University, University Park, PA 16802, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA 16802, USA
22
Portik DM, Brown CT, Pierce-Ward NT. Evaluation of taxonomic classification and profiling methods for long-read shotgun metagenomic sequencing datasets. BMC Bioinformatics 2022; 23:541. [PMID: 36513983 PMCID: PMC9749362 DOI: 10.1186/s12859-022-05103-0] [Citation(s) in RCA: 44] [Impact Index Per Article: 14.7] [Received: 03/04/2022] [Accepted: 12/07/2022] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Long-read shotgun metagenomic sequencing is gaining in popularity and offers many advantages over short-read sequencing. The higher information content in long reads is useful for a variety of metagenomics analyses, including taxonomic classification and profiling. The development of long-read specific tools for taxonomic classification is accelerating, yet there is a lack of information regarding their relative performance. Here, we perform a critical benchmarking study using 11 methods, including five methods designed specifically for long reads. We applied these tools to several mock community datasets generated using Pacific Biosciences (PacBio) HiFi or Oxford Nanopore Technology sequencing, and evaluated their performance based on read utilization, detection metrics, and relative abundance estimates. RESULTS Our results show that long-read classifiers generally performed best. Several short-read classification and profiling methods produced many false positives (particularly at lower abundances), required heavy filtering to achieve acceptable precision (at the cost of reduced recall), and produced inaccurate abundance estimates. By contrast, two long-read methods (BugSeq, MEGAN-LR & DIAMOND) and one generalized method (sourmash) displayed high precision and recall without any filtering required. Furthermore, in the PacBio HiFi datasets these methods detected all species down to the 0.1% abundance level with high precision. Some long-read methods, such as MetaMaps and MMseqs2, required moderate filtering to reduce false positives to resemble the precision and recall of the top-performing methods. We found read quality affected performance for methods relying on protein prediction or exact k-mer matching, and these methods performed better with PacBio HiFi datasets. 
We also found that long-read datasets with a large proportion of shorter reads (< 2 kb length) resulted in lower precision and worse abundance estimates, relative to length-filtered datasets. Finally, for classification methods, we found that the long-read datasets produced significantly better results than short-read datasets, demonstrating clear advantages for long-read metagenomic sequencing. CONCLUSIONS Our critical assessment of available methods provides best-practice recommendations for current research using long reads and establishes a baseline for future benchmarking studies.
Affiliation(s)
- Daniel M. Portik
- Pacific Biosciences, 1305 O’Brien Dr, Menlo Park, CA 93025 USA (GRID: grid.423340.2; ISNI: 0000 0004 0640 9878)
- C. Titus Brown
- Department of Population Health and Reproduction, University of California Davis, Davis, CA USA (GRID: grid.27860.3b; ISNI: 0000 0004 1936 9684)
- N. Tessa Pierce-Ward
- Department of Population Health and Reproduction, University of California Davis, Davis, CA USA (GRID: grid.27860.3b; ISNI: 0000 0004 1936 9684)
23
Ruscheweyh HJ, Milanese A, Paoli L, Karcher N, Clayssen Q, Keller MI, Wirbel J, Bork P, Mende DR, Zeller G, Sunagawa S. Cultivation-independent genomes greatly expand taxonomic-profiling capabilities of mOTUs across various environments. Microbiome 2022; 10:212. [PMID: 36464731 PMCID: PMC9721005 DOI: 10.1186/s40168-022-01410-z] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 02/17/2022] [Accepted: 11/03/2022] [Indexed: 06/17/2023]
Abstract
BACKGROUND Taxonomic profiling is a fundamental task in microbiome research that aims to detect and quantify the relative abundance of microorganisms in biological samples. Available methods using shotgun metagenomic data generally depend on the deposition of sequenced and taxonomically annotated genomes, usually from cultures of isolated strains, in reference databases (reference genomes). However, the majority of microorganisms have not been cultured yet. Thus, a substantial fraction of microbial community members remains unaccounted for during taxonomic profiling, particularly in samples from underexplored environments. To address this issue, we developed the mOTU profiler, a tool that enables reference genome-independent species-level profiling of metagenomes. As such, it supports the identification and quantification of both "known" and "unknown" species based on a set of select marker genes. RESULTS We present mOTUs3, a command line tool that enables the profiling of metagenomes for >33,000 species-level operational taxonomic units. To achieve this, we leveraged the reconstruction of >600,000 draft genomes, most of which are metagenome-assembled genomes (MAGs), from diverse microbiomes, including soil, freshwater systems, and the gastrointestinal tract of ruminants and other animals, which we found to be underrepresented by reference genomes. Overall, two thirds of all species-level taxa lacked a reference genome. The cumulative relative abundance of these newly included taxa was low in well-studied microbiomes, such as the human body sites (6-11%). By contrast, they accounted for substantial proportions (ocean, freshwater, soil: 43-63%) or even the majority (pig, fish, cattle: 60-80%) of the relative abundance across diverse non-human-associated microbiomes. Using community-developed benchmarks and datasets, we found mOTUs3 to be more accurate than other methods and to be more congruent with 16S rRNA gene-based methods for taxonomic profiling. 
Furthermore, we demonstrate that mOTUs3 increases the resolution of well-known microbial groups into species-level taxa and helps identify new differentially abundant taxa in comparative metagenomic studies. CONCLUSIONS We developed mOTUs3 to enable accurate species-level profiling of metagenomes. Compared to other methods, it provides a more comprehensive view of prokaryotic community diversity, in particular for currently underexplored microbiomes. To facilitate comparative analyses by the research community, it is released with >11,000 precomputed profiles for publicly available metagenomes and is freely available at https://github.com/motu-tool/mOTUs.
Affiliation(s)
- Hans-Joachim Ruscheweyh
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Alessio Milanese
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Lucas Paoli
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Nicolai Karcher
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Quentin Clayssen
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Marisa Isabell Keller
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Jakob Wirbel
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Peer Bork
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Max Delbrück Centre for Molecular Medicine, Robert-Rössle-Str. 10, 13092 Berlin, Germany
- Department of Bioinformatics, Biocenter, University of Würzburg, Am Hubland, 97074 Würzburg, Germany
- Daniel R. Mende
- Department of Medical Microbiology, Amsterdam UMC, University of Amsterdam, Amsterdam, The Netherlands
- Georg Zeller
- Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
- Shinichi Sunagawa
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
24
Govender KN, Eyre DW. Benchmarking taxonomic classifiers with Illumina and Nanopore sequence data for clinical metagenomic diagnostic applications. Microb Genom 2022; 8. [PMID: 36269282 PMCID: PMC9676057 DOI: 10.1099/mgen.0.000886] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Indexed: 11/30/2022] Open
Abstract
Culture-independent metagenomic detection of microbial species has the potential to provide rapid and precise real-time diagnostic results. However, it is potentially limited by sequencing and taxonomic classification errors. We use simulated and real-world data to benchmark rates of species misclassification using 100 reference genomes for each of ten common bloodstream pathogens and six frequent blood-culture contaminants (n=1568; only 68 genomes were available for Micrococcus luteus). Simulating both with and without sequencing error for both the Illumina and Oxford Nanopore platforms, we evaluated commonly used classification tools including Kraken2, Bracken and Centrifuge, utilizing mini (8 GB) and standard (30–50 GB) databases. Bracken with the standard database performed best: the median percentage of reads across both sequencing platforms identified correctly to the species level was 97.8% (IQR 92.7:99.0) [range 5:100]. For Kraken2 with a mini database, a commonly used combination, median species-level identification was 86.4% (IQR 50.5:93.7) [range 4.3:100]. Classification performance varied by species, with Escherichia coli being more challenging to classify correctly (probability of reads being assigned to the correct species: 56.1–96.0%, varying by tool used). Human read misclassification was negligible. By filtering out shorter Nanopore reads we found performance similar or superior to Illumina sequencing, despite higher sequencing error rates. Misclassification was more common when the misclassified species had a higher average nucleotide identity to the true species. Our findings highlight that taxonomic misclassification of sequencing data occurs and varies by sequencing and analysis workflow. To account for ‘bioinformatic contamination’ we present a contamination catalogue that can be used in metagenomic pipelines to ensure accurate results that can support clinical decision making.
Affiliation(s)
- Kumeren N Govender
- Nuffield Department of Medicine, John Radcliffe Hospital, University of Oxford, Oxford, UK
- David W Eyre
- Nuffield Department of Medicine, John Radcliffe Hospital, University of Oxford, Oxford, UK; Big Data Institute, Nuffield Department of Population Health, University of Oxford, Oxford, UK
25
Crowdsourced benchmarking of taxonomic metagenome profilers: lessons learned from the sbv IMPROVER Microbiomics challenge. BMC Genomics 2022; 23:624. [PMID: 36042406 PMCID: PMC9429340 DOI: 10.1186/s12864-022-08803-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 02/16/2022] [Accepted: 07/25/2022] [Indexed: 11/10/2022] Open
Abstract
Background Selection of optimal computational strategies for analyzing metagenomics data is a decisive step in determining the microbial composition of a sample, and this procedure is complex because of the numerous tools currently available. The aim of this research was to summarize the results of the crowdsourced sbv IMPROVER Microbiomics Challenge, designed to evaluate the performance of off-the-shelf metagenomics software, and to investigate the robustness of these results through an extended post-challenge analysis. In total, 21 off-the-shelf taxonomic metagenome profiling pipelines were benchmarked for their capacity to identify the microbiome composition at various taxon levels across 104 shotgun metagenomics datasets of bacterial genomes (representative of various microbiome samples) from public databases. Performance was determined by comparing predicted taxonomy profiles with the gold standard. Results Most taxonomic profilers performed homogeneously well at the phylum level but generated intermediate and heterogeneous scores at the genus and species levels, respectively. kmer-based pipelines using Kraken with and without Bracken or using CLARK-S performed best overall, but they exhibited lower precision than the two marker-gene-based methods MetaPhlAn and mOTU. Filtering out the 1% least abundant species, which were not reliably predicted, helped increase the performance of most profilers by increasing precision but at the cost of recall. However, the use of adaptive filtering thresholds determined from the sample’s Shannon index increased the performance of most kmer-based profilers while mitigating the tradeoff between precision and recall. Conclusions kmer-based metagenomic pipelines using Kraken/Bracken or CLARK-S performed most robustly across a large variety of microbiome datasets. Removing non-reliably predicted low-abundance species by using diversity-dependent adaptive filtering thresholds further enhanced the performance of these tools. This work demonstrates the applicability of computational pipelines for accurately determining taxonomic profiles in clinical and environmental contexts and exemplifies the power of crowdsourcing for unbiased evaluation.
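The precision/recall tradeoff described in this abstract can be illustrated with a minimal sketch. It is not the challenge's scoring code: the function names, the toy profile, and the reading of "1% least abundant" as a fixed relative-abundance cutoff are all assumptions made for illustration, and the abstract does not state how the Shannon index is mapped to an adaptive threshold, so only the index itself is computed here.

```python
import math

def precision_recall(predicted, truth):
    """Presence/absence precision and recall of predicted taxa vs. a gold standard."""
    predicted, truth = set(predicted), set(truth)
    true_pos = len(predicted & truth)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall

def filter_low_abundance(profile, min_fraction):
    """Keep taxa whose share of the total abundance is at least min_fraction (hypothetical cutoff)."""
    total = sum(profile.values())
    return {taxon: a for taxon, a in profile.items() if a / total >= min_fraction}

def shannon_index(profile):
    """Shannon diversity H = -sum(p * ln p) over taxa with non-zero abundance."""
    total = sum(profile.values())
    return -sum((a / total) * math.log(a / total) for a in profile.values() if a > 0)

# Toy data, invented for illustration: "X" is a false positive, "D" a rare true taxon.
truth = {"A", "B", "C", "D"}
predicted = {"A": 50.0, "B": 30.0, "C": 19.0, "X": 0.6, "D": 0.4}

raw = precision_recall(predicted, truth)
filtered = precision_recall(filter_low_abundance(predicted, 0.01), truth)
diversity = shannon_index(predicted)
```

On this toy profile the fixed cutoff removes the false positive "X" but also the rare true taxon "D", so precision rises while recall falls, which is the tradeoff that a diversity-dependent threshold is meant to soften.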
26
Czech L, Stamatakis A, Dunthorn M, Barbera P. Metagenomic Analysis Using Phylogenetic Placement-A Review of the First Decade. Frontiers in Bioinformatics 2022; 2:871393. [PMID: 36304302 PMCID: PMC9580882 DOI: 10.3389/fbinf.2022.871393] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Received: 02/08/2022] [Accepted: 04/11/2022] [Indexed: 12/20/2022] Open
Abstract
Phylogenetic placement refers to a family of tools and methods to analyze, visualize, and interpret the tsunami of metagenomic sequencing data generated by high-throughput sequencing. Compared to alternative (e.g., similarity-based) methods, it puts metabarcoding sequences into a phylogenetic context using a set of known reference sequences and taking evolutionary history into account. Thereby, one can increase the accuracy of metagenomic surveys and eliminate the requirement for having exact or close matches with existing sequence databases. Phylogenetic placement constitutes a valuable analysis tool per se, but also entails a plethora of downstream tools to interpret its results. A common use case is to analyze species communities obtained from metagenomic sequencing, for example via taxonomic assignment, diversity quantification, sample comparison, and identification of correlations with environmental variables. In this review, we provide an overview of the methods developed during the first 10 years. In particular, the goals of this review are 1) to motivate the usage of phylogenetic placement and illustrate some of its use cases, 2) to outline the full workflow, from raw sequences to publishable figures, including best practices, 3) to introduce the most common tools and methods and their capabilities, 4) to point out common placement pitfalls and misconceptions, and 5) to showcase typical placement-based analyses and how they can help to analyze, visualize, and interpret phylogenetic placement data.
Affiliation(s)
- Lucas Czech
- Department of Plant Biology, Carnegie Institution for Science, Stanford, CA, United States
- Alexandros Stamatakis
- Computational Molecular Evolution Group, Heidelberg Institute for Theoretical Studies, Heidelberg, Germany
- Institute for Theoretical Informatics, Karlsruhe Institute of Technology, Karlsruhe, Germany
- Micah Dunthorn
- Natural History Museum, University of Oslo, Oslo, Norway
27
Meyer F, Fritz A, Deng ZL, Koslicki D, Lesker TR, Gurevich A, Robertson G, Alser M, Antipov D, Beghini F, Bertrand D, Brito JJ, Brown CT, Buchmann J, Buluç A, Chen B, Chikhi R, Clausen PTLC, Cristian A, Dabrowski PW, Darling AE, Egan R, Eskin E, Georganas E, Goltsman E, Gray MA, Hansen LH, Hofmeyr S, Huang P, Irber L, Jia H, Jørgensen TS, Kieser SD, Klemetsen T, Kola A, Kolmogorov M, Korobeynikov A, Kwan J, LaPierre N, Lemaitre C, Li C, Limasset A, Malcher-Miranda F, Mangul S, Marcelino VR, Marchet C, Marijon P, Meleshko D, Mende DR, Milanese A, Nagarajan N, Nissen J, Nurk S, Oliker L, Paoli L, Peterlongo P, Piro VC, Porter JS, Rasmussen S, Rees ER, Reinert K, Renard B, Robertsen EM, Rosen GL, Ruscheweyh HJ, Sarwal V, Segata N, Seiler E, Shi L, Sun F, Sunagawa S, Sørensen SJ, Thomas A, Tong C, Trajkovski M, Tremblay J, Uritskiy G, Vicedomini R, Wang Z, Wang Z, Wang Z, Warren A, Willassen NP, Yelick K, You R, Zeller G, Zhao Z, Zhu S, Zhu J, Garrido-Oter R, Gastmeier P, Hacquard S, Häußler S, Khaledi A, Maechler F, Mesny F, Radutoiu S, Schulze-Lefert P, Smit N, Strowig T, Bremges A, Sczyrba A, McHardy AC. Critical Assessment of Metagenome Interpretation: the second round of challenges. Nat Methods 2022; 19:429-440. [PMID: 35396482 PMCID: PMC9007738 DOI: 10.1038/s41592-022-01431-4] [Citation(s) in RCA: 155] [Impact Index Per Article: 51.7] [Received: 07/22/2021] [Accepted: 02/14/2022] [Indexed: 12/20/2022]
Abstract
Evaluating metagenomic software is key for optimizing metagenome interpretation and is the focus of the Initiative for the Critical Assessment of Metagenome Interpretation (CAMI). The CAMI II challenge engaged the community to assess methods on realistic and complex datasets with long- and short-read sequences, created computationally from around 1,700 new and known genomes, as well as 600 new plasmids and viruses. Here we analyze 5,002 results by 76 program versions. Substantial improvements were seen in assembly, some due to long-read data. Related strains still were challenging for assembly and genome recovery through binning, as was assembly quality for the latter. Profilers markedly matured, with taxon profilers and binners excelling at higher bacterial ranks, but underperforming for viruses and Archaea. Clinical pathogen detection results revealed a need to improve reproducibility. Runtime and memory usage analyses identified efficient programs, including top performers with other metrics. The results identify challenges and guide researchers in selecting methods for analyses. This study presents the results of the second round of the Critical Assessment of Metagenome Interpretation challenges (CAMI II), which is a community-driven effort for comprehensively benchmarking tools for metagenomics data analysis.
Affiliation(s)
- Fernando Meyer
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Adrian Fritz
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany; German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany
- Zhi-Luo Deng
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany; Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
- Till Robin Lesker
- German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany; Helmholtz Centre for Infection Research, Braunschweig, Germany
- Gary Robertson
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany
- Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
- Dmitry Antipov
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Jan Buchmann
- Institute for Biological Data Science, Heinrich-Heine-University, Düsseldorf, Germany
- Aydin Buluç
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Bo Chen
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Philip T L C Clausen
- National Food Institute, Division of Global Surveillance, Technical University of Denmark, Lyngby, Denmark
- Alexandru Cristian
- Drexel University, Philadelphia, PA, USA; Google Inc., Philadelphia, PA, USA
- Piotr Wojciech Dabrowski
- Robert Koch-Institut, Berlin, Germany; Hochschule für Technik und Wirtschaft Berlin, Berlin, Germany
- Rob Egan
- DOE Joint Genome Institute, Berkeley, CA, USA; Lawrence Berkeley National Laboratories, Berkeley, CA, USA
- Eleazar Eskin
- University of California, Los Angeles, Los Angeles, CA, USA
- Eugene Goltsman
- DOE Joint Genome Institute, Berkeley, CA, USA; Lawrence Berkeley National Laboratories, Berkeley, CA, USA
- Melissa A Gray
- Drexel University, Philadelphia, PA, USA; Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Philadelphia, PA, USA
- Lars Hestbjerg Hansen
- University of Copenhagen, Department of Plant and Environmental Science, Frederiksberg, Denmark
- Steven Hofmeyr
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Pingqin Huang
- School of Computer Science, Fudan University, Shanghai, China
- Luiz Irber
- University of California, Davis, Davis, CA, USA
- Huijue Jia
- BGI-Shenzhen, Shenzhen, China; Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI-Shenzhen, Shenzhen, China
- Tue Sparholt Jørgensen
- Technical University of Denmark, Novo Nordisk Foundation Center for Biosustainability, Lyngby, Denmark; Aarhus University, Department of Environmental Science, Roskilde, Denmark
- Silas D Kieser
- Department of Cell Physiology and Metabolism, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Swiss Institute of Bioinformatics, Geneva, Switzerland
- Axel Kola
- Charité-Universitätsmedizin Berlin, Berlin, Germany
- Mikhail Kolmogorov
- Department of Computer Science and Engineering, University of California San Diego, San Diego, CA, USA
- Anton Korobeynikov
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia; Department of Statistical Modelling, Saint Petersburg State University, Saint Petersburg, Russia
- Jason Kwan
- University of Wisconsin-Madison, Madison, WI, USA
- Chenhao Li
- Genome Institute of Singapore, Singapore, Singapore
- Fabio Malcher-Miranda
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Vanessa R Marcelino
- Sydney Medical School, The University of Sydney, Sydney, Australia; Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, Australia
- Pierre Marijon
- Department of Computer Science, Inria, University of Lille, CNRS, Lille, France
- Dmitry Meleshko
- Center for Algorithmic Biotechnology, Saint Petersburg State University, Saint Petersburg, Russia
- Daniel R Mende
- Amsterdam University Medical Center, Amsterdam, the Netherlands
- Alessio Milanese
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland; Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
- Niranjan Nagarajan
- Genome Institute of Singapore, A*STAR, Singapore, Singapore; National University of Singapore, Singapore, Singapore
- Sergey Nurk
- Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Leonid Oliker
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Lucas Paoli
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
- Vitor C Piro
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany
- Simon Rasmussen
- Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark
- Evan R Rees
- University of Wisconsin-Madison, Madison, WI, USA
- Knut Reinert
- Institute for Bioinformatics, FU Berlin, Berlin, Germany
- Bernhard Renard
- Hasso Plattner Institute, Digital Engineering Faculty, University of Potsdam, Potsdam, Germany; Bioinformatics Unit (MF1), Robert Koch Institute, Berlin, Germany
- Gail L Rosen
- Drexel University, Philadelphia, PA, USA; Ecological and Evolutionary Signal-Processing and Informatics Laboratory, Philadelphia, PA, USA; Center for Biological Discovery from Big Data, Philadelphia, PA, USA
- Hans-Joachim Ruscheweyh
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
- Varuni Sarwal
- University of California, Los Angeles, Los Angeles, CA, USA
- Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy
- Enrico Seiler
- Institute for Bioinformatics, FU Berlin, Berlin, Germany
- Lizhen Shi
- Florida Polytechnic University, Lakeland, FL, USA
- Fengzhu Sun
- Quantitative and Computational Biology Department, University of Southern California, Los Angeles, CA, USA
- Shinichi Sunagawa
- Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, Zürich, Switzerland
- Ashleigh Thomas
- DOE Joint Genome Institute, Berkeley, CA, USA; University of British Columbia, Vancouver, British Columbia, Canada
- Mirko Trajkovski
- Department of Cell Physiology and Metabolism, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Diabetes Center, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Julien Tremblay
- Energy, Mining and Environment, National Research Council Canada, Montreal, Quebec, Canada
- Zhengyang Wang
- School of Computer Science, Fudan University, Shanghai, China
- Ziye Wang
- School of Mathematical Sciences, Fudan University, Shanghai, China
- Zhong Wang
- Department of Energy Joint Genome Institute, Berkeley, CA, USA; Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA; School of Natural Sciences, University of California at Merced, Merced, CA, USA
- Katherine Yelick
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA; University of California, Berkeley, Berkeley, CA, USA
- Ronghui You
- School of Computer Science, Fudan University, Shanghai, China
- Georg Zeller
- Structural and Computational Biology Unit, EMBL, Heidelberg, Germany
- Shanfeng Zhu
- Institute of Science and Technology for Brain-Inspired Intelligence, Fudan University, Shanghai, China; Key Laboratory of Computational Neuroscience and Brain-Inspired Intelligence (Fudan University), Ministry of Education, Shanghai, China
- Jie Zhu
- BGI-Shenzhen, Shenzhen, China; Shenzhen Key Laboratory of Human Commensal Microorganisms and Health Research, BGI-Shenzhen, Shenzhen, China
- Susanne Häußler
- Helmholtz Centre for Infection Research, Braunschweig, Germany
- Ariane Khaledi
- Helmholtz Centre for Infection Research, Braunschweig, Germany
- Fantin Mesny
- Max Planck Institute for Plant Breeding Research, Köln, Germany
- Nathiana Smit
- Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till Strowig
- Helmholtz Centre for Infection Research, Braunschweig, Germany
- Andreas Bremges
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany
- Alexander Sczyrba
- Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, Germany
- Alice Carolyn McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany; German Center for Infection Research (DZIF), Hannover-Braunschweig Site, Braunschweig, Germany; Cluster of Excellence RESIST (EXC 2155), Hannover Medical School, Hannover, Germany
28
Performance of Five Metagenomic Classifiers for Virus Pathogen Detection Using Respiratory Samples from a Clinical Cohort. Pathogens 2022; 11:340. [PMID: 35335664 PMCID: PMC8953373 DOI: 10.3390/pathogens11030340] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/21/2022] [Revised: 02/21/2022] [Accepted: 03/07/2022] [Indexed: 01/10/2023] Open
Abstract
Viral metagenomics is increasingly applied in clinical diagnostic settings for detection of pathogenic viruses. While several benchmarking studies have been published on the use of metagenomic classifiers for abundance and diversity profiling of bacterial populations, studies on the comparative performance of the classifiers for virus pathogen detection are scarce. In this study, metagenomic data sets (n = 88) from a clinical cohort of patients with respiratory complaints were used for comparison of the performance of five taxonomic classifiers: Centrifuge, Clark, Kaiju, Kraken2, and Genome Detective. A total of 1144 positive and negative PCR results for a total of 13 respiratory viruses were used as gold standard. Sensitivity and specificity of these classifiers ranged from 83 to 100% and 90 to 99%, respectively, and were dependent on the classification level and data pre-processing. Exclusion of human reads generally resulted in increased specificity. Normalization of read counts for genome length had a minor effect on overall performance; however, it negatively affected the detection of targets with read counts around detection level. Correlation of sequence read counts with PCR Ct-values varied per classifier, data pre-processing (R² range 15.1–63.4%), and per virus, with outliers up to 3 log10 reads magnitude beyond the predicted read count for viruses with high sequence diversity. In this benchmarking study, sensitivity and specificity were within the ranges of use for diagnostic practice when the cut-off for defining a positive result was considered per classifier.
|
29
|
Chandrakumar I, Gauthier NPG, Nelson C, Bonsall MB, Locher K, Charles M, MacDonald C, Krajden M, Manges AR, Chorlton SD. BugSplit enables genome-resolved metagenomics through highly accurate taxonomic binning of metagenomic assemblies. Commun Biol 2022; 5:151. [PMID: 35194141 PMCID: PMC8864044 DOI: 10.1038/s42003-022-03114-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 11/03/2021] [Accepted: 02/03/2022] [Indexed: 11/13/2022] Open
Abstract
A large gap remains between sequencing a microbial community and characterizing all of the organisms inside of it. Here we develop a novel method to taxonomically bin metagenomic assemblies through alignment of contigs against a reference database. We show that this workflow, BugSplit, bins metagenome-assembled contigs to species with a 33% absolute improvement in F1-score when compared to alternative tools. We perform nanopore mNGS on patients with COVID-19, and using a reference database predating COVID-19, demonstrate that BugSplit's taxonomic binning enables sensitive and specific detection of a novel coronavirus not possible with other approaches. When applied to nanopore mNGS data from cases of Klebsiella pneumoniae and Neisseria gonorrhoeae infection, BugSplit's taxonomic binning accurately separates pathogen sequences from those of the host and microbiota, and unlocks the possibility of sequence typing, in silico serotyping, and antimicrobial resistance prediction of each organism within a sample. BugSplit is available at https://bugseq.com/academic.
Affiliation(s)
- Nick P G Gauthier
- Department of Microbiology and Immunology, University of British Columbia, Vancouver, BC, Canada
- Cassidy Nelson
- Mathematical Ecology Research Group, Department of Zoology, University of Oxford, Oxford, UK
- Michael B Bonsall
- Mathematical Ecology Research Group, Department of Zoology, University of Oxford, Oxford, UK
- Kerstin Locher
- Division of Medical Microbiology, Vancouver General Hospital, Vancouver, BC, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
- Marthe Charles
- Division of Medical Microbiology, Vancouver General Hospital, Vancouver, BC, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
- Clayton MacDonald
- Division of Medical Microbiology, Vancouver General Hospital, Vancouver, BC, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
- Mel Krajden
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
- British Columbia Centre for Disease Control, Vancouver, BC, Canada
- Amee R Manges
- British Columbia Centre for Disease Control, Vancouver, BC, Canada
- School of Population and Public Health, University of British Columbia, Vancouver, BC, Canada
- Samuel D Chorlton
- BugSeq Bioinformatics Inc, Vancouver, BC, Canada
- Department of Pathology and Laboratory Medicine, University of British Columbia, Vancouver, BC, Canada
|
30
|
Siekaniec G, Roux E, Lemane T, Guédon E, Nicolas J. Identification of isolated or mixed strains from long reads: a challenge met on Streptococcus thermophilus using a MinION sequencer. Microb Genom 2021; 7. [PMID: 34812718 PMCID: PMC8743539 DOI: 10.1099/mgen.0.000654] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 01/08/2023] Open
Abstract
This study aimed to provide efficient recognition of bacterial strains on personal computers from MinION (Nanopore) long read data. Thanks to the fall in sequencing costs, the identification of bacteria can now proceed by whole genome sequencing. MinION is a fast, but highly error-prone sequencing device and it is a challenge to successfully identify the strain content of unknown simple or complex microbial samples. It is heavily constrained by memory management and fast access to the read and genome fragments. Our strategy involves three steps: indexing of known genomic sequences for a given or several bacterial species; a request process to assign a read to a strain by matching it to the closest reference genomes; and a final step looking for a minimum set of strains that best explains the observed reads. We have applied our method, called ORI, on 77 strains of Streptococcus thermophilus. We worked on several genomic distances and obtained a detailed classification of the strains, together with a criterion that allows merging of what we termed 'sibling' strains, only separated by a few mutations. Overall, isolated strains can be safely recognized from MinION data. For mixtures of several non-sibling strains, results depend on strain abundance.
Affiliation(s)
- Grégoire Siekaniec
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- INRAE, Institut Agro, STLO, F-35000, Rennes, France
- Emeline Roux
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- CALBINOTOX (Composés ALimentaire BIofonctionnalités et risques NeuTOXiques) EA7488 Université de Lorraine, France
- Téo Lemane
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- Eric Guédon
- INRAE, Institut Agro, STLO, F-35000, Rennes, France
- *Correspondence: Eric Guédon
- Jacques Nicolas
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- *Correspondence: Jacques Nicolas
|
31
|
Tian L, Ou J, Sun X, Miao Y, Pei J, Zhao L, Huang L. The discovery of pivotal fungus and major determinant factor shaping soil microbial community composition associated with rot root of American ginseng. Plant Signal Behav 2021; 16:1952372. [PMID: 34304705 PMCID: PMC8525955 DOI: 10.1080/15592324.2021.1952372] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 05/13/2021] [Revised: 07/01/2021] [Accepted: 07/02/2021] [Indexed: 06/13/2023]
Abstract
American ginseng, a valuable medicinal and food plant, is threatened by rot root, which affects its yield and quality. However, few studies have investigated the changes in soil microbial community and physicochemical properties between healthy and rot root American ginseng. Here, high-throughput sequencing and analysis of soil physicochemical properties were used to characterize these changes. The soil physicochemical properties differed significantly between the soils of healthy and rot root plants: pH, available potassium, available phosphorus, soil organic carbon and soil organic matter were significantly higher in healthy root soil. Fungal α-diversity was also higher in healthy root soil than in rot root soil. Importantly, the dominant fungal genera differed between soils of healthy and rot root American ginseng, and LEfSe further indicated that six fungal genera (Devriesia, Chrysosporium, Dichotomopilus, Pseudeurotium, Acaulium and Scedosporium) were significantly enriched in the soil of healthy plants, whereas six fungal genera (Gibellulopsis, Fusarium, Plectosphaerella, Tetracladium, Gibberella and Ilyonectria) were significantly enriched in the soil of rot root, suggesting that an increase in the relative abundance of the pathogenic fungi Fusarium, Plectosphaerella, and Ilyonectria may be associated with ginseng rot root. Notably, this study is the first to report that an increase in the relative abundances of Gibellulopsis and Gibberella in the rot root soil of American ginseng may be associated with the onset of rot root symptoms in this plant. Functional profile prediction showed a significant increase in pathotrophs in the rot root soil compared with healthy root soil, whereas saprotrophs were more abundant in the healthy root soil. Finally, correlation analyses revealed that soil cation exchange capacity was an important factor affecting the composition of the soil microbial communities associated with rot root of American ginseng.
This study not only used a new approach to explore new fungi associated with rot root in American ginseng but also identified the major soil physicochemical properties affecting microbiome diversity, providing a foundation for developing biocontrol strategies against rot root.
Affiliation(s)
- Lixia Tian
- Key Research Laboratory of Traditional Chinese Medicine Resources Protection, Administration of Traditional Chinese Medicine, National Administration of Traditional Chinese Medicine, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Jiarong Ou
- Tongren Municipal People’s Hospital, Tongren, China
- Xiao Sun
- Key Research Laboratory of Traditional Chinese Medicine Resources Protection, Administration of Traditional Chinese Medicine, National Administration of Traditional Chinese Medicine, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Yujing Miao
- Key Research Laboratory of Traditional Chinese Medicine Resources Protection, Administration of Traditional Chinese Medicine, National Administration of Traditional Chinese Medicine, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
- Jin Pei
- College of Pharmacy, Chengdu University of Traditional Chinese Medicine, Chengdu, Sichuan, China
- Lei Zhao
- Central Medical District of Chinese PLA General Hospital, Beijing, China
- Linfang Huang
- Key Research Laboratory of Traditional Chinese Medicine Resources Protection, Administration of Traditional Chinese Medicine, National Administration of Traditional Chinese Medicine, Institute of Medicinal Plant Development, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China
|
32
|
Rachtman E, Bafna V, Mirarab S. CONSULT: accurate contamination removal using locality-sensitive hashing. NAR Genom Bioinform 2021; 3:lqab071. [PMID: 34377979 PMCID: PMC8340999 DOI: 10.1093/nargab/lqab071] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Received: 03/10/2021] [Revised: 06/30/2021] [Accepted: 07/19/2021] [Indexed: 12/27/2022] Open
Abstract
A fundamental question appears in many bioinformatics applications: Does a sequencing read belong to a large dataset of genomes from some broad taxonomic group, even when the closest match in the set is evolutionarily divergent from the query? For example, low-coverage genome sequencing (skimming) projects either assemble the organelle genome or compute genomic distances directly from unassembled reads. Using unassembled reads needs contamination detection because samples often include reads from unintended groups of species. Similarly, assembling the organelle genome needs distinguishing organelle and nuclear reads. While k-mer-based methods have shown promise in read-matching, prior studies have shown that existing methods are insufficiently sensitive for contamination detection. Here, we introduce a new read-matching tool called CONSULT that tests whether k-mers from a query fall within a user-specified distance of the reference dataset using locality-sensitive hashing. Taking advantage of large memory machines available nowadays, CONSULT libraries accommodate tens of thousands of microbial species. Our results show that CONSULT has higher true-positive and lower false-positive rates of contamination detection than leading methods such as Kraken-II and improves distance calculation from genome skims. We also demonstrate that CONSULT can distinguish organelle reads from nuclear reads, leading to dramatic improvements in skim-based mitochondrial assemblies.
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA 92093, USA
- Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, CA 92093, USA
- Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA 92093, USA
|
33
|
Weging S, Gogol-Döring A, Grosse I. Taxonomic analysis of metagenomic data with kASA. Nucleic Acids Res 2021; 49:e68. [PMID: 33784400 PMCID: PMC8266618 DOI: 10.1093/nar/gkab200] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 02/15/2021] [Accepted: 03/10/2021] [Indexed: 11/14/2022] Open
Abstract
The taxonomic analysis of sequencing data has become important in many areas of life sciences. However, currently available tools for that purpose either consume large amounts of RAM or yield insufficient quality and robustness. Here, we present kASA, a k-mer based tool capable of identifying and profiling metagenomic DNA or protein sequences with high computational efficiency and a user-definable memory footprint. We ensure both high sensitivity and precision by using an amino acid-like encoding of k-mers together with a range of multiple k’s. Custom algorithms and data structures optimized for external memory storage enable a full-scale taxonomic analysis without compromise on laptop, desktop, and HPCC.
Affiliation(s)
- Silvio Weging
- Institute of Computer Science, Martin-Luther University Halle-Wittenberg, Von-Seckendorff-Platz 1, Halle, Germany
- Andreas Gogol-Döring
- Department of Mathematics, Natural Sciences and Computer Science, TH Mittelhessen University of Applied Sciences, Wiesenstraße 14, Gießen, Germany
- Ivo Grosse
- Institute of Computer Science, Martin-Luther University Halle-Wittenberg, Von-Seckendorff-Platz 1, Halle, Germany
|
34
|
Sun Z, Huang S, Zhang M, Zhu Q, Haiminen N, Carrieri AP, Vázquez-Baeza Y, Parida L, Kim HC, Knight R, Liu YY. Challenges in benchmarking metagenomic profilers. Nat Methods 2021; 18:618-626. [PMID: 33986544 PMCID: PMC8184642 DOI: 10.1038/s41592-021-01141-3] [Citation(s) in RCA: 65] [Impact Index Per Article: 16.3] [Received: 11/16/2020] [Accepted: 04/02/2021] [Indexed: 02/02/2023]
Abstract
Accurate microbial identification and abundance estimation are crucial for metagenomics analysis. Various methods for classification of metagenomic data and estimation of taxonomic profiles, broadly referred to as metagenomic profilers, have been developed. Nevertheless, benchmarking of metagenomic profilers remains challenging because some tools are designed to report relative sequence abundance while others report relative taxonomic abundance. Here we show how misleading conclusions can be drawn by neglecting this distinction between relative abundance types when benchmarking metagenomic profilers. Moreover, we show compelling evidence that interchanging sequence abundance and taxonomic abundance will influence both per-sample summary statistics and cross-sample comparisons. We suggest that the microbiome research community pay attention to potentially misleading biological conclusions arising from this issue when benchmarking metagenomic profilers, by carefully considering the type of abundance data that were analyzed and interpreted and clearly stating the strategy used for metagenomic profiling.
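The distinction drawn in this abstract can be illustrated with a short Python sketch. It assumes the simplest possible relationship between the two abundance types: sequence abundance is the fraction of reads per taxon, and taxonomic abundance is obtained by dividing by genome length and renormalizing. The function name and the toy values are hypothetical, and further corrections such as ploidy or copy number are deliberately ignored.

```python
def sequence_to_taxonomic_abundance(seq_abundance, genome_length):
    """Convert relative sequence abundance to relative taxonomic abundance.

    seq_abundance : dict taxon -> fraction of reads (sums to 1)
    genome_length : dict taxon -> genome length in base pairs
    Assumption: reads are sampled proportionally to genome length, so dividing
    by length and renormalizing gives the fraction of genomes (cells) per taxon.
    """
    weights = {t: seq_abundance[t] / genome_length[t] for t in seq_abundance}
    total = sum(weights.values())
    return {t: w / total for t, w in weights.items()}


# Toy community (hypothetical): equal read shares, threefold genome size difference.
seq = {"taxon_A": 0.5, "taxon_B": 0.5}
length = {"taxon_A": 2_000_000, "taxon_B": 6_000_000}
tax = sequence_to_taxonomic_abundance(seq, length)
print(tax)
```

In this toy case both taxa receive half of the reads, yet the taxon with the smaller genome accounts for three quarters of the genomes, which is why profiles of the two types should not be compared directly.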
Affiliation(s)
- Zheng Sun
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
- Shi Huang
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA
- Meng Zhang
- Key Laboratory of Dairy Biotechnology and Engineering, Ministry of Education, Inner Mongolia Agricultural University, Hohhot, China
- Qiyun Zhu
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA
- Niina Haiminen
- IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
- Yoshiki Vázquez-Baeza
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA
- Laxmi Parida
- IBM T. J. Watson Research Center, Yorktown Heights, NY, USA
- Ho-Cheol Kim
- AI and Cognitive Software, IBM Research-Almaden, San Jose, CA, USA
- Rob Knight
- Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, Jacobs School of Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA
- Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA
- Yang-Yu Liu
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
|
35
|
Beghini F, McIver LJ, Blanco-Míguez A, Dubois L, Asnicar F, Maharjan S, Mailyan A, Manghi P, Scholz M, Thomas AM, Valles-Colomer M, Weingart G, Zhang Y, Zolfo M, Huttenhower C, Franzosa EA, Segata N. Integrating taxonomic, functional, and strain-level profiling of diverse microbial communities with bioBakery 3. eLife 2021; 10:65088. [PMID: 33944776 PMCID: PMC8096432 DOI: 10.7554/elife.65088] [Citation(s) in RCA: 1110] [Impact Index Per Article: 277.5] [Received: 11/22/2020] [Accepted: 04/21/2021] [Indexed: 02/06/2023] Open
Abstract
Culture-independent analyses of microbial communities have progressed dramatically in the last decade, particularly due to advances in methods for biological profiling via shotgun metagenomics. Opportunities for improvement continue to accelerate, with greater access to multi-omics, microbial reference genomes, and strain-level diversity. To leverage these, we present bioBakery 3, a set of integrated, improved methods for taxonomic, strain-level, functional, and phylogenetic profiling of metagenomes newly developed to build on the largest set of reference sequences now available. Compared to current alternatives, MetaPhlAn 3 increases the accuracy of taxonomic profiling, and HUMAnN 3 improves that of functional potential and activity. These methods detected novel disease-microbiome links in applications to CRC (1262 metagenomes) and IBD (1635 metagenomes and 817 metatranscriptomes). Strain-level profiling of an additional 4077 metagenomes with StrainPhlAn 3 and PanPhlAn 3 unraveled the phylogenetic and functional structure of the common gut microbe Ruminococcus bromii, previously described by only 15 isolate genomes. With open-source implementations and cloud-deployable reproducible workflows, the bioBakery 3 platform can help researchers deepen the resolution, scale, and accuracy of multi-omic profiling for microbial community studies.
Affiliation(s)
- Lauren J McIver
- Harvard T.H. Chan School of Public Health, Boston, United States
- Sagun Maharjan
- Harvard T.H. Chan School of Public Health, Boston, United States; The Broad Institute of MIT and Harvard, Cambridge, United States
- Ana Mailyan
- Harvard T.H. Chan School of Public Health, Boston, United States; The Broad Institute of MIT and Harvard, Cambridge, United States
- Paolo Manghi
- Department CIBIO, University of Trento, Trento, Italy
- Matthias Scholz
- Department of Food Quality and Nutrition, Research and Innovation Center, Edmund Mach Foundation, San Michele all'Adige, Italy
- George Weingart
- Harvard T.H. Chan School of Public Health, Boston, United States; The Broad Institute of MIT and Harvard, Cambridge, United States
- Yancong Zhang
- Harvard T.H. Chan School of Public Health, Boston, United States; The Broad Institute of MIT and Harvard, Cambridge, United States
- Moreno Zolfo
- Department CIBIO, University of Trento, Trento, Italy
- Curtis Huttenhower
- Harvard T.H. Chan School of Public Health, Boston, United States; The Broad Institute of MIT and Harvard, Cambridge, United States
- Eric A Franzosa
- Harvard T.H. Chan School of Public Health, Boston, United States; The Broad Institute of MIT and Harvard, Cambridge, United States
- Nicola Segata
- Department CIBIO, University of Trento, Trento, Italy; IEO, European Institute of Oncology IRCCS, Milan, Italy
|
36
|
Meyer F, Lesker TR, Koslicki D, Fritz A, Gurevich A, Darling AE, Sczyrba A, Bremges A, McHardy AC. Tutorial: assessing metagenomics software with the CAMI benchmarking toolkit. Nat Protoc 2021; 16:1785-1801. [PMID: 33649565 DOI: 10.1038/s41596-020-00480-3] [Citation(s) in RCA: 29] [Impact Index Per Article: 7.3] [Received: 03/10/2020] [Accepted: 11/26/2020] [Indexed: 01/31/2023]
Abstract
Computational methods are key in microbiome research, and obtaining a quantitative and unbiased performance estimate is important for method developers and applied researchers. For meaningful comparisons between methods, to identify best practices and common use cases, and to reduce overhead in benchmarking, it is necessary to have standardized datasets, procedures and metrics for evaluation. In this tutorial, we describe emerging standards in computational meta-omics benchmarking derived and agreed upon by a larger community of researchers. Specifically, we outline recent efforts by the Critical Assessment of Metagenome Interpretation (CAMI) initiative, which supplies method developers and applied researchers with exhaustive quantitative data about software performance in realistic scenarios and organizes community-driven benchmarking challenges. We explain the most relevant evaluation metrics for assessing metagenome assembly, binning and profiling results, and provide step-by-step instructions on how to generate them. The instructions use simulated mouse gut metagenome data released in preparation for the second round of CAMI challenges and showcase the use of a repository of tool results for CAMI datasets. This tutorial will serve as a reference for the community and facilitate informative and reproducible benchmarking in microbiome research.
Affiliation(s)
- Fernando Meyer
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Till-Robin Lesker
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; German Center for Infection Research (DZIF), Braunschweig, Germany
- David Koslicki
- Computer Science and Engineering, Biology, and The Huck Institutes of the Life Sciences, Penn State University, State College, PA, USA
- Adrian Fritz
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
- Alexey Gurevich
- Center for Algorithmic Biotechnology, St. Petersburg State University, St. Petersburg, Russia
- Aaron E Darling
- The ithree institute, University of Technology Sydney, Sydney, Australia
- Alexander Sczyrba
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Andreas Bremges
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany; German Center for Infection Research (DZIF), Braunschweig, Germany
- Alice C McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany
|
37
|
Snipen L, Angell IL, Rognes T, Rudi K. Reduced metagenome sequencing for strain-resolution taxonomic profiles. Microbiome 2021; 9:79. [PMID: 33781324 PMCID: PMC8008692 DOI: 10.1186/s40168-021-01019-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 07/16/2020] [Accepted: 02/02/2021] [Indexed: 05/05/2023]
Abstract
BACKGROUND Studies of shifts in microbial community composition have many applications. For studies at species or subspecies levels, 16S amplicon sequencing lacks resolution and is often replaced by full shotgun sequencing. Due to higher costs, this restricts the number of samples sequenced. As an alternative to full shotgun sequencing we have investigated the use of Reduced Metagenome Sequencing (RMS) to estimate the composition of a microbial community. This involves the use of double-digested restriction-associated DNA sequencing, which means only a smaller fraction of the genomes is sequenced. The read sets obtained by this approach have properties different from both amplicon and shotgun data, and analysis pipelines for both either cannot be used at all or do not exploit the full potential of RMS data. RESULTS We suggest a procedure for analyzing such data, based on fragment clustering and the use of a constrained ordinary least squares de-convolution for estimating the relative abundance of all community members. Mock community datasets show the potential to clearly separate strains even when the 16S is 100% identical and genome-wide differences are < 0.02, indicating RMS has a very high resolution. From a simulation study, we compare RMS to shotgun sequencing and show that we get improved abundance estimates when the community has many very closely related genomes. From a real dataset of infant guts, we show that RMS is capable of detecting a strain diversity gradient for Escherichia coli across time. CONCLUSION We find that RMS is a good alternative to either metabarcoding or shotgun sequencing when it comes to resolving microbial communities at the strain level. Like shotgun metagenomics, it requires a good database of reference genomes and is well suited for studies of the human gut or other communities where many reference genomes exist. A data analysis pipeline is offered as an R package at https://github.com/larssnip/microRMS.
Affiliation(s)
- Lars Snipen
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, P.O. Box 5003, NO-1432 Ås, Norway
- Inga-Leena Angell
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, P.O. Box 5003, NO-1432 Ås, Norway
- Torbjørn Rognes
- Department of Informatics, University of Oslo, P.O. Box 1080, Blindern, NO-0316 Oslo, Norway
- Knut Rudi
- Department of Chemistry, Biotechnology and Food Sciences, Norwegian University of Life Sciences, P.O. Box 5003, NO-1432 Ås, Norway
|
38
|
Wei ZG, Zhang XD, Cao M, Liu F, Qian Y, Zhang SW. Comparison of Methods for Picking the Operational Taxonomic Units From Amplicon Sequences. Front Microbiol 2021; 12:644012. [PMID: 33841367 PMCID: PMC8024490 DOI: 10.3389/fmicb.2021.644012] [Citation(s) in RCA: 18] [Impact Index Per Article: 4.5] [Received: 12/19/2020] [Accepted: 02/17/2021] [Indexed: 12/31/2022] Open
Abstract
With the advent of next-generation sequencing technology, it has become convenient and cost-efficient to thoroughly characterize the microbial diversity and taxonomic composition of various environmental samples. Millions of sequences can be generated, and how to utilize this enormous sequence resource has become a critical concern for microbial ecologists. One particular challenge is picking OTUs (operational taxonomic units) in 16S rRNA sequence analysis. Fortunately, this challenge can be directly addressed by sequence clustering, which attempts to group similar sequences. Therefore, numerous clustering methods have been proposed to cluster 16S rRNA sequences into OTUs. However, each method has its own clustering mechanism, and different methods produce different outputs. Even a slight parameter change for the same method can generate distinct results, and choosing an appropriate method has become a challenge for inexperienced users. A lot of time and resources can be wasted in selecting clustering tools and analyzing the clustering results. In this study, we review recent advances in clustering methods for OTU picking, focusing on three aspects: (i) the principles of existing clustering algorithms, (ii) benchmark dataset construction for OTU picking and evaluation metrics, and (iii) the performance of different methods with various distance thresholds on benchmark datasets. This paper aims to help biological researchers select reasonable clustering methods for analyzing their collected sequences and to help algorithm developers design more efficient sequence clustering methods.
Affiliation(s)
- Ze-Gang Wei
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
- Xiao-Dan Zhang
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Ming Cao
- Faculty of Electronic and Information Engineering, Xi’an Jiaotong University, Xi’an, China
- School of Mathematics and Statistics, Shaanxi Xueqian Normal University, Xi’an, China
- Fei Liu
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Yu Qian
- Institute of Physics and Optoelectronics Technology, Baoji University of Arts and Sciences, Baoji, China
- Shao-Wu Zhang
- Key Laboratory of Information Fusion Technology of Ministry of Education, School of Automation, Northwestern Polytechnical University, Xi’an, China
|
39
|
Plyusnin I, Kant R, Jääskeläinen AJ, Sironen T, Holm L, Vapalahti O, Smura T. Novel NGS pipeline for virus discovery from a wide spectrum of hosts and sample types. Virus Evol 2020; 6:veaa091. [PMID: 33408878 PMCID: PMC7772471 DOI: 10.1093/ve/veaa091] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Indexed: 12/12/2022] Open
Abstract
The study of microbiome data holds great potential for elucidating the biological and metabolic functioning of living organisms and their role in the environment. Metagenomic analyses have shown that humans, along with, for example, domestic animals, wildlife and arthropods, are colonized by an immense community of viruses. The current Coronavirus pandemic (COVID-19) heightens the need to rapidly detect previously unknown viruses in an unbiased way. The increasing availability of metagenomic data in this era of next-generation sequencing (NGS), along with increasingly affordable sequencing technologies, highlights the need for reliable and comprehensive methods to manage such data. In this article, we present a novel bioinformatics pipeline called LAZYPIPE for identifying both previously known and novel viruses in host-associated or environmental samples and give examples of virus discovery based on it. LAZYPIPE is a Unix-based pipeline for automated assembly and taxonomic profiling of NGS libraries, implemented as a collection of C++, Perl, and R scripts.
Affiliation(s)
- Ilya Plyusnin
- Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland
| | - Ravi Kant
- Department of Veterinary Bioscience, University of Helsinki, Helsinki 00014, Finland
| | - Anne J Jääskeläinen
- Department of Virology and Immunology, University of Helsinki and Helsinki University Hospital, Helsinki 00014, Finland
| | - Tarja Sironen
- Department of Veterinary Bioscience, University of Helsinki, Helsinki 00014, Finland
| | - Liisa Holm
- Institute of Biotechnology, University of Helsinki, Helsinki 00014, Finland
| | - Olli Vapalahti
- Department of Veterinary Bioscience, University of Helsinki, Helsinki 00014, Finland
| | - Teemu Smura
- Department of Virology, University of Helsinki, Helsinki 00014, Finland
|
40
|
LaPierre N, Alser M, Eskin E, Koslicki D, Mangul S. Metalign: efficient alignment-based metagenomic profiling via containment min hash. Genome Biol 2020; 21:242. [PMID: 32912225 PMCID: PMC7488264 DOI: 10.1186/s13059-020-02159-0] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Received: 12/17/2019] [Accepted: 08/26/2020] [Indexed: 12/31/2022] Open
Abstract
Metagenomic profiling, predicting the presence and relative abundances of microbes in a sample, is a critical first step in microbiome analysis. Alignment-based approaches are often considered accurate yet computationally infeasible. Here, we present a novel method, Metalign, that performs efficient and accurate alignment-based metagenomic profiling. We use a novel containment min hash approach to pre-filter the reference database prior to alignment and then process both uniquely aligned and multi-aligned reads to produce accurate abundance estimates. In performance evaluations on both real and simulated datasets, Metalign is the only method evaluated that maintained high performance and competitive running time across all datasets.
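As a rough illustration of the pre-filtering idea described in this abstract, the sketch below computes an exact k-mer containment index and keeps only the reference genomes that are well covered by the sample. It is not Metalign's implementation: Metalign estimates containment with a min hash approach, whereas this example uses plain Python sets for clarity. The function names, the value of `k`, and the `threshold` are all hypothetical choices made for this example.

```python
def kmers(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def containment(query: set[str], sample: set[str]) -> float:
    """Fraction of the query's k-mers that also occur in the sample."""
    if not query:
        return 0.0
    return len(query & sample) / len(query)


def prefilter(references: dict[str, str], sample_reads: list[str],
              k: int = 4, threshold: float = 0.5) -> list[str]:
    """Keep reference names whose containment in the sample reaches threshold.

    k and threshold are illustrative values, not settings from the paper.
    """
    sample_kmers: set[str] = set()
    for read in sample_reads:
        sample_kmers |= kmers(read, k)
    return [name for name, genome in references.items()
            if containment(kmers(genome, k), sample_kmers) >= threshold]


if __name__ == "__main__":
    refs = {"genome_a": "ACGTACGT", "genome_b": "TTTTGGGG"}
    reads = ["ACGTAC", "GTACGT"]
    print(prefilter(refs, reads))
```

Containment is asymmetric: it asks how much of a reference is present in the sample, so a small genome is not penalized for the sample containing many other organisms.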
Affiliation(s)
- Nathan LaPierre
- Department of Computer Science, University of California, Los Angeles, CA, 90095, USA.
| | - Mohammed Alser
- Department of Computer Science, ETH Zurich, Rämistrasse 101, CH-8092, Zurich, Switzerland
| | - Eleazar Eskin
- Department of Computer Science, University of California, Los Angeles, CA, 90095, USA
- Department of Computational Medicine, University of California, Los Angeles, CA, 90095, USA
- Department of Human Genetics, University of California, Los Angeles, CA, 90095, USA
| | - David Koslicki
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA, USA.
- Department of Biology, The Pennsylvania State University, University Park, PA, USA.
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA.
| | - Serghei Mangul
- Department of Clinical Pharmacy, University of Southern California, Los Angeles, CA, 90089, USA.
|
41
|
Yue Y, Huang H, Qi Z, Dou HM, Liu XY, Han TF, Chen Y, Song XJ, Zhang YH, Tu J. Evaluating metagenomics tools for genome binning with real metagenomic datasets and CAMI datasets. BMC Bioinformatics 2020; 21:334. [PMID: 32723290 PMCID: PMC7469296 DOI: 10.1186/s12859-020-03667-3] [Citation(s) in RCA: 48] [Impact Index Per Article: 9.6] [Received: 11/02/2019] [Accepted: 07/16/2020] [Indexed: 12/13/2022] Open
Abstract
Background Shotgun metagenomics based on untargeted sequencing can explore the taxonomic profile and the function of unknown microorganisms in samples, and compensate for the shortcomings of amplicon sequencing. Binning assembled sequences into individual groups, which represent microbial genomes, is the key step and a major challenge in metagenomic research. Both supervised and unsupervised machine learning methods have been employed in binning. Genome binning, an unsupervised approach, clusters contigs into individual genome bins by machine learning methods without the assistance of any reference databases. Many genome binning tools have emerged so far, and evaluating them is of great significance to microbiological research. In this study, we evaluate 15 genome binning tools, comprising 12 original binning tools and 3 refining binning tools, by comparing their performance on chicken gut metagenomic datasets and the first CAMI challenge datasets. Results For chicken gut metagenomic datasets, the original genome binners MetaBat, Groopm2 and Autometa performed better than other original binners, and MetaWrap, which combined their binning results, generated the most high-quality genome bins. For CAMI datasets, Groopm2 achieved the highest purity (> 0.9) with good completeness (> 0.8), and reconstructed the most high-quality genome bins among original genome binners. Compared with Groopm2, MetaBat2 had similar performance with higher completeness and lower purity. The genome refining binner DASTool predicted the most high-quality genome bins among all genome binners. Most genome binners performed well for unique strains. Nonetheless, reconstructing common strains is still a substantial challenge for all genome binners.
Conclusions In conclusion, we tested a set of currently available, state-of-the-art metagenomics hybrid binning tools and provided a guide for selecting tools for metagenomic binning by comparing the range of purity, completeness, adjusted Rand index, and the number of high-quality reconstructed bins. Furthermore, information useful for future binning strategies was summarized.
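The purity and completeness figures quoted in this abstract can be made concrete with a small sketch. The code below uses one common base-pair-weighted definition, which is an assumption here rather than the exact procedure of the study: each bin is mapped to the genome contributing the most base pairs, purity is that genome's share of the bin, and completeness is the share of the genome recovered in the bin. The input layout and the name `bin_quality` are hypothetical.

```python
from collections import defaultdict


def bin_quality(contigs, genome_sizes):
    """Compute per-bin purity and completeness.

    contigs: iterable of (bin_id, true_genome, length_bp) tuples.
    genome_sizes: dict mapping each genome to its total length in bp.
    Returns {bin_id: (majority_genome, purity, completeness)}.
    """
    # Sum base pairs per (bin, genome) pair.
    per_bin = defaultdict(lambda: defaultdict(int))
    for bin_id, genome, length in contigs:
        per_bin[bin_id][genome] += length

    result = {}
    for bin_id, by_genome in per_bin.items():
        # The genome contributing the most base pairs defines the bin.
        genome, overlap = max(by_genome.items(), key=lambda kv: kv[1])
        purity = overlap / sum(by_genome.values())
        completeness = overlap / genome_sizes[genome]
        result[bin_id] = (genome, purity, completeness)
    return result


if __name__ == "__main__":
    contigs = [("bin1", "gA", 900), ("bin1", "gB", 100), ("bin2", "gB", 400)]
    sizes = {"gA": 1000, "gB": 500}
    for bin_id, (genome, purity, completeness) in bin_quality(contigs, sizes).items():
        print(bin_id, genome, round(purity, 2), round(completeness, 2))
```

A bin with high purity but low completeness is a clean fragment of a genome, while the reverse indicates a bin that captured most of a genome along with contamination.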
Affiliation(s)
- Yi Yue
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
| | - Hao Huang
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China; School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
| | - Zhao Qi
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
| | - Hui-Min Dou
- School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
| | - Xin-Yi Liu
- School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China
| | - Tian-Fei Han
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
| | - Yue Chen
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
| | - Xiang-Jun Song
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
| | - You-Hua Zhang
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China; School of Life Sciences, Anhui Agricultural University, Hefei, 230036, China
| | - Jian Tu
- Anhui Province Key Laboratory of Veterinary Pathobiology and Disease Control, Anhui Agricultural University, Hefei, 230036, China; School of Information & Computer, Anhui Agricultural University, Hefei, 230036, China; School of Animal Science and Technology, Anhui Agricultural University, Hefei, 230036, China
|
42
|
Seppey M, Manni M, Zdobnov EM. LEMMI: a continuous benchmarking platform for metagenomics classifiers. Genome Res 2020; 30:1208-1216. [PMID: 32616517 PMCID: PMC7462069 DOI: 10.1101/gr.260398.119] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.0] [Received: 01/16/2020] [Accepted: 06/25/2020] [Indexed: 11/24/2022]
Abstract
Studies of microbiomes are booming, along with the diversity of computational approaches to make sense out of the sequencing data and the volumes of accumulated microbial genotypes. A swift evaluation of newly published methods and their improvements against established tools is necessary to reduce the time between the methods' release and their adoption in microbiome analyses. The LEMMI platform offers a novel approach for benchmarking software dedicated to metagenome composition assessments based on read classification. It enables the integration of newly published methods in an independent and centralized benchmark designed to be continuously open to new submissions. This allows developers to be proactive regarding comparative evaluations and guarantees that any promising methods can be assessed side by side with established tools quickly after their release. Moreover, LEMMI enforces an effective distribution through software containers to ensure long-term availability of all methods. Here, we detail the LEMMI workflow and discuss the performances of some previously unevaluated tools. We see this platform eventually as a community-driven effort in which method developers can showcase novel approaches and get unbiased benchmarks for publications, and users can make informed choices and obtain standardized and easy-to-use tools.
Affiliation(s)
- Mathieu Seppey
- Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
| | - Mosè Manni
- Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
| | - Evgeny M Zdobnov
- Department of Genetic Medicine and Development, University of Geneva Medical School and Swiss Institute of Bioinformatics, 1211 Geneva, Switzerland
|
43
|
Abstract
MOTIVATION Consider a simple computational problem. The inputs are (i) the set of mixed reads generated from a sample that combines two organisms and (ii) separate sets of reads for several reference genomes of known origins. The goal is to find the two organisms that constitute the mixed sample. When constituents are absent from the reference set, we seek to phylogenetically position them with respect to the underlying tree of the reference species. This simple yet fundamental problem (which we call phylogenetic double-placement) has enjoyed surprisingly little attention in the literature. As genome skimming (low-pass sequencing of genomes at low coverage, precluding assembly) becomes more prevalent, this problem finds wide-ranging applications in areas as varied as biodiversity research, food production and provenance, and evolutionary reconstruction. RESULTS We introduce a model that relates distances between a mixed sample and reference species to the distances between constituents and reference species. Our model is based on Jaccard indices computed between each sample represented as k-mer sets. The model, built on several assumptions and approximations, allows us to formalize the phylogenetic double-placement problem as a non-convex optimization problem that decomposes mixture distances and performs phylogenetic placement simultaneously. Using a variety of techniques, we are able to solve this optimization problem numerically. We test the resulting method, called MIxed Sample Analysis tool (MISA), on a varied set of simulated and biological datasets. Despite all the assumptions used, the method performs remarkably well in practice. AVAILABILITY AND IMPLEMENTATION The software and data are available at https://github.com/balabanmetin/misa and https://github.com/balabanmetin/misa-data. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
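To make the quantity underlying this abstract concrete, the sketch below computes Jaccard indices between samples represented as k-mer sets, including a toy mixed sample formed as the union of two constituents' k-mers. It only illustrates the index itself and is not the MISA model or its optimization procedure. The function names, the sequences, and the choice of `k` are hypothetical.

```python
def kmer_set(seq: str, k: int) -> set[str]:
    """Return the set of all length-k substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 for two empty sets."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


if __name__ == "__main__":
    k = 4  # illustrative only; real analyses use much larger k
    organism_x = kmer_set("ACGTACGT", k)
    organism_y = kmer_set("TTTTGGGG", k)
    # A toy mixed sample: the union of both constituents' k-mers.
    mixture = organism_x | organism_y

    print(jaccard(organism_x, organism_y))  # constituents vs each other
    print(jaccard(mixture, organism_x))     # mixture vs one constituent
```

In this toy case the mixture's similarity to either constituent is lower than a pure sample's would be, which is the kind of distortion a mixture-aware model has to account for.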
Affiliation(s)
- Metin Balaban
- Bioinformatics and Systems Biology Department, University of California San Diego, San Diego, CA 92093, USA
| | - Siavash Mirarab
- Electrical and Computer Engineering Department, University of California San Diego, San Diego, CA 92093, USA
|
44
|
Sim M, Lee J, Lee D, Kwon D, Kim J. TAMA: improved metagenomic sequence classification through meta-analysis. BMC Bioinformatics 2020; 21:185. [PMID: 32397982 PMCID: PMC7218625 DOI: 10.1186/s12859-020-3533-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 02/16/2020] [Accepted: 05/05/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Microorganisms are important occupants of many different environments. Identifying the composition of microbes and estimating their abundance promote understanding of interactions of microbes in environmental samples. To understand their environments more deeply, the composition of microorganisms in environmental samples has been studied using metagenomes, which are the collections of genomes of the microorganisms. Although many tools have been developed for taxonomy analysis based on different algorithms, variability of analysis outputs of existing tools from the same input metagenome datasets is the main obstacle for many researchers in this field. RESULTS Here, we present a novel meta-analysis tool for metagenome taxonomy analysis, called TAMA, by intelligently integrating outputs from three different taxonomy analysis tools. Using an integrated reference database, TAMA performs taxonomy assignment for input metagenome reads based on a meta-score by integrating scores of taxonomy assignment from different taxonomy classification tools. TAMA outperformed existing tools when evaluated using various benchmark datasets. It was also successfully applied to obtain relative species abundance profiles and difference in composition of microorganisms in two types of cheese metagenome and human gut metagenome. CONCLUSION TAMA can be easily installed and used for metagenome read classification and the prediction of relative species abundance from multiple numbers and types of metagenome read samples. TAMA can be used to more accurately uncover the composition of microorganisms in metagenome samples collected from various environments, especially when the use of a single taxonomy analysis tool is unreliable. TAMA is an open source tool, and can be downloaded at https://github.com/jkimlab/TAMA.
Affiliation(s)
- Mikang Sim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
| | - Jongin Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
| | - Daehwan Lee
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
| | - Daehong Kwon
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea
| | - Jaebum Kim
- Department of Biomedical Science and Engineering, Konkuk University, Seoul, 05029, Republic of Korea.
|
45
|
Liu Y, Bible PW, Zou B, Liang Q, Dong C, Wen X, Li Y, Ge X, Li X, Deng X, Ma R, Guo S, Liang J, Chen T, Pan W, Liu L, Chen W, Wang X, Wei L. CSMD: a computational subtraction-based microbiome discovery pipeline for species-level characterization of clinical metagenomic samples. Bioinformatics 2020; 36:1577-1583. [PMID: 31626280 DOI: 10.1093/bioinformatics/btz790] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/20/2019] [Revised: 09/22/2019] [Accepted: 10/16/2019] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION Microbiome analyses of clinical samples with low microbial biomass are challenging because of the very small quantities of microbial DNA relative to the human host, ubiquitous contaminating DNA in sequencing experiments and the large and rapidly growing microbial reference databases. RESULTS We present computational subtraction-based microbiome discovery (CSMD), a bioinformatics pipeline specifically developed to generate accurate species-level microbiome profiles for clinical samples with low microbial loads. CSMD applies strategies for the maximal elimination of host sequences with minimal loss of microbial signal and effectively detects microorganisms present in the sample with minimal false positives using a stepwise convergent solution. CSMD was benchmarked in a comparative evaluation with other classic tools on previously published well-characterized datasets. It showed higher sensitivity and specificity in host sequence removal and higher specificity in microbial identification, which led to more accurate abundance estimation. All these features are integrated into a free and easy-to-use tool. Additionally, CSMD applied to cell-free plasma DNA showed that microbial diversity within these samples is substantially broader than previously believed. AVAILABILITY AND IMPLEMENTATION CSMD is freely available at https://github.com/liuyu8721/csmd. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yu Liu
- Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China; State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Paul W Bible
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China; College of Arts and Sciences, Marian University, Indianapolis, IN 46222, USA
| | - Bin Zou
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Qiaoxing Liang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Cong Dong
- College of Chemistry, Sun Yat-Sen University, Guangzhou 510275, China
| | - Xiaofeng Wen
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Yan Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Xiaofei Ge
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Xifang Li
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Xiuli Deng
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Rong Ma
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Shixin Guo
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Juanran Liang
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Tingting Chen
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
| | - Wenliang Pan
- Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China
| | - Lixin Liu
- College of Chemistry, Sun Yat-Sen University, Guangzhou 510275, China
| | - Wei Chen
- Department of Biostatistics, Graduate School of Public Health, University of Pittsburgh, Pittsburgh, PA 15261, USA; Division of Pulmonary Medicine, Department of Pediatrics, Children's Hospital of Pittsburgh of UPMC, University of Pittsburgh, Pittsburgh, PA 15224, USA
| | - Xueqin Wang
- Department of Statistical Science, School of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China; Southern China Research Center of Statistical Science, Sun Yat-Sen University, Guangzhou 510275, China
| | - Lai Wei
- State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangzhou 510060, China
|
46
|
Rachtman E, Balaban M, Bafna V, Mirarab S. The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters. Mol Ecol Resour 2020; 20. [PMID: 31943790 DOI: 10.1111/1755-0998.13135] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 10/18/2019] [Revised: 12/22/2019] [Accepted: 01/05/2020] [Indexed: 11/27/2022]
Abstract
The ability to detect the identity of a sample obtained from its environment is a cornerstone of molecular ecological research. Thanks to the falling price of shotgun sequencing, genome skimming, the acquisition of short reads spread across the genome at low coverage, is emerging as an alternative to traditional barcoding. By obtaining far more data across the whole genome, skimming has the promise to increase the precision of sample identification beyond traditional barcoding while keeping the costs manageable. While methods for assembly-free sample identification based on genome skims are now available, little is known about how these methods react to the presence of DNA from organisms other than the target species. In this paper, we show that the accuracy of distances computed between a pair of genome skims based on k-mer similarity can degrade dramatically if the skims include contaminant reads; i.e., any reads originating from other organisms. We establish a theoretical model of the impact of contamination. We then suggest and evaluate a solution to the contamination problem: Query reads in a genome skim against an extensive database of possible contaminants (e.g., all microbial organisms) and filter out any read that matches. We evaluate the effectiveness of this strategy when implemented using Kraken-II, in detailed analyses. Our results show substantial improvements in accuracy as a result of filtering but also point to limitations, including a need for relatively close matches in the contaminant database.
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA, USA
| | - Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, CA, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA, USA
|
47
|
Ye SH, Siddle KJ, Park DJ, Sabeti PC. Benchmarking Metagenomics Tools for Taxonomic Classification. Cell 2019; 178:779-794. [PMID: 31398336 PMCID: PMC6716367 DOI: 10.1016/j.cell.2019.07.010] [Citation(s) in RCA: 305] [Impact Index Per Article: 50.8] [Received: 04/15/2019] [Revised: 06/18/2019] [Accepted: 07/08/2019] [Indexed: 01/17/2023]
Abstract
Metagenomic sequencing is revolutionizing the detection and characterization of microbial species, and a wide variety of software tools are available to perform taxonomic classification of these data. The fast pace of development of these tools and the complexity of metagenomic data make it important that researchers are able to benchmark their performance. Here, we review current approaches for metagenomic analysis and evaluate the performance of 20 metagenomic classifiers using simulated and experimental datasets. We describe the key metrics used to assess performance, offer a framework for the comparison of additional classifiers, and discuss the future of metagenomic data analysis.
Affiliation(s)
- Simon H Ye
- Harvard-MIT Health Sciences and Technology, Massachusetts Institute of Technology, Cambridge, MA 02139, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
| | - Katherine J Siddle
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Systems Biology, Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA
| | - Daniel J Park
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Pardis C Sabeti
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA; Center for Systems Biology, Department of Organismal and Evolutionary Biology, Harvard University, Cambridge, MA 02138, USA; Department of Immunology and Infectious Disease, Harvard School of Public Health, Boston, MA 02115, USA; Howard Hughes Medical Institute (HHMI), Chevy Chase, MD 20815, USA
|