1
Plaza-Díaz J, Fernández MF, García F, Chueca N, Fontana L, Álvarez-Mercado AI. Comparison of Three DNA Isolation Methods and Two Sequencing Techniques for the Study of the Human Microbiota. Life (Basel) 2025; 15:599. [PMID: 40283154 PMCID: PMC12028492 DOI: 10.3390/life15040599] [Received: 02/27/2025] [Revised: 03/21/2025] [Accepted: 04/02/2025] [Indexed: 04/07/2025] Open
Abstract
Breast cancer is the most commonly diagnosed cancer in women and the second leading cause of female death. Altered interactions between the host and the gut microbiota appear to play an influential role in carcinogenesis. Several studies have shown different signatures of the gut microbiota in patients with breast cancer compared to healthy women. Currently, there is disagreement regarding the different DNA isolation and sequencing methodologies for studies on the human microbiota, given that they can influence the interpretation of the results obtained. The goal of this work was to compare (1) three different DNA extraction strategies to minimize the impact of human DNA, and (2) two sequencing strategies (16S rRNA and shotgun) to identify discrepancies in microbiome results. We made use of breast tissue and fecal samples from both healthy women and breast cancer patients who participated in the MICROMA study (reference NCT03885648). DNA was isolated by means of mechanical lysis, trypsin, or saponin. The amount of eukaryotic DNA isolated using the trypsin and saponin methods was lower compared to the mechanical lysis method (mechanical lysis, 89.11 ± 2.32%; trypsin method, 82.63 ± 1.23%; saponin method, 80.53 ± 4.09%). In samples with a predominance of prokaryotic cells, such as feces, 16S rRNA sequencing was the most advantageous approach. For other tissues, which are expected to have a more complex microbial composition, the need for an in-depth evaluation of the multifactorial interaction between the various components of the microbiota makes shotgun sequencing the most appropriate method. As for the three extraction methods evaluated, when sequencing samples other than stool, the trypsin method is the most convenient. For fecal samples, where contamination by host DNA is low, no prior treatment is necessary.
Affiliation(s)
- Julio Plaza-Díaz
  - Institute of Biosanitary Research (ibs.GRANADA), San Cecilio University Clinical Hospital, 18012 Granada, Spain; (J.P.-D.); (M.F.F.); (F.G.); (N.C.)
  - School of Health Sciences, International University of La Rioja, 26001 Logroño, Spain
- Mariana F. Fernández
  - Institute of Biosanitary Research (ibs.GRANADA), San Cecilio University Clinical Hospital, 18012 Granada, Spain; (J.P.-D.); (M.F.F.); (F.G.); (N.C.)
  - Spanish Consortium for Research on Epidemiology and Public Health (CIBERESP), 28029 Madrid, Spain
  - Department of Radiology and Physical Medicine, School of Medicine, University of Granada, 18016 Granada, Spain
- Federico García
  - Institute of Biosanitary Research (ibs.GRANADA), San Cecilio University Clinical Hospital, 18012 Granada, Spain; (J.P.-D.); (M.F.F.); (F.G.); (N.C.)
  - Microbiology Unit, San Cecilio University Clinical Hospital, 18016 Granada, Spain
  - Spanish Consortium for Research on Infectious Diseases (CIBERINFEC), 28029 Madrid, Spain
- Natalia Chueca
  - Institute of Biosanitary Research (ibs.GRANADA), San Cecilio University Clinical Hospital, 18012 Granada, Spain; (J.P.-D.); (M.F.F.); (F.G.); (N.C.)
  - Microbiology Unit, San Cecilio University Clinical Hospital, 18016 Granada, Spain
  - Spanish Consortium for Research on Infectious Diseases (CIBERINFEC), 28029 Madrid, Spain
- Luis Fontana
  - Institute of Biosanitary Research (ibs.GRANADA), San Cecilio University Clinical Hospital, 18012 Granada, Spain; (J.P.-D.); (M.F.F.); (F.G.); (N.C.)
  - Department of Biochemistry and Molecular Biology II, School of Pharmacy, University of Granada, 18071 Granada, Spain
  - Institute of Nutrition and Food Technology “José Matáix”, Centre of Biomedical Research, University of Granada, 18016 Granada, Spain
- Ana I. Álvarez-Mercado
  - Institute of Biosanitary Research (ibs.GRANADA), San Cecilio University Clinical Hospital, 18012 Granada, Spain; (J.P.-D.); (M.F.F.); (F.G.); (N.C.)
  - Institute of Nutrition and Food Technology “José Matáix”, Centre of Biomedical Research, University of Granada, 18016 Granada, Spain
  - Department of Pharmacology, School of Pharmacy, 18071 Granada, Spain
2
Herazo-Álvarez J, Mora M, Cuadros-Orellana S, Vilches-Ponce K, Hernández-García R. A review of neural networks for metagenomic binning. Brief Bioinform 2025; 26:bbaf065. [PMID: 40131312 PMCID: PMC11934572 DOI: 10.1093/bib/bbaf065] [Received: 06/17/2024] [Revised: 01/02/2025] [Accepted: 03/07/2025] [Indexed: 03/26/2025] Open
Abstract
One of the main goals of metagenomic studies is to describe the taxonomic diversity of microbial communities. A crucial step in metagenomic analysis is metagenomic binning, which involves the (supervised) classification or (unsupervised) clustering of metagenomic sequences. Various machine learning models have been applied to this task. This review details the contributions of artificial neural networks (ANN) to metagenomic binning, covering supervised, unsupervised, and semi-supervised approaches. Thirty-four ANN-based binning tools are systematically compared in terms of their architectures, input features, datasets, advantages, disadvantages, and other relevant aspects. The findings reveal that deep learning approaches, such as convolutional neural networks and autoencoders, achieve higher accuracy and scalability than traditional methods. Gaps in benchmarking practices are highlighted, and future directions are proposed, including standardized datasets and the optimization of architectures for third-generation sequencing. This review supports researchers in identifying trends and selecting suitable tools for the metagenomic binning problem.
Affiliation(s)
- Jair Herazo-Álvarez
  - Doctorado en Modelamiento Matemático Aplicado, Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Marco Mora
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Sara Cuadros-Orellana
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Centro de Biotecnología de los Recursos Naturales (CENBio), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Karina Vilches-Ponce
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
- Ruber Hernández-García
  - Laboratory of Technological Research in Pattern Recognition (LITRP), Universidad Católica del Maule, Talca, Maule 3480564, Chile
  - Departamento de Computación e Industrias, Facultad de Ciencias de la Ingeniería, Universidad Católica del Maule, Talca, Maule 3480564, Chile
3
Ramos Lopez D, Flores FJ, Espindola AS. MeStanG-Resource for High-Throughput Sequencing Standard Data Sets Generation for Bioinformatic Methods Evaluation and Validation. BIOLOGY 2025; 14:69. [PMID: 39857299 PMCID: PMC11762867 DOI: 10.3390/biology14010069] [Received: 12/12/2024] [Revised: 01/10/2025] [Accepted: 01/11/2025] [Indexed: 01/27/2025]
Abstract
Metagenomics analysis has enabled the measurement of microbiome diversity in environmental samples without prior targeted enrichment. Functional and phylogenetic studies based on microbial diversity retrieved using HTS platforms have advanced from detecting known organisms and discovering unknown species to applications in disease diagnostics. Robust validation processes are essential for test reliability, requiring standard samples and databases derived from real samples and in silico generated artificial controls. We propose MeStanG as a resource for generating HTS Nanopore data sets to evaluate present and emerging bioinformatics pipelines. MeStanG allows samples to be designed with user-defined organism abundances expressed as number of reads, reference sequences, and predetermined or custom error profiles. The simulator pipeline was evaluated by analyzing its output mock metagenomic samples containing known read abundances using read mapping, genome assembly, and taxonomic classification in three scenarios: a bacterial community composed of nine different organisms, samples resembling pathogen-infected wheat plants, and a serial dilution of a viral pathogen. The evaluation consistently reported the same organisms and read abundances as specified in the mock metagenomic sample design. Based on this performance and its novel capacity to generate an exact number of reads, MeStanG can be used by scientists to develop mock metagenomic samples (artificial HTS data sets) to assess the diagnostic performance metrics of bioinformatic pipelines, allowing the user to choose predetermined or customized models for research and training.
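The idea of designing a mock sample with abundances expressed as exact read counts can be illustrated with a small sketch. This is not MeStanG's code or interface: the function `simulate_reads`, its parameters, and the substitution-only error model are assumptions made for illustration (real Nanopore error profiles also include insertions and deletions).

```python
import random
from collections import Counter

def simulate_reads(references, read_counts, read_length=50, error_rate=0.0, seed=0):
    """Draw an exact number of reads per organism from its reference sequence.

    Hypothetical helper, not MeStanG's API. Errors are modelled as random
    substitutions only, which is a simplification.
    """
    rng = random.Random(seed)
    reads = []
    for name, count in read_counts.items():
        seq = references[name]
        for _ in range(count):
            # Pick a random window of the reference as the error-free read.
            start = rng.randrange(0, len(seq) - read_length + 1)
            fragment = seq[start:start + read_length]
            # Replace each base with a different base with probability error_rate.
            bases = [
                rng.choice("ACGT".replace(b, "")) if rng.random() < error_rate else b
                for b in fragment
            ]
            reads.append((name, "".join(bases)))
    rng.shuffle(reads)  # mix organisms, as in a real sequencing run
    return reads

# Toy references (made-up sequences) and a design given as read counts.
references = {"org_a": "ACGT" * 50, "org_b": "TTGACA" * 40}
design = {"org_a": 7, "org_b": 3}
reads = simulate_reads(references, design, read_length=50, error_rate=0.02)
print(Counter(name for name, _ in reads))
```

Because the design is given in reads rather than proportions, the per-organism counts in the output match the design exactly, which is what makes such data usable as ground truth.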
Affiliation(s)
- Daniel Ramos Lopez
  - Institute for Biosecurity and Microbial Forensics (IBMF), Oklahoma State University, Stillwater, OK 74078, USA
  - Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA
- Francisco J. Flores
  - Departamento de Ciencias de la Vida y la Agricultura, Universidad de las Fuerzas Armadas-ESPE, Sangolquí 171103, Ecuador
  - Centro de Investigación de Alimentos, CIAL, Facultad de Ciencias de la Ingeniería e Industrias, Universidad UTE, Quito 170527, Ecuador
- Andres S. Espindola
  - Institute for Biosecurity and Microbial Forensics (IBMF), Oklahoma State University, Stillwater, OK 74078, USA
  - Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA
4
Kohnert E, Kreutz C. Computational Study Protocol: Leveraging Synthetic Data to Validate a Benchmark Study for Differential Abundance Tests for 16S Microbiome Sequencing Data. F1000Res 2025; 13:1180. [PMID: 39866725 PMCID: PMC11757917 DOI: 10.12688/f1000research.155230.2] [Accepted: 12/19/2024] [Indexed: 01/28/2025] Open
Abstract
Background: Synthetic data's utility in benchmark studies depends on its ability to closely mimic real-world conditions and reproduce results obtained from experimental data. Building on the study by Nearing et al. (1), who assessed 14 differential abundance tests using 38 experimental 16S rRNA datasets in a case-control design, we are generating synthetic datasets that mimic the experimental data to verify their findings. We will employ statistical tests to rigorously assess the similarity between synthetic and experimental data and to validate the conclusions on the performance of these tests drawn by Nearing et al. (1). This protocol adheres to the SPIRIT guidelines, demonstrating how established reporting frameworks can support robust, transparent, and unbiased study planning. Methods: We replicate the methodology of Nearing et al. (1), incorporating synthetic data simulated using two distinct tools, mirroring the 38 experimental datasets. Equivalence tests will be conducted on a non-redundant subset of 46 data characteristics comparing synthetic and experimental data, complemented by principal component analysis for overall similarity assessment. The 14 differential abundance tests will be applied to synthetic and experimental datasets, evaluating the consistency of significant feature identification and the number of significant features per tool. Correlation analysis and multiple regression will explore how differences between synthetic and experimental data characteristics may affect the results. Conclusions: Synthetic data enables the validation of findings through controlled experiments. We assess how well synthetic data replicates experimental data, attempt to validate previous findings with the most recent versions of the differential abundance (DA) methods, and delineate the strengths and limitations of synthetic data in benchmark studies. Moreover, to our knowledge this is the first computational benchmark study to systematically incorporate synthetic data for validating differential abundance methods while strictly adhering to a pre-specified study protocol following SPIRIT guidelines, contributing to transparency, reproducibility, and unbiased research.
Affiliation(s)
- Eva Kohnert
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany
- Clemens Kreutz
  - Institute of Medical Biometry and Statistics, Faculty of Medicine and Medical Center, University of Freiburg, Baden-Württemberg, Germany
5
Amaro-da-Cruz A, Rubio-Tomás T, Álvarez-Mercado AI. Specific microbiome patterns and their association with breast cancer: the intestinal microbiota as a potential biomarker and therapeutic strategy. Clin Transl Oncol 2025; 27:15-41. [PMID: 38890244 PMCID: PMC11735593 DOI: 10.1007/s12094-024-03554-w] [Received: 03/13/2024] [Accepted: 06/04/2024] [Indexed: 06/20/2024]
Abstract
Breast cancer (BC) is one of the most diagnosed cancers in women. Based on histological characteristics, tumors are classified as non-invasive, or in situ (located within the milk ducts or milk lobules), and invasive. BC may develop from in situ carcinomas over time. Determining prognosis and predicting response to treatment are essential tools to manage this disease and reduce its incidence and mortality, as well as to promote personalized therapy for patients. However, over half of the cases are not associated with known risk factors. In addition, some patients develop resistance to treatment and relapse. Therefore, it is necessary to identify new biomarkers and treatment strategies that improve existing therapies. In this regard, the microbiome is being researched because it could play a role in carcinogenesis and in the efficacy of BC therapies. This review aims to describe specific microbiome patterns associated with BC. To this end, a literature search was carried out in the PubMed database using the MeSH terms "Breast Neoplasms" and "Gastrointestinal Microbiome", and 29 publications were included. Most of the studies focused on characterizing the gut or breast tissue microbiome of the patients. Animal and in vitro studies that investigated the impact of gut microbiota (GM) on BC treatments and the effects of the microbiome on tumor cells were also included. Based on the results of the included articles, BC could be associated with an imbalance in the GM. This imbalance varied depending on molecular type, stage and grade of cancer, menopause, menarche, body mass index, and physical activity. However, a specific microbial profile could not be identified as a biomarker. On the other hand, some studies suggest that the GM may influence the efficacy of BC therapies. In addition, some microorganisms and bacterial metabolites could improve the effects of therapies or influence tumor development.
Affiliation(s)
- Alba Amaro-da-Cruz
  - Department of Chemical Engineering, Faculty of Science, University of Granada, 18071, Granada, Spain
- Teresa Rubio-Tomás
  - Institute of Molecular Biology and Biotechnology, Foundation for Research and Technology-Hellas, Heraklion, Crete, Greece
- Ana I Álvarez-Mercado
  - Instituto de Investigación Biosanitaria ibs.GRANADA, Complejo Hospitalario Universitario de Granada, 18014, Granada, Spain
  - Institute of Nutrition and Food Technology, Biomedical Research Center, University of Granada, 18016, Armilla, Spain
  - Department of Pharmacology, School of Pharmacy, University of Granada, 18071, Granada, Spain
6
Puller V, Plaza Oñate F, Prifti E, de Lahondès R. Impact of simulation and reference catalogues on the evaluation of taxonomic profiling pipelines. Microb Genom 2025; 11:001330. [PMID: 39804694 PMCID: PMC11728698 DOI: 10.1099/mgen.0.001330] [Received: 05/13/2024] [Accepted: 11/06/2024] [Indexed: 01/16/2025] Open
Abstract
Microbiome profiling tools rely on reference catalogues, which significantly affect their performance. Comparing them is, however, challenging, mainly due to differences in their native catalogues. In this study, we present a novel standardized benchmarking framework that makes such comparisons more accurate. We decided not to customize databases but to translate results to a common reference to use the tools with their native environment. Specifically, we conducted two realistic simulations of gut microbiome samples, each based on a specific taxonomic profiler, and used two different taxonomic references to project their results, namely the Genome Taxonomy Database and the Unified Human Gastrointestinal Genome. To demonstrate the importance of using such a framework, we evaluated four established profilers as well as the impact of the simulations and that of the common taxonomic references on the perceived performance of these profilers. Finally, we provide guidelines to enhance future profiler comparisons for human microbiome ecosystems: (i) use or create realistic simulations tailored to your biological context (BC), (ii) identify a common feature space suited to your BC and independent of the catalogues used by the profilers and (iii) apply a comprehensive set of metrics covering accuracy (sensitivity/precision), overall representativity (richness/Shannon) and quantification (UniFrac and/or Aitchison distance).
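The three metric families named in guideline (iii) can be sketched in a few lines. The code below is an illustrative assumption rather than the authors' framework: the function names, the presence/absence definition of sensitivity and precision, and the pseudocount used to handle zeros before the centred log-ratio (CLR) transform are all choices made for this sketch. UniFrac is omitted because it requires a phylogenetic tree.

```python
import math

def shannon_index(profile):
    """Shannon diversity H = -sum(p_i * ln p_i) over taxa with nonzero abundance."""
    total = sum(profile.values())
    props = [v / total for v in profile.values() if v > 0]
    return -sum(p * math.log(p) for p in props)

def sensitivity_precision(truth, predicted):
    """Presence/absence sensitivity and precision of a predicted profile."""
    true_taxa = {t for t, v in truth.items() if v > 0}
    pred_taxa = {t for t, v in predicted.items() if v > 0}
    tp = len(true_taxa & pred_taxa)
    sensitivity = tp / len(true_taxa) if true_taxa else 0.0
    precision = tp / len(pred_taxa) if pred_taxa else 0.0
    return sensitivity, precision

def aitchison_distance(a, b, pseudocount=1e-6):
    """Euclidean distance between CLR-transformed profiles.

    The pseudocount replaces zeros before taking logs; its value is an
    assumption of this sketch, not a recommendation from the paper.
    """
    taxa = sorted(set(a) | set(b))

    def clr(profile):
        logs = [math.log(profile.get(t, 0.0) + pseudocount) for t in taxa]
        mean = sum(logs) / len(logs)
        return [x - mean for x in logs]

    ca, cb = clr(a), clr(b)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(ca, cb)))

# Toy profiles in a common feature space (hypothetical taxa A-D).
truth = {"A": 0.5, "B": 0.3, "C": 0.2}
predicted = {"A": 0.55, "B": 0.35, "D": 0.10}

print("richness (truth, predicted):",
      sum(v > 0 for v in truth.values()), sum(v > 0 for v in predicted.values()))
print("Shannon (truth, predicted):", shannon_index(truth), shannon_index(predicted))
print("sensitivity, precision:", sensitivity_precision(truth, predicted))
print("Aitchison distance:", aitchison_distance(truth, predicted))
```

All three functions take profiles keyed by the same taxon labels, which mirrors guideline (ii): the metrics are only meaningful once the profilers' outputs have been projected onto a common feature space.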
Affiliation(s)
- Vadim Puller
  - GMT Science, 75 route de Lyons-La-Foret, Rouen F-76000, France
- Edi Prifti
  - IRD, Sorbonne Université, Unité de Modélisation Mathématique et Informatique des Systèmes Complexes, UMMISCO, 32 Avenue Henri Varagnat, Bondy F-93143, France
  - Sorbonne Université, INSERM, Nutrition et Obesities; Systemic Approaches, NutriOmique, AP-HP, Hôpital Pitié-Salpêtrière, 91 Boulevard de l’Hôpital, Paris F-75013, France
7
Chaabane F, Pillonel T, Bertelli C. MeSS and assembly_finder: a toolkit for in silico metagenomic sample generation. Bioinformatics 2024; 41:btae760. [PMID: 39739308 PMCID: PMC11755095 DOI: 10.1093/bioinformatics/btae760] [Received: 05/26/2024] [Revised: 11/17/2024] [Accepted: 12/30/2024] [Indexed: 01/02/2025] Open
Abstract
Summary: The intrinsic complexity of the microbiota, combined with technical variability, renders shotgun metagenomics challenging to analyze for routine clinical or research applications. In silico data generation offers a controlled environment in which to, for example, benchmark bioinformatics tools, optimize study design and statistical power, or validate targeted applications. Here, we propose assembly_finder and the Metagenomic Sequence Simulator (MeSS), two easy-to-use Bioconda packages, as part of a benchmarking toolkit to download genomes and simulate shotgun metagenomics samples, respectively. Outperforming existing tools in speed while requiring less memory, MeSS reproducibly generates accurate complex communities based on a list of taxonomic ranks and their abundance. Availability and implementation: All code is released under the MIT License and is available at https://github.com/metagenlab/MeSS and https://github.com/metagenlab/assembly_finder.
Affiliation(s)
- Farid Chaabane
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
- Trestan Pillonel
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
- Claire Bertelli
  - Institute of Microbiology, Lausanne University Hospital and University of Lausanne, Lausanne, 1011, Switzerland
8
Sena F, Ingervo E, Khan S, Prjibelski A, Schmidt S, Tomescu A. Flowtigs: Safety in flow decompositions for assembly graphs. iScience 2024; 27:111208. [PMID: 39759024 PMCID: PMC11700653 DOI: 10.1016/j.isci.2024.111208] [Received: 08/22/2024] [Revised: 09/30/2024] [Accepted: 10/15/2024] [Indexed: 01/07/2025] Open
Abstract
A decomposition of a network flow is a set of weighted walks whose superposition equals the flow. In this article, we give a simple and linear-time-verifiable complete characterization (flowtigs) of walks that are safe in such general flow decompositions, i.e., that are subwalks of any possible flow decomposition. We provide an O(mn)-time algorithm that identifies all maximal flowtigs and represents them inside a compact structure. On the practical side, we study flowtigs in the use-case of metagenomic assembly. By using the species abundances as flow values of the metagenomic assembly graph, we can model the possible assembly solutions as flow decompositions into weighted closed walks. On simulated data, compared to reporting unitigs or maximal safe walks based only on the graph structure, reporting flowtigs results in a notably more contiguous assembly. On real data, we frame flowtigs as a heuristic and provide an algorithm that is guided by this heuristic.
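The opening definition (a set of weighted walks whose superposition equals the flow) can be made concrete with a short check. This sketch only verifies that a given set of walks is a decomposition of a given flow; it does not implement the flowtig characterization or the O(mn) algorithm from the paper. The graph, the flow values, and the function names are made up for illustration.

```python
from collections import defaultdict

def superpose(walks):
    """Sum walk weights over every edge traversal.

    Each walk is (weight, [v0, v1, ..., vk]); an edge traversed twice by the
    same walk contributes its weight twice.
    """
    flow = defaultdict(float)
    for weight, nodes in walks:
        for u, v in zip(nodes, nodes[1:]):
            flow[(u, v)] += weight
    return dict(flow)

def is_decomposition(flow, walks, tol=1e-9):
    """True if the superposition of the walks equals the flow on every edge."""
    combined = superpose(walks)
    edges = set(flow) | set(combined)
    return all(abs(flow.get(e, 0.0) - combined.get(e, 0.0)) <= tol for e in edges)

# Toy flow: two closed walks that share the edge a -> b.
flow = {("a", "b"): 5, ("b", "c"): 3, ("c", "a"): 3, ("b", "d"): 2, ("d", "a"): 2}
walks = [
    (3, ["a", "b", "c", "a"]),
    (2, ["a", "b", "d", "a"]),
]
print(is_decomposition(flow, walks))
```

In the metagenomic setting described in the abstract, the edge values would come from species abundances on the assembly graph, and the closed walks would correspond to candidate genome reconstructions.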
Affiliation(s)
- Shahbaz Khan
  - Indian Institute of Technology Roorkee, Roorkee, India
9
Liu Y, Li Y, Chen E, Xu J, Zhang W, Zeng X, Luo X. Repeat and haplotype aware error correction in nanopore sequencing reads with DeChat. Commun Biol 2024; 7:1678. [PMID: 39702496 DOI: 10.1038/s42003-024-07376-y] [Received: 08/28/2024] [Accepted: 12/05/2024] [Indexed: 12/21/2024] Open
Abstract
Error self-correction is crucial for analyzing long-read sequencing data, but existing methods often struggle with noisy data or are tailored to technologies like PacBio HiFi. There is a gap in methods optimized for Nanopore R10 simplex reads, which typically have error rates below 2%. We introduce DeChat, a novel approach designed specifically for these reads. DeChat enables repeat- and haplotype-aware error correction by combining de Bruijn graphs with variant-aware multiple sequence alignment. This approach avoids read overcorrection, ensuring that variants in repeats and haplotypes are preserved while sequencing errors are accurately corrected. Benchmarking on simulated and real datasets shows that DeChat-corrected reads have significantly fewer errors (up to two orders of magnitude lower) than reads corrected by other methods, without losing read information. Furthermore, DeChat-corrected reads clearly improve genome assembly and taxonomic classification.
Affiliation(s)
- Yuansheng Liu
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Yichen Li
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Enlian Chen
  - College of Biology, Hunan University, Changsha, China
- Jialu Xu
  - College of Biology, Hunan University, Changsha, China
- Wenhai Zhang
  - College of Biology, Hunan University, Changsha, China
- Xiangxiang Zeng
  - College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
- Xiao Luo
  - College of Biology, Hunan University, Changsha, China
10
Sankaran K, Kodikara S, Li JJ, Cao KAL. Semisynthetic simulation for microbiome data analysis. Brief Bioinform 2024; 26:bbaf051. [PMID: 39927858 PMCID: PMC11808806 DOI: 10.1093/bib/bbaf051] [Received: 10/15/2024] [Revised: 12/19/2024] [Accepted: 01/23/2025] [Indexed: 02/11/2025] Open
Abstract
High-throughput sequencing data lie at the heart of modern microbiome research. Effective analysis of these data requires careful preprocessing, modeling, and interpretation to detect subtle signals and avoid spurious associations. In this review, we discuss how simulation can serve as a sandbox to test candidate approaches, creating a setting that mimics real data while providing ground truth. This is particularly valuable for power analysis, methods benchmarking, and reliability analysis. We explain the probability, multivariate analysis, and regression concepts behind modern simulators and how different implementations make trade-offs between generality, faithfulness, and controllability. Recognizing that all simulators only approximate reality, we review methods to evaluate how accurately they reflect key properties. We also present case studies demonstrating the value of simulation in differential abundance testing, dimensionality reduction, network analysis, and data integration. Code for these examples is available in an online tutorial (https://go.wisc.edu/8994yz) that can be easily adapted to new problem settings.
Affiliation(s)
- Kris Sankaran
  - Department of Statistics, University of Wisconsin-Madison, 1300 University Ave, Madison, WI 53703, United States
- Saritha Kodikara
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Building 184/30 Royal Parade, Melbourne, VIC 3052, Australia
- Jingyi Jessica Li
  - Department of Statistics and Data Science, University of California, Los Angeles, 520 Portola Plaza, Los Angeles, CA 90095, United States
  - Department of Human Genetics, University of California, Los Angeles, 695 Charles E Young Dr S, Los Angeles, CA 90095, United States
  - Department of Biostatistics, University of California, Los Angeles, 650 Charles E. Young Dr S, Los Angeles, CA 90095, United States
- Kim-Anh Lê Cao
  - Melbourne Integrative Genomics, School of Mathematics and Statistics, University of Melbourne, Building 184/30 Royal Parade, Melbourne, VIC 3052, Australia
11
Nickols WA, McIver LJ, Walsh A, Zhang Y, Nearing JT, Asnicar F, Punčochář M, Segata N, Nguyen LH, Hartmann EM, Franzosa EA, Huttenhower C, Thompson KN. Evaluating metagenomic analyses for undercharacterized environments: what's needed to light up the microbial dark matter? bioRxiv 2024:2024.11.08.622677. [PMID: 39574575 PMCID: PMC11580994 DOI: 10.1101/2024.11.08.622677] [Indexed: 11/29/2024]
Abstract
Non-human-associated microbial communities play important biological roles, but they remain less understood than human-associated communities. Here, we assess the impact of key environmental sample properties on a variety of state-of-the-art metagenomic analysis methods. In simulated datasets, all methods performed similarly at high taxonomic ranks, but newer marker-based methods incorporating metagenomic assembled genomes outperformed others at lower taxonomic levels. In real environmental data, taxonomic profiles assigned to the same sample by different methods showed little agreement at lower taxonomic levels, but the methods agreed better on community diversity estimates and estimates of the relationships between environmental parameters and microbial profiles.
Collapse
Affiliation(s)
- William A. Nickols
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Lauren J. McIver
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
| | - Aaron Walsh
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Yancong Zhang
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Jacob T. Nearing
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Francesco Asnicar
- Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento, Italy
| | - Michal Punčochář
- Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento, Italy
| | - Nicola Segata
- Department of Cellular, Computational and Integrative Biology (CIBIO), University of Trento, Trento, Italy
| | - Long H. Nguyen
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
| | - Erica M. Hartmann
- Department of Civil and Environmental Engineering, McCormick School of Engineering, Northwestern University, Evanston, IL, USA
- Center for Synthetic Biology, Northwestern University, Evanston, IL, USA
- Department of Medicine/Division of Pulmonary Medicine, Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Eric A. Franzosa
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
| | - Curtis Huttenhower
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Gastroenterology, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
- Department of Immunology and Infectious Diseases, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
| | - Kelsey N. Thompson
- Department of Biostatistics, T.H. Chan School of Public Health, Harvard University, Boston, MA, USA
- Harvard Chan Microbiome in Public Health Center, Harvard T. H. Chan School of Public Health, Boston, MA, USA
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, USA
|
12
|
Gulyás G, Kakuk B, Dörmő Á, Járay T, Prazsák I, Csabai Z, Henkrich MM, Boldogkői Z, Tombácz D. Cross-comparison of gut metagenomic profiling strategies. Commun Biol 2024; 7:1445. [PMID: 39505993 PMCID: PMC11541596 DOI: 10.1038/s42003-024-07158-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/02/2024] [Accepted: 10/28/2024] [Indexed: 11/08/2024] Open
Abstract
The rapid advancements in sequencing technologies and bioinformatics have enabled metagenomic research of complex microbial systems, but reliable results depend on consistent laboratory and bioinformatics approaches. Current efforts to identify best practices often focus on optimizing specific steps, making it challenging to understand the influence of each stage on microbial population analysis and to compare data across studies. This study evaluated DNA extraction, library construction methodologies, sequencing platforms, and computational approaches using a dog stool sample, two synthetic microbial community mixtures, and various sequencing data sources. Our work is the most comprehensive evaluation of metagenomic methods to date. We developed a software tool, termed minitax, which provides consistent results across the range of platforms and methodologies. Our findings showed that the Zymo Research Quick-DNA HMW MagBead Kit, Illumina DNA Prep library preparation method, and the minitax bioinformatics tool were the most effective for high-quality microbial diversity analysis. However, the effectiveness of pipelines or method combinations is sample-specific, making it difficult to identify a universally optimal approach. Therefore, employing multiple approaches is crucial for obtaining reliable outcomes in microbial systems.
Affiliation(s)
- Gábor Gulyás
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - Balázs Kakuk
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - Ákos Dörmő
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - Tamás Járay
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - István Prazsák
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - Zsolt Csabai
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - Miksa Máté Henkrich
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary
| | - Zsolt Boldogkői
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary.
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary.
| | - Dóra Tombácz
- Department of Medical Biology, Faculty of Medicine, University of Szeged, Szeged, Hungary.
- MTA-SZTE Lendület GeMiNI Research Group, University of Szeged, Szeged, Hungary.
|
13
|
Kang X, Zhang W, Li Y, Luo X, Schönhuth A. HyLight: Strain aware assembly of low coverage metagenomes. Nat Commun 2024; 15:8665. [PMID: 39375348 PMCID: PMC11458758 DOI: 10.1038/s41467-024-52907-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2023] [Accepted: 09/23/2024] [Indexed: 10/09/2024] Open
Abstract
Different strains of the same species can vary substantially in their spectrum of biomedically relevant phenotypes. Reconstructing the genomes of microbial communities at the level of their strains poses significant challenges, because sequencing errors can obscure strain-specific variants. Next-generation sequencing (NGS) reads are too short to resolve complex genomic regions. Third-generation sequencing (TGS) reads, although longer, are prone to higher error rates or are substantially more expensive. Limiting TGS coverage to reduce costs compromises the accuracy of the assemblies. This explains why prior approaches suffer from losses in strain awareness or accuracy, from excessive costs, or from combinations thereof. We introduce HyLight, a metagenome assembly approach that addresses these challenges by combining the complementary strengths of TGS and NGS data. HyLight employs strain-resolved overlap graphs (OG) to accurately reconstruct individual strains within microbial communities. Our experiments demonstrate that HyLight produces strain-aware and contiguous assemblies at minimal error content, while significantly reducing costs by utilizing low-coverage TGS data. HyLight achieves an average improvement of 19.05% in preserving strain identity and demonstrates near-complete strain awareness across diverse datasets. In summary, HyLight offers considerable advances in metagenome assembly, insofar as it delivers significantly enhanced strain awareness, contiguity, and accuracy without the typical compromises observed in existing approaches.
Affiliation(s)
- Xiongbin Kang
- College of Biology, Hunan University, Changsha, China
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Wenhai Zhang
- College of Biology, Hunan University, Changsha, China
| | - Yichen Li
- College of Computer Science and Electronic Engineering, Hunan University, Changsha, China
| | - Xiao Luo
- College of Biology, Hunan University, Changsha, China.
| | - Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
|
14
|
Yang Z, Shan Y, Liu X, Chen G, Pan Y, Gou Q, Zou J, Chang Z, Zeng Q, Yang C, Kong J, Sun Y, Li S, Zhang X, Wu WC, Li C, Peng H, Holmes EC, Guo D, Shi M. VirID: Beyond Virus Discovery-An Integrated Platform for Comprehensive RNA Virus Characterization. Mol Biol Evol 2024; 41:msae202. [PMID: 39331699 PMCID: PMC11523140 DOI: 10.1093/molbev/msae202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/11/2024] [Revised: 09/10/2024] [Accepted: 09/24/2024] [Indexed: 09/29/2024] Open
Abstract
RNA viruses exhibit vast phylogenetic diversity and can significantly impact public health and agriculture. However, current bioinformatics tools for viral discovery from metagenomic data frequently generate false positive virus results, overestimate viral diversity, and misclassify virus sequences. Additionally, current tools often fail to determine virus-host associations, which hampers investigation of the potential threat posed by a newly detected virus. To address these issues, we developed VirID, a software tool specifically designed for the discovery and characterization of RNA viruses from metagenomic data. The basis of VirID is a comprehensive RNA-dependent RNA polymerase database that enhances a workflow comprising RNA virus discovery, phylogenetic analysis, and phylogeny-based virus characterization. Benchmark tests on a simulated data set demonstrated that VirID had high accuracy in profiling viruses and estimating viral richness. In evaluations with real-world samples, VirID not only identified RNA viruses of all types but also provided accurate estimations of viral genetic diversity and virus classification, as well as comprehensive insights into virus associations with humans, animals, and plants. VirID therefore offers a robust tool for virus discovery and serves as a valuable resource in basic virological studies, pathogen surveillance, and early warning systems for infectious disease outbreaks.
Affiliation(s)
- Ziyue Yang
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Yongtao Shan
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Xue Liu
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Guowei Chen
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), China
| | - Yuanfei Pan
- Ministry of Education Key Laboratory of Biodiversity Science and Ecological Engineering, School of Life Sciences, Fudan University, Shanghai, China
| | - Qinyu Gou
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Jie Zou
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Zilong Chang
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Qiang Zeng
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Chunhui Yang
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Jianbin Kong
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Kowloon, Hong Kong (SAR), China
| | - Shaochuan Li
- Goodwill Institute of Life Sciences, Guangzhou, China
| | - Xu Zhang
- Goodwill Institute of Life Sciences, Guangzhou, China
| | - Wei-chen Wu
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Chunmei Li
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Hong Peng
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
| | - Edward C Holmes
- School of Medical Sciences, The University of Sydney, Sydney, New South Wales, Australia
- Laboratory of Data Discovery for Health Limited, Hong Kong (SAR), China
| | - Deyin Guo
- Guangzhou National Laboratory, Guangzhou International Bio-Island, Guangzhou, China
- State Key Laboratory of Respiratory Disease, National Clinical Research Center for Respiratory Disease, Guangzhou Institute of Respiratory Health, The First Affiliated Hospital of Guangzhou Medical University, Guangzhou, Guangdong, China
| | - Mang Shi
- State Key Laboratory for Biocontrol, School of Medicine, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Diseases, Sun Yat-sen University, Shenzhen, China
- Shenzhen Key Laboratory for Systems Medicine in Inflammatory Diseases, Shenzhen Campus of Sun Yat-sen University, Sun Yat-sen University, Shenzhen, China
- Guangdong Provincial Center for Disease Control and Prevention, Guangzhou, China
|
15
|
Ciuchcinski K, Stokke R, Steen IH, Dziewit L. Landscape of the metaplasmidome of deep-sea hydrothermal vents located at Arctic Mid-Ocean Ridges in the Norwegian-Greenland Sea: ecological insights from comparative analysis of plasmid identification tools. FEMS Microbiol Ecol 2024; 100:fiae124. [PMID: 39271469 PMCID: PMC11451466 DOI: 10.1093/femsec/fiae124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/26/2024] [Revised: 09/04/2024] [Accepted: 09/12/2024] [Indexed: 09/15/2024] Open
Abstract
Plasmids are one of the key drivers of microbial adaptation and evolution. However, their diversity and role in adaptation, especially in extreme environments, remain largely unexplored. In this study, we aimed to identify, characterize, and compare plasmid sequences originating from samples collected from deep-sea hydrothermal vents located in Arctic Mid-Ocean Ridges. To achieve this, we employed and benchmarked three recently developed plasmid identification tools (PlasX, GeNomad, and PLASMe) on metagenomic data from this unique ecosystem. To date, this is the first direct comparison of these computational methods in the context of data from extreme environments. Upon recovery of plasmid contigs, we performed a multiapproach analysis, focusing on identifying taxonomic and functional biases within datasets originating from each tool. Next, we implemented a majority voting system to identify high-confidence plasmid contigs, enhancing the reliability of our findings. By analysing the consensus plasmid sequences, we gained insights into their diversity, ecological roles, and adaptive significance. Within the high-confidence sequences, we identified a high abundance of Pseudomonadota and Campylobacterota, as well as multiple toxin-antitoxin systems. Our findings provide a deeper understanding of how plasmids contribute to shaping microbial communities living under the extreme conditions of hydrothermal vents, potentially uncovering novel adaptive mechanisms.
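The majority voting step mentioned in this abstract can be illustrated with a minimal Python sketch. The abstract does not state the voting threshold or the data format, so the function name `majority_vote`, the `min_votes=2` default (two of three tools), and the contig identifiers below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def majority_vote(predictions, min_votes=2):
    """Return contigs flagged as plasmid by at least `min_votes` tools.

    `predictions` maps a tool name to the set of contig IDs that tool
    classified as plasmid. The threshold of 2 is an assumed default.
    """
    votes = Counter()
    for contigs in predictions.values():
        votes.update(set(contigs))  # each tool votes once per contig
    return {contig for contig, n in votes.items() if n >= min_votes}


# Hypothetical calls from the three tools named in the abstract.
calls = {
    "PlasX": {"contig_1", "contig_2", "contig_3"},
    "GeNomad": {"contig_2", "contig_3", "contig_4"},
    "PLASMe": {"contig_3", "contig_5"},
}
high_confidence = majority_vote(calls)
print(sorted(high_confidence))
```

Raising `min_votes` to 3 would keep only contigs on which all tools agree, trading recall for precision.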
Affiliation(s)
- Karol Ciuchcinski
- Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, 00-927, Warsaw, Poland
| | - Runar Stokke
- Department of Biological Sciences, Center for Deep Sea Research, University of Bergen, N-5020, Bergen, Norway
| | - Ida Helene Steen
- Department of Biological Sciences, Center for Deep Sea Research, University of Bergen, N-5020, Bergen, Norway
| | - Lukasz Dziewit
- Department of Environmental Microbiology and Biotechnology, Institute of Microbiology, Faculty of Biology, University of Warsaw, 00-927, Warsaw, Poland
|
16
|
Espindola AS. Simulated High Throughput Sequencing Datasets: A Crucial Tool for Validating Bioinformatic Pathogen Detection Pipelines. Biology (Basel) 2024; 13:700. [PMID: 39336128 PMCID: PMC11428249 DOI: 10.3390/biology13090700] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2024] [Revised: 09/03/2024] [Accepted: 09/03/2024] [Indexed: 09/30/2024]
Abstract
The validation of diagnostic assays in plant pathogen detection is a critical area of research. It requires the use of both negative and positive controls containing a known quantity of the target pathogen, which are crucial elements when calculating analytical sensitivity and specificity, among other diagnostic performance metrics. High Throughput Sequencing (HTS) is a method that allows the simultaneous detection of a theoretically unlimited number of plant pathogens. However, accurately identifying the pathogen from HTS data is directly related to the bioinformatic pipeline utilized and its effectiveness at correctly assigning reads to their associated taxa. To this day, there is no consensus about the pipeline that should be used to detect the pathogens in HTS data, and results often undergo review and scientific evaluation. It is, therefore, imperative to establish HTS resources tailored for evaluating the performance of bioinformatic pipelines utilized in plant pathogen detection. Standardized artificial HTS datasets can be used as a benchmark by allowing users to test their pipelines for various pathogen infection scenarios, some of the most prevalent being multiple infections, low titer pathogens, mutations, and new strains, among others. Having these artificial HTS datasets in the hands of HTS diagnostic assay validators can help resolve challenges encountered when implementing bioinformatics pipelines for routine pathogen detection. Offering these purely artificial HTS datasets as benchmarking tools will significantly advance research on plant pathogen detection using HTS and enable a more robust and standardized evaluation of the bioinformatic methods, thereby enhancing the field of plant pathogen detection.
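As a rough illustration of how a simulated dataset with known composition lets a pipeline be scored, the Python sketch below compares the taxa a pipeline reports against the taxa that were spiked in. It uses the confusion-matrix definitions of sensitivity and specificity; "analytical sensitivity" is sometimes defined instead as a limit of detection, which this sketch does not cover. The function name `benchmark_detection`, the taxon labels, and the assumption that every reported taxon belongs to a fixed candidate panel are hypothetical.

```python
def benchmark_detection(truth, detected, candidates):
    """Score detected taxa against the known content of a simulated dataset.

    truth:      taxa actually present in the simulated reads
    detected:   taxa the pipeline reported
    candidates: every taxon the pipeline could report (assumed panel)
    """
    truth, detected, candidates = set(truth), set(detected), set(candidates)
    tp = len(truth & detected)               # present and reported
    fn = len(truth - detected)               # present but missed
    fp = len(detected - truth)               # reported but absent
    tn = len(candidates - truth - detected)  # absent and not reported
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return {"tp": tp, "fn": fn, "fp": fp, "tn": tn,
            "sensitivity": sensitivity, "specificity": specificity}


# Hypothetical panel of five pathogens, two of which were simulated.
panel = {"virus_A", "virus_B", "bacterium_C", "fungus_D", "viroid_E"}
result = benchmark_detection(
    truth={"virus_A", "virus_B"},
    detected={"virus_A", "bacterium_C"},
    candidates=panel,
)
print(result)
```

Running the same scoring over datasets that simulate mixed infections or low-titer pathogens would show how a pipeline's error profile changes across scenarios.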
Affiliation(s)
- Andres S Espindola
- Department of Entomology and Plant Pathology, Oklahoma State University, Stillwater, OK 74078, USA
|
17
|
Hera MR, Liu S, Wei W, Rodriguez JS, Ma C, Koslicki D. Metagenomic functional profiling: to sketch or not to sketch? Bioinformatics 2024; 40:ii165-ii173. [PMID: 39230701 PMCID: PMC11373326 DOI: 10.1093/bioinformatics/btae397] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
MOTIVATION Functional profiling of metagenomic samples is essential to decipher the functional capabilities of microbial communities. Traditional and more widely used functional profilers in the context of metagenomics rely on aligning reads against a known reference database. However, aligning sequencing reads against a large and fast-growing database is computationally expensive. In general, k-mer-based sketching techniques have been successfully used in metagenomics to address this bottleneck, notably in taxonomic profiling. In this work, we describe leveraging FracMinHash (implemented in sourmash, a publicly available software), a k-mer-sketching algorithm, to obtain functional profiles of metagenome samples. RESULTS We show how pieces of the sourmash software (and the resulting FracMinHash sketches) can be put together in a pipeline to functionally profile a metagenomic sample. We named our pipeline fmh-funprofiler. We report that the functional profiles obtained using this pipeline demonstrate comparable completeness and better purity compared to the profiles obtained using other alignment-based methods when applied to simulated metagenomic data. We also report that fmh-funprofiler is 39-99× faster in wall-clock time, and consumes up to 40-55× less memory. Coupled with the KEGG database, this method not only replicates fundamental biological insights but also highlights novel signals from the Human Microbiome Project datasets. AVAILABILITY AND IMPLEMENTATION This fast and lightweight metagenomic functional profiler is freely available and can be accessed here: https://github.com/KoslickiLab/fmh-funprofiler. All scripts of the analyses we present in this manuscript can be found on GitHub.
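The idea behind FracMinHash can be sketched in a few lines of Python: hash every k-mer and keep only the hashes that fall below a fixed fraction of the hash space, so that sketches of different sequences remain directly comparable. This is a simplified stand-in and not the sourmash or fmh-funprofiler code. The BLAKE2b hash, the function names, the default `k` and `scale` values, and the omission of reverse-complement canonicalization are all assumptions made to keep the example self-contained.

```python
import hashlib

MAX_HASH = 2 ** 64  # size of the 64-bit hash space used below


def kmers(seq, k):
    """Yield every overlapping k-mer of `seq`."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (BLAKE2b is an assumed choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash(seq, k=21, scale=10):
    """Keep the k-mer hashes that fall in the lowest 1/scale of hash space."""
    threshold = MAX_HASH // scale
    return {h for h in map(hash_kmer, kmers(seq, k)) if h < threshold}


def containment(query_sketch, ref_sketch):
    """Fraction of the query sketch that is also in the reference sketch."""
    if not query_sketch:
        return 0.0
    return len(query_sketch & ref_sketch) / len(query_sketch)


# Toy sequences; scale=1 keeps every k-mer so the tiny example is not empty.
reference = "ATGGCGTACGTTAGCTAGGATCCGATCGATTACGCGTAGCTAGCTAACG"
read = reference[10:35]
ref_sketch = frac_minhash(reference, k=7, scale=1)
read_sketch = frac_minhash(read, k=7, scale=1)
print(containment(read_sketch, ref_sketch))
```

Because the retention rule depends only on the hash value, the sketch of a subsequence is always a subset of the sketch of the full sequence at the same `k` and `scale`, which is what makes containment estimates between a sample and a reference gene set possible without alignment.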
Affiliation(s)
- Mahmudur Rahman Hera
- School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, Pennsylvania 16802, United States
| | - Shaopeng Liu
- Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
| | - Wei Wei
- Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
| | - Judith S Rodriguez
- Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
| | - Chunyu Ma
- Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
| | - David Koslicki
- School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Bioinformatics and Genomics, Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, Pennsylvania 16802, United States
- Department of Biology, Pennsylvania State University, University Park, Pennsylvania 16802, United States
|
18
|
Sanguineti D, Zampieri G, Treu L, Campanaro S. Metapresence: a tool for accurate species detection in metagenomics based on the genome-wide distribution of mapping reads. mSystems 2024; 9:e0021324. [PMID: 38980053 PMCID: PMC11338496 DOI: 10.1128/msystems.00213-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/14/2024] [Accepted: 06/15/2024] [Indexed: 07/10/2024] Open
Abstract
Shotgun metagenomics allows comprehensive sampling of the genomic information of microbes in a given environment and is a tool of choice for studying complex microbial systems. Mapping sequencing reads against a set of reference or metagenome-assembled genomes is in principle a simple and powerful approach to define the species-level composition of the microbial community under investigation. However, despite the widespread use of this approach, there is no established way to properly interpret the alignment results, with arbitrary relative abundance thresholds being routinely used to discriminate between present and absent species. Such an approach can be affected by significant biases, especially in the identification of rare species. Therefore, it is important to develop new metrics to overcome these biases. Here, we present Metapresence, a new tool to perform reliable identification of the species in metagenomic samples based on the distribution of mapped reads on the reference genomes. The analysis is based on two metrics describing the breadth of coverage and the genomic distance between consecutive reads. We demonstrate the high precision and wide applicability of the tool using data from various synthetic communities, a real mock community, and the gut microbiome of healthy individuals and antibiotic-associated-diarrhea patients. Overall, our results suggest that the proposed approach has a robust performance in hard-to-analyze microbial communities containing contaminated or closely related genomes in low abundance.
IMPORTANCE Despite the prevalent use of genome-centric alignment-based methods to characterize microbial community composition, a standardized approach for accurately identifying the species within a sample is lacking. Currently, arbitrary relative abundance thresholds are commonly employed for this purpose. However, due to the inherent complexity of genome structure and biases associated with genome-centric approaches, this practice tends to be imprecise. Notably, it introduces significant biases, particularly in the identification of rare species. The method presented here addresses these limitations and contributes significantly to overcoming inaccuracies in precisely defining community composition, especially when dealing with rare members.
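The two quantities named in this abstract, breadth of coverage and the distance between consecutive reads, can be computed from read alignment intervals as in the Python sketch below. The abstract does not give Metapresence's exact metric definitions or decision thresholds, so this only shows the raw quantities. The function names, the 0-based half-open interval convention, and the toy coordinates are assumptions.

```python
def breadth_of_coverage(intervals, genome_length):
    """Fraction of genome positions covered by at least one read.

    `intervals` holds (start, end) alignments, 0-based and half-open.
    """
    covered = 0
    current_end = 0  # rightmost position counted so far
    for start, end in sorted(intervals):
        start = max(start, current_end)  # skip bases already counted
        end = min(end, genome_length)
        if end > start:
            covered += end - start
            current_end = end
    return covered / genome_length


def mean_read_distance(intervals):
    """Mean distance between start positions of consecutive mapped reads."""
    starts = sorted(start for start, _ in intervals)
    if len(starts) < 2:
        return None  # undefined with fewer than two reads
    gaps = [b - a for a, b in zip(starts, starts[1:])]
    return sum(gaps) / len(gaps)


# Toy alignments on a hypothetical 1,000 bp reference.
reads = [(0, 100), (50, 150), (400, 500)]
print(breadth_of_coverage(reads, 1000))  # 0.25
print(mean_read_distance(reads))         # 200.0
```

Reads from a genome that is truly present are expected to spread across the reference, whereas reads piling up in one region give low breadth and irregular spacing for the same read count, which is the kind of signal an abundance threshold alone cannot capture.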
Affiliation(s)
| | - Guido Zampieri
- Department of Biology, University of Padova, Padova, Italy
| | - Laura Treu
- Department of Biology, University of Padova, Padova, Italy
|
19
|
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024] Open
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Affiliation(s)
- Vijini Mallawaarachchi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Anuradha Wickramarachchi
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Hansheng Xue
- School of Computing, National University of Singapore, Singapore 119077, Singapore
| | - Bhavya Papudeshi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Susanna R Grigson
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - George Bouras
- Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
| | - Rosa E Prahl
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Anubhav Kaphle
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Andrey Verich
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
| | - Berenice Talamantes-Becerra
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
| | - Elizabeth A Dinsdale
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| | - Robert A Edwards
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
| |
|
20
|
Zhang Z, Xiao J, Wang H, Yang C, Huang Y, Yue Z, Chen Y, Han L, Yin K, Lyu A, Fang X, Zhang L. Exploring high-quality microbial genomes by assembling short-reads with long-range connectivity. Nat Commun 2024; 15:4631. [PMID: 38821971 PMCID: PMC11143213 DOI: 10.1038/s41467-024-49060-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/20/2023] [Accepted: 05/17/2024] [Indexed: 06/02/2024] Open
Abstract
Although long-read sequencing enables the generation of complete genomes for unculturable microbes, its high cost limits the widespread adoption of long-read sequencing in large-scale metagenomic studies. An alternative method is to assemble short-reads with long-range connectivity, which can be a cost-effective way to generate high-quality microbial genomes. Here, we develop Pangaea, a bioinformatic approach designed to enhance metagenome assembly using short-reads with long-range connectivity. Pangaea leverages connectivity derived from physical barcodes of linked-reads or virtual barcodes by aligning short-reads to long-reads. Pangaea utilizes a deep learning-based read binning algorithm to assemble co-barcoded reads exhibiting similar sequence contexts and abundances, thereby improving the assembly of high- and medium-abundance microbial genomes. Pangaea also leverages a multi-thresholding algorithm strategy to refine assembly for low-abundance microbes. We benchmark Pangaea on linked-reads and a combination of short- and long-reads from simulation data, mock communities and human gut metagenomes. Pangaea achieves significantly higher contig continuity as well as more near-complete metagenome-assembled genomes (NCMAGs) than the existing assemblers. Pangaea also generates three complete and circular NCMAGs on the human gut microbiomes.
Grants
- This research was partially supported by the Young Collaborative Research Grant (C2004-23Y, L.Z.), HMRF (11221026, L.Z.), the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220012, L.Z.), the Hong Kong Research Grant Council Early Career Scheme (HKBU 22201419, L.Z.), HKBU Start-up Grant Tier 2 (RC-SGT2/19-20/SCI/007, L.Z.), HKBU IRCMS (No. IRCMS/19-20/D02, L.Z.).
- This research was partially supported by the open project of BGI-Shenzhen, Shenzhen 518000, China (BGIRSZ20220014, KJ.Y.).
- The study was partially supported by the Science Technology and Innovation Committee of Shenzhen Municipality, China (SGDX20190919142801722, XD.F.).
Affiliation(s)
- Zhenmiao Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Jin Xiao
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Hongbo Wang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | - Chao Yang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
| | | | - Zhen Yue
- BGI Research, Sanya, 572025, China
| | - Yang Chen
- State Key Laboratory of Dampness Syndrome of Chinese Medicine, The Second Affiliated Hospital of Guangzhou University of Chinese Medicine, Guangzhou, China
| | - Lijuan Han
- Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
| | - Kejing Yin
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China
| | - Aiping Lyu
- School of Chinese Medicine, Hong Kong Baptist University, Hong Kong, China
| | - Xiaodong Fang
- BGI Research, Shenzhen, 518083, China
- BGI Research, Sanya, 572025, China
- Department of Scientific Research, Kangmeihuada GeneTech Co., Ltd (KMHD), Shenzhen, China
| | - Lu Zhang
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, China.
- Institute for Research and Continuing Education, Hong Kong Baptist University, Shenzhen, China.
| |
|
21
|
Yu R, Huang Z, Lam TYC, Sun Y. Utilizing profile hidden Markov model databases for discovering viruses from metagenomic data: a comprehensive review. Brief Bioinform 2024; 25:bbae292. [PMID: 39003531 PMCID: PMC11246558 DOI: 10.1093/bib/bbae292] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/01/2024] [Revised: 05/19/2024] [Accepted: 06/04/2024] [Indexed: 07/15/2024] Open
Abstract
Profile hidden Markov models (pHMMs) are able to achieve high sensitivity in remote homology search, making them popular choices for detecting novel or highly diverged viruses in metagenomic data. However, many existing pHMM databases have different design focuses, making it difficult for users to decide the proper one to use. In this review, we provide a thorough evaluation and comparison for multiple commonly used profile HMM databases for viral sequence discovery in metagenomic data. We characterized the databases by comparing their sizes, their taxonomic coverage, and the properties of their models using quantitative metrics. Subsequently, we assessed their performance in virus identification across multiple application scenarios, utilizing both simulated and real metagenomic data. We aim to offer researchers a thorough and critical assessment of the strengths and limitations of different databases. Furthermore, based on the experimental results obtained from the simulated and real metagenomic data, we provided practical suggestions for users to optimize their use of pHMM databases, thus enhancing the quality and reliability of their findings in the field of viral metagenomics.
Affiliation(s)
- Runzhou Yu
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
| | - Ziyi Huang
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
| | - Theo Y C Lam
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
| | - Yanni Sun
- Department of Electrical Engineering, City University of Hong Kong, Tat Chee Avenue, Kowloon, Hong Kong, China
| |
|
22
|
Pinto Y, Chakraborty M, Jain N, Bhatt AS. Phage-inclusive profiling of human gut microbiomes with Phanta. Nat Biotechnol 2024; 42:651-662. [PMID: 37231259 DOI: 10.1038/s41587-023-01799-4] [Citation(s) in RCA: 19] [Impact Index Per Article: 19.0] [Received: 08/04/2022] [Accepted: 04/20/2023] [Indexed: 05/27/2023]
Abstract
Due to technical limitations, most gut microbiome studies have focused on prokaryotes, overlooking viruses. Phanta, a virome-inclusive gut microbiome profiling tool, overcomes the limitations of assembly-based viral profiling methods by using customized k-mer-based classification tools and incorporating recently published catalogs of gut viral genomes. Phanta's optimizations consider the small genome size of viruses, sequence homology with prokaryotes and interactions with other gut microbes. Extensive testing of Phanta on simulated data demonstrates that it quickly and accurately quantifies prokaryotes and viruses. When applied to 245 fecal metagenomes from healthy adults, Phanta identifies ~200 viral species per sample, ~5× more than standard assembly-based methods. We observe a ~2:1 ratio between DNA viruses and bacteria, with higher interindividual variability of the gut virome compared to the gut bacteriome. In another cohort, we observe that Phanta performs equally well on bulk versus virus-enriched metagenomes, making it possible to study prokaryotes and viruses in a single experiment, with a single analysis.
Affiliation(s)
- Yishay Pinto
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
| | | | - Navami Jain
- Department of Genetics, Stanford University, Stanford, CA, USA
- Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA
| | - Ami S Bhatt
- Department of Genetics, Stanford University, Stanford, CA, USA.
- Department of Medicine, Divisions of Hematology and Blood & Marrow Transplantation, Stanford University, Stanford, CA, USA.
| |
|
23
|
Sepich-Poore GD, McDonald D, Kopylova E, Guccione C, Zhu Q, Austin G, Carpenter C, Fraraccio S, Wandro S, Kosciolek T, Janssen S, Metcalf JL, Song SJ, Kanbar J, Miller-Montgomery S, Heaton R, Mckay R, Patel SP, Swafford AD, Korem T, Knight R. Robustness of cancer microbiome signals over a broad range of methodological variation. Oncogene 2024; 43:1127-1148. [PMID: 38396294 PMCID: PMC10997506 DOI: 10.1038/s41388-024-02974-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2023] [Revised: 02/03/2024] [Accepted: 02/07/2024] [Indexed: 02/25/2024]
Abstract
In 2020, we identified cancer-specific microbial signals in The Cancer Genome Atlas (TCGA) [1]. Multiple peer-reviewed papers independently verified or extended our findings [2-12]. Given this impact, we carefully considered concerns by Gihawi et al. [13] that batch correction and database contamination with host sequences artificially created the appearance of cancer type-specific microbiomes. (1) We tested batch correction by comparing raw and Voom-SNM-corrected data per-batch, finding predictive equivalence and significantly similar features. We found consistent results with a modern microbiome-specific method (ConQuR [14]), and when restricting to taxa found in an independent, highly-decontaminated cohort. (2) Using Conterminator [15], we found low levels of human contamination in our original databases (~1% of genomes). We demonstrated that the increased detection of human reads in Gihawi et al. [13] was due to using a newer human genome reference. (3) We developed Exhaustive, a method twice as sensitive as Conterminator, to clean RefSeq. We comprehensively host-depleted TCGA with many human (pan)genome references. We repeated all analyses with this and the Gihawi et al. [13] pipeline, and found cancer type-specific microbiomes. These extensive re-analyses and updated methods validate our original conclusion that cancer type-specific microbial signatures exist in TCGA, and show they are robust to methodology.
Affiliation(s)
- Gregory D Sepich-Poore
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- Micronoma, San Diego, CA, USA
- Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
| | - Daniel McDonald
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Evguenia Kopylova
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Clarity Genomics, Antwerp, Belgium
| | - Caitlin Guccione
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
| | - Qiyun Zhu
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- School of Life Sciences, Arizona State University, Tempe, AZ, USA
| | - George Austin
- Department of Biomedical Informatics, Columbia University Irving Medical Center, New York, NY, USA
- Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
| | - Carolina Carpenter
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
| | - Serena Fraraccio
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
- Micronoma, San Diego, CA, USA
| | - Stephen Wandro
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
- Micronoma, San Diego, CA, USA
| | - Tomasz Kosciolek
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Malopolska Centre of Biotechnology, Jagiellonian University in Kraków, Kraków, Poland
| | - Stefan Janssen
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Algorithmic Bioinformatics, Department of Biology and Chemistry, Justus Liebig University Gießen, Gießen, Germany
| | - Jessica L Metcalf
- Department of Animal Sciences, Colorado State University, Fort Collins, CO, USA
| | - Se Jin Song
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
| | - Jad Kanbar
- Department of Medicine, University of California San Diego, La Jolla, CA, USA
| | - Sandrine Miller-Montgomery
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA
- Micronoma, San Diego, CA, USA
| | - Robert Heaton
- Department of Psychiatry, University of California San Diego, La Jolla, CA, USA
| | - Rana Mckay
- Moores Cancer Center, University of California San Diego Health, La Jolla, CA, USA
| | - Sandip Pravin Patel
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
- Moores Cancer Center, University of California San Diego Health, La Jolla, CA, USA
| | - Austin D Swafford
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA
| | - Tal Korem
- Program for Mathematical Genomics, Department of Systems Biology, Columbia University Irving Medical Center, New York, NY, USA
- Department of Obstetrics and Gynecology, Columbia University Irving Medical Center, New York, NY, USA
| | - Rob Knight
- Department of Bioengineering, University of California San Diego, La Jolla, CA, USA.
- Department of Pediatrics, University of California San Diego, La Jolla, CA, USA.
- Center for Microbiome Innovation, University of California San Diego, La Jolla, CA, USA.
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA, USA.
| |
|
24
|
Qiu Z, Yuan L, Lian CA, Lin B, Chen J, Mu R, Qiao X, Zhang L, Xu Z, Fan L, Zhang Y, Wang S, Li J, Cao H, Li B, Chen B, Song C, Liu Y, Shi L, Tian Y, Ni J, Zhang T, Zhou J, Zhuang WQ, Yu K. BASALT refines binning from metagenomic data and increases resolution of genome-resolved metagenomic analysis. Nat Commun 2024; 15:2179. [PMID: 38467684 PMCID: PMC10928208 DOI: 10.1038/s41467-024-46539-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2023] [Accepted: 03/01/2024] [Indexed: 03/13/2024] Open
Abstract
Metagenomic binning is an essential technique for genome-resolved characterization of uncultured microorganisms in various ecosystems but is hampered by the low efficiency of binning tools in adequately recovering metagenome-assembled genomes (MAGs). Here, we introduce BASALT (Binning Across a Series of Assemblies Toolkit) for binning and refinement of short- and long-read sequencing data. BASALT employs multiple binners with multiple thresholds to produce initial bins, then utilizes neural networks to identify core sequences to remove redundant bins and refine non-redundant bins. Using the same assemblies generated from Critical Assessment of Metagenome Interpretation (CAMI) datasets, BASALT produces up to twice as many MAGs as VAMB, DASTool, or metaWRAP. Processing assemblies from a lake sediment dataset, BASALT produces ~30% more MAGs than metaWRAP, including 21 unique class-level prokaryotic lineages. Functional annotations reveal that BASALT can retrieve 47.6% more non-redundant open reading frames than metaWRAP. These results highlight the robust handling of metagenomic sequencing data of BASALT.
Affiliation(s)
- Zhiguang Qiu
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
| | - Li Yuan
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Peng Cheng Laboratory, Shenzhen, China
| | - Chun-Ang Lian
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
| | - Bin Lin
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
| | - Jie Chen
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Peng Cheng Laboratory, Shenzhen, China
| | - Rong Mu
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
| | - Xuejiao Qiao
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
| | - Liyu Zhang
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
| | - Zheng Xu
- Southern University of Sciences and Technology Yantian Hospital, Shenzhen, China
- Institute of Biomedicine and Biotechnology, Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, Guangdong, China
| | - Lu Fan
- Department of Ocean Science and Engineering, Southern University of Science and Technology (SUSTech), Shenzhen, China
| | - Yunzeng Zhang
- Joint International Research Laboratory of Agriculture and Agri-Product Safety, the Ministry of Education of China, Yangzhou University, Yangzhou, China
| | - Shanquan Wang
- Environmental Microbiomics Research Center, School of Environmental Science and Engineering, Sun Yat-Sen University, Guangzhou, China
| | - Junyi Li
- School of Computer Science and Technology, Harbin Institute of Technology (Shenzhen), Shenzhen, Guangdong, China
| | - Huiluo Cao
- Department of Microbiology, University of Hong Kong, Hong Kong, China
| | - Bing Li
- Shenzhen International Graduate School, Tsinghua University, Shenzhen, China
| | - Baowei Chen
- Guangdong Provincial Key Laboratory of Marine Resources and Coastal Engineering, School of Marine Sciences, Sun Yat-sen University, Zhuhai, China
| | - Chi Song
- Institute of Herbgenomics, Chengdu University of Traditional Chinese Medicine, Chengdu, China
- Wuhan Benagen Technology Co., Ltd, Wuhan, China
| | - Yongxin Liu
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, China
| | - Lili Shi
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- State Key Laboratory of Chemical Oncogenomics, School of Chemical Biology and Biotechnology, Peking University Shenzhen Graduate School, Shenzhen, China
| | - Yonghong Tian
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China
- School of Electronic and Computer Engineering, Peking University, Shenzhen, China
- Peng Cheng Laboratory, Shenzhen, China
| | - Jinren Ni
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China
- College of Environmental Sciences and Engineering, Key Laboratory of Water and Sediment Sciences, Ministry of Education, Peking University, Beijing, China
| | - Tong Zhang
- Department of Civil Engineering, University of Hong Kong, Hong Kong, China
| | - Jizhong Zhou
- Institute for Environmental Genomics, University of Oklahoma, Norman, OK, USA
| | - Wei-Qin Zhuang
- Department of Civil and Environmental Engineering, Faculty of Engineering, University of Auckland, Auckland, New Zealand
| | - Ke Yu
- Eco-environment and Resource Efficiency Research Laboratory, School of Environment and Energy, Shenzhen Graduate School, Peking University, Shenzhen, China.
- AI for Science (AI4S)-Preferred Program, Peking University, Shenzhen, China.
| |
|
25
|
Hui X, Yang J, Sun J, Liu F, Pan W. MCSS: microbial community simulator based on structure. Front Microbiol 2024; 15:1358257. [PMID: 38516019 PMCID: PMC10956353 DOI: 10.3389/fmicb.2024.1358257] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Accepted: 02/20/2024] [Indexed: 03/23/2024] Open
Abstract
De novo assembly plays a pivotal role in metagenomic analysis, and the incorporation of third-generation sequencing technology can significantly improve the integrity and accuracy of assembly results. Recently, with advancements in sequencing technology (Hi-Fi, ultra-long), several long-read-based bioinformatic tools have been developed. However, the validation of the performance and reliability of these tools is a crucial concern. To address this gap, we present MCSS (microbial community simulator based on structure), which has the capability to generate simulated microbial community and sequencing datasets based on the structure attributes of real microbiome communities. The evaluation results indicate that it can generate simulated communities that exhibit both diversity and similarity to actual community structures. Additionally, MCSS generates synthetic PacBio Hi-Fi and Oxford Nanopore Technologies (ONT) long reads for the species within the simulated community. This innovative tool provides a valuable resource for benchmarking and refining metagenomic analysis methods. Code available at: https://github.com/panlab-bio/mcss.
Affiliation(s)
- Xingqi Hui
- Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
| | - Jinbao Yang
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
- College of Informatics, Huazhong Agricultural University, Wuhan, China
| | - Jinhuan Sun
- Key Laboratory of Plant Molecular Physiology, CAS Center for Excellence in Molecular Plant Sciences, Institute of Botany, Chinese Academy of Sciences, Beijing, China
| | - Fang Liu
- Zhengzhou Research Base, State Key Laboratory of Cotton Biology, School of Agricultural Sciences, Zhengzhou University, Zhengzhou, China
- National Key Laboratory of Cotton Bio-Breeding and Integrated Utilization, Institute of Cotton Research, Chinese Academy of Agricultural Sciences (ICR, CAAS), Anyang, China
| | - Weihua Pan
- Shenzhen Branch, Guangdong Laboratory of Lingnan Modern Agriculture, Genome Analysis Laboratory of the Ministry of Agriculture and Rural Affairs, Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences (ICR, CAAS), Shenzhen, China
| |
|
26
|
Matchado MS, Rühlemann M, Reitmeier S, Kacprowski T, Frost F, Haller D, Baumbach J, List M. On the limits of 16S rRNA gene-based metagenome prediction and functional profiling. Microb Genom 2024; 10:001203. [PMID: 38421266 PMCID: PMC10926695 DOI: 10.1099/mgen.0.001203] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/24/2023] [Accepted: 02/05/2024] [Indexed: 03/02/2024] Open
Abstract
Molecular profiling techniques such as metagenomics, metatranscriptomics or metabolomics offer important insights into the functional diversity of the microbiome. In contrast, 16S rRNA gene sequencing, a widespread and cost-effective technique to measure microbial diversity, only allows for indirect estimation of microbial function. To mitigate this, tools such as PICRUSt2, Tax4Fun2, PanFP and MetGEM infer functional profiles from 16S rRNA gene sequencing data using different algorithms. Prior studies have cast doubts on the quality of these predictions, motivating us to systematically evaluate these tools using matched 16S rRNA gene sequencing, metagenomic datasets, and simulated data. Our contribution is threefold: (i) using simulated data, we investigate if technical biases could explain the discordance between inferred and expected results; (ii) considering human cohorts for type 2 diabetes, colorectal cancer and obesity, we test if health-related differential abundance measures of functional categories are concordant between 16S rRNA gene-inferred and metagenome-derived profiles; and (iii) since 16S rRNA gene copy number is an important confounder in functional profile inference, we investigate if a customised copy number normalisation with the rrnDB database could improve the results. Our results show that 16S rRNA gene-based functional inference tools generally do not have the necessary sensitivity to delineate health-related functional changes in the microbiome and should thus be used with care. Furthermore, we outline important differences in the individual tools tested and offer recommendations for tool selection.
Affiliation(s)
- Monica Steffi Matchado
- Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
| | - Malte Rühlemann
- Institute of Clinical Molecular Biology, Kiel University, Kiel, Germany
| | - Sandra Reitmeier
- ZIEL - Institute for Food & Health, Core Facility Microbiome, Technical University of Munich, Freising, Germany
| | - Tim Kacprowski
- Division Data Science in Biomedicine, Peter L. Reichertz Institute for Medical Informatics of Technische Universität Braunschweig and Hannover Medical School, Braunschweig, Germany
- Braunschweig Integrated Centre of Systems Biology (BRICS), TU Braunschweig, Braunschweig, Germany
- Department of Computational Biology of Infection Research, Helmholtz Centre for Infection Research (HZI), Braunschweig, Germany
| | - Fabian Frost
- Department of Medicine A, University Medicine Greifswald, Greifswald, Germany
| | - Dirk Haller
- ZIEL - Institute for Food & Health, Core Facility Microbiome, Technical University of Munich, Freising, Germany
- Chair of Nutrition and Immunology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| | - Jan Baumbach
- Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Institute of Mathematics and Computer Science, University of Southern Denmark, Odense, Denmark
| | - Markus List
- Data Science in Systems Biology, TUM School of Life Sciences, Technical University of Munich, Freising, Germany
| |
|
27
|
Valencia EM, Maki KA, Dootz JN, Barb JJ. Mock community taxonomic classification performance of publicly available shotgun metagenomics pipelines. Sci Data 2024; 11:81. [PMID: 38233447 PMCID: PMC10794705 DOI: 10.1038/s41597-023-02877-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2023] [Accepted: 12/22/2023] [Indexed: 01/19/2024] Open
Abstract
Shotgun metagenomic sequencing comprehensively samples the DNA of a microbial sample. Choosing the best bioinformatics processing package can be daunting due to the wide variety of tools available. Here, we assessed publicly available shotgun metagenomics processing packages/pipelines including bioBakery, Just a Microbiology System (JAMS), Whole metaGenome Sequence Assembly V2 (WGSA2), and Woltka using 19 publicly available mock community samples and a set of five constructed pathogenic gut microbiome samples. Also included is a workflow for labelling bacterial scientific names with NCBI taxonomy identifiers for better resolution in assessing results. The Aitchison distance, a sensitivity metric, and total False Positive Relative Abundance were used for accuracy assessments for all pipelines and mock samples. Overall, bioBakery4 performed the best with most of the accuracy metrics, while JAMS and WGSA2 had the highest sensitivities. Furthermore, bioBakery is commonly used and only requires a basic knowledge of command line usage. This work provides an unbiased assessment of shotgun metagenomics packages and presents results assessing the performance of the packages using mock community sequence data.
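For readers unfamiliar with the Aitchison distance named in the abstract above, the following is a minimal sketch of its standard definition: the Euclidean distance between centered log-ratio (CLR) transformed compositions. It is not the code used in the cited study. The function names `clr` and `aitchison_distance` and the `pseudocount` parameter for handling zero abundances are illustrative assumptions.

```python
import math


def clr(composition, pseudocount=1e-6):
    """Centered log-ratio transform of a vector of non-negative abundances.

    The pseudocount is an assumed convention for avoiding log(0); it is
    not taken from the cited study.
    """
    shifted = [x + pseudocount for x in composition]
    total = sum(shifted)
    # Close the composition to proportions, then take logs.
    logs = [math.log(x / total) for x in shifted]
    mean_log = sum(logs) / len(logs)
    # Subtracting the mean log is equivalent to dividing by the geometric mean.
    return [v - mean_log for v in logs]


def aitchison_distance(a, b, pseudocount=1e-6):
    """Euclidean distance between the CLR transforms of two compositions."""
    if len(a) != len(b):
        raise ValueError("compositions must cover the same taxa in the same order")
    clr_a = clr(a, pseudocount)
    clr_b = clr(b, pseudocount)
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(clr_a, clr_b)))


if __name__ == "__main__":
    expected = [0.25, 0.25, 0.25, 0.25]   # hypothetical even mock community
    observed = [0.40, 0.30, 0.20, 0.10]   # hypothetical pipeline output
    print(aitchison_distance(expected, observed))
```

With `pseudocount=0` and strictly positive inputs the distance is invariant to rescaling either profile, which is why it suits relative-abundance data. With zeros present, a nonzero pseudocount (or another zero-replacement scheme) is needed, and the result depends on that choice.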
Affiliation(s)
- E Michael Valencia
- Translational Biobehavioral and Health Disparities Branch, National Institutes of Health Clinical Center, Bethesda, MD, 20814, USA
| | - Katherine A Maki
- Translational Biobehavioral and Health Disparities Branch, National Institutes of Health Clinical Center, Bethesda, MD, 20814, USA
| | - Jennifer N Dootz
- Biosystems and Biomaterials Division, National Institute of Standards and Technology, Gaithersburg, MD, 20899, USA
| | - Jennifer J Barb
- Translational Biobehavioral and Health Disparities Branch, National Institutes of Health Clinical Center, Bethesda, MD, 20814, USA.
| |
|
28
|
Steinke K, Pamp SJ, Munk P. MAGICIAN: MAG simulation for investigating criteria for bioinformatic analysis. BMC Genomics 2024; 25:55. [PMID: 38216924 PMCID: PMC10785454 DOI: 10.1186/s12864-023-09912-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/28/2023] [Accepted: 12/15/2023] [Indexed: 01/14/2024] Open
Abstract
BACKGROUND The possibility of recovering metagenome-assembled genomes (MAGs) from sequence reads allows for further insights into microbial communities and their members, possibly even analyzing such sequences with tools designed for single-isolate genomes. As result quality depends on sequence quality, performance of tools for single-isolate genomes on MAGs should be tested beforehand. Bioinformatics can be leveraged to quickly create varied synthetic test sets with known composition for this purpose. RESULTS We present MAGICIAN, a flexible, user-friendly pipeline for the simulation of MAGs. MAGICIAN combines a synthetic metagenome simulator with a metagenomic assembly and binning pipeline to simulate MAGs based on user-supplied input genomes, allowing users to test performance of tools on MAGs while having a ground truth to compare results to. Using MAGICIAN, we found that even very slight (1%) changes in depth of coverage can drastically affect whether a genome can be recovered. We also demonstrate the use of simulated MAGs by evaluating the suitability of such genomes obtained with MAGICIAN's current default pipeline for analysis with the antimicrobial resistance gene identification tool ResFinder. CONCLUSIONS Using MAGICIAN, it is possible to simulate MAGs which, while generally high in quality, reflect issues encountered with real-world data, thus providing realistic best-case data. Evaluating the results of ResFinder analysis of these genomes revealed a risk for plausible-looking false positives, which underlines the need for pipeline validation so that researchers are aware of the potential issues when interpreting real-world data. Furthermore, the effects of fluctuations in depth of coverage on genome recovery in our simulated "random sequencing" warrant further investigation and indicate random subsampling of reads may affect discovery of more genomes.
Affiliation(s)
- Kat Steinke
- Center for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kemitorvet 204, 2800, Kongens Lyngby, Denmark
- Department of Clinical Microbiology, Odense University Hospital, J. B. Winsløws Vej 21, 5000, Odense, Denmark
| | - Sünje J Pamp
- Center for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kemitorvet 204, 2800, Kongens Lyngby, Denmark
| | - Patrick Munk
- Center for Genomic Epidemiology, National Food Institute, Technical University of Denmark, Kemitorvet 204, 2800, Kongens Lyngby, Denmark.
|
29
|
Baud A, Kennedy SP. Targeted Metagenomic Databases Provide Improved Analysis of Microbiota Samples. Microorganisms 2024; 12:135. [PMID: 38257962 PMCID: PMC10819777 DOI: 10.3390/microorganisms12010135] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Revised: 12/15/2023] [Accepted: 12/28/2023] [Indexed: 01/24/2024] Open
Abstract
We report on Moonbase, an innovative pipeline that builds upon the established tools of MetaPhlAn and Kraken2, enhancing their capabilities for more precise taxonomic detection and quantification in diverse microbial communities. Moonbase enhances the performance of Kraken2 mapping by providing an efficient method for constructing project-specific databases. Moonbase was evaluated using synthetic metagenomic samples and compared against MetaPhlAn3 and generalized Kraken2 databases. Moonbase significantly improved species precision and quantification, outperforming marker genes and generalized databases. Construction of a phylogenetic tree from 16S genome data in Moonbase allowed for the incorporation of UniFrac-type phylogenetic information into diversity calculations of samples. We demonstrated that the resulting analysis increased statistical power in distinguishing microbial communities. This study highlights the continual evolution of metagenomic tools with the goal of improving metagenomic analysis and highlighting the potential of the Moonbase pipeline.
Affiliation(s)
| | - Sean P. Kennedy
- Institut Pasteur, Université Paris Cité, Département de Biologie Computationnelle, F-75015 Paris, France
|
30
|
Kim N, Kim CY, Ma J, Yang S, Park DJ, Ha SJ, Belenky P, Lee I. MRGM: an enhanced catalog of mouse gut microbial genomes substantially broadening taxonomic and functional landscapes. Gut Microbes 2024; 16:2393791. [PMID: 39230075 PMCID: PMC11376411 DOI: 10.1080/19490976.2024.2393791] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/10/2024] [Revised: 08/12/2024] [Accepted: 08/13/2024] [Indexed: 09/05/2024] Open
Abstract
Mouse gut microbiome research is pivotal for understanding the human gut microbiome, providing insights into disease modeling, host-microbe interactions, and the dietary influence on the gut microbiome. To enhance the translational value of mouse gut microbiome studies, we need detailed and high-quality catalogs of mouse gut microbial genomes. We introduce the Mouse Reference Gut Microbiome (MRGM), a comprehensive catalog with 42,245 non-redundant mouse gut bacterial genomes across 1,524 species. MRGM marks a 40% increase in the known taxonomic diversity of mouse gut microbes, capturing previously underrepresented lineages through refined genome quality assessment techniques. MRGM not only broadens the taxonomic landscape but also enriches the functional landscape of the mouse gut microbiome. Using deep learning, we have elevated the Gene Ontology annotation rate for mouse gut microbial proteins from 3.2% with orthology to 60%, marking an over 18-fold increase. MRGM supports both DNA- and marker-based taxonomic profiling by providing custom databases, surpassing previous catalogs in performance. Finally, taxonomic and functional comparisons between human and mouse gut microbiota reveal diet-driven divergences in their taxonomic composition and functional enrichment. Overall, our study highlights the value of high-quality microbial genome catalogs in advancing our understanding of the co-evolution between gut microbes and their host.
Affiliation(s)
- Nayeon Kim
- Department of Biotechnology, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
| | - Chan Yeong Kim
- Department of Biotechnology, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
| | - Junyeong Ma
- Department of Biotechnology, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
| | - Sunmo Yang
- Department of Biotechnology, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
| | - Dong Jin Park
- Department of Biochemistry, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
| | - Sang-Jun Ha
- Department of Biochemistry, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
| | - Peter Belenky
- Department of Molecular Microbiology and Immunology, Brown University, Providence, RI, USA
| | - Insuk Lee
- Department of Biotechnology, College of Life Science & Biotechnology, Yonsei University, Seoul, Republic of Korea
- POSTECH Biotech Center, Pohang University of Science and Technology (POSTECH), Pohang, Republic of Korea
|
31
|
Kang X, Xu J, Luo X, Schönhuth A. Hybrid-hybrid correction of errors in long reads with HERO. Genome Biol 2023; 24:275. [PMID: 38041098 PMCID: PMC10690975 DOI: 10.1186/s13059-023-03112-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/03/2023] [Accepted: 11/16/2023] [Indexed: 12/03/2023] Open
Abstract
Although generally superior, hybrid approaches for correcting errors in third-generation sequencing (TGS) reads, using next-generation sequencing (NGS) reads, mistake haplotype-specific variants for errors in polyploid and mixed samples. We suggest HERO, as the first "hybrid-hybrid" approach, to make use of both de Bruijn graphs and overlap graphs for optimal catering to the particular strengths of NGS and TGS reads. Extensive benchmarking experiments demonstrate that HERO improves indel and mismatch error rates by on average 65% (27–95%) and 20% (4–61%). Using HERO prior to genome assembly significantly improves the assemblies in the majority of the relevant categories.
Affiliation(s)
- Xiongbin Kang
- College of Biology, Hunan University, Changsha, China
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany
| | - Jialu Xu
- College of Biology, Hunan University, Changsha, China
| | - Xiao Luo
- College of Biology, Hunan University, Changsha, China.
| | - Alexander Schönhuth
- Genome Data Science, Faculty of Technology, Bielefeld University, Bielefeld, Germany.
|
32
|
Walsh LH, Coakley M, Walsh AM, O'Toole PW, Cotter PD. Bioinformatic approaches for studying the microbiome of fermented food. Crit Rev Microbiol 2023; 49:693-725. [PMID: 36287644 DOI: 10.1080/1040841x.2022.2132850] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 05/11/2022] [Revised: 08/11/2022] [Accepted: 09/28/2022] [Indexed: 11/03/2022]
Abstract
High-throughput DNA sequencing-based approaches continue to revolutionise our understanding of microbial ecosystems, including those associated with fermented foods. Metagenomic and metatranscriptomic approaches are state-of-the-art biological profiling methods and are employed to investigate a wide variety of characteristics of microbial communities, such as taxonomic membership, gene content and the range and level at which these genes are expressed. Individual groups and consortia of researchers are utilising these approaches to produce increasingly large and complex datasets, representing vast populations of microorganisms. There is a corresponding requirement for the development and application of appropriate bioinformatic tools and pipelines to interpret this data. This review critically analyses the tools and pipelines that have been used or that could be applied to the analysis of metagenomic and metatranscriptomic data from fermented foods. In addition, we critically analyse a number of studies of fermented foods in which these tools have previously been applied, to highlight the insights that these approaches can provide.
Affiliation(s)
- Liam H Walsh
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- School of Microbiology, University College Cork, Ireland
| | - Mairéad Coakley
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
| | - Aaron M Walsh
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
| | - Paul W O'Toole
- School of Microbiology, University College Cork, Ireland
- APC Microbiome Ireland, University College Cork, Ireland
| | - Paul D Cotter
- Teagasc Food Research Centre, Moorepark, Fermoy, Cork, Ireland
- APC Microbiome Ireland, University College Cork, Ireland
- VistaMilk SFI Research Centre, Teagasc, Moorepark, Fermoy, Cork, Ireland
|
33
|
Huttenhower C, Finn RD, McHardy AC. Challenges and opportunities in sharing microbiome data and analyses. Nat Microbiol 2023; 8:1960-1970. [PMID: 37783751 DOI: 10.1038/s41564-023-01484-x] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Received: 10/16/2021] [Accepted: 08/28/2023] [Indexed: 10/04/2023]
Abstract
Microbiome data, metadata and analytical workflows have become 'big' in terms of volume and complexity. Although the infrastructure and technologies to share data have been established, the interdisciplinary and multi-omic nature of the field can make resources difficult to identify and use. Following best practices for data deposition requires substantial effort, with sometimes little obvious reward. Gaps remain where microbiome-specific resources for data sharing or reproducibility do not yet exist. We outline available best practices, challenges to their adoption and opportunities in data sharing in microbiome research. We showcase examples of best practices and advocate for their enforcement and incentivization for data sharing. This includes recognition of data curation and sharing endeavours by individuals, institutions, journals and funders. Opportunities for progress include enabling microbiome-specific databases to incorporate future methods for data analysis, integration and reuse.
Affiliation(s)
- Curtis Huttenhower
- Harvard Chan Microbiome in Public Health Center, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Departments of Biostatistics and Immunology and Infectious Diseases, Harvard T.H. Chan School of Public Health, Boston, MA, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, USA.
| | - Robert D Finn
- European Molecular Biology Laboratory, European Bioinformatics Institute, Wellcome Genome Campus, Hinxton, Cambridge, UK
| | - Alice Carolyn McHardy
- Computational Biology of Infection Research, Helmholtz Centre for Infection Research, Braunschweig, Germany.
- Braunschweig Integrated Centre of Systems Biology (BRICS), Technische Universität Braunschweig, Braunschweig, Germany.
|
34
|
Park H, Lim SJ, Cosme J, O'Connell K, Sandeep J, Gayanilo F, Cutter Jr. GR, Montes E, Nitikitpaiboon C, Fisher S, Moustahfid H, Thompson LR. Investigation of machine learning algorithms for taxonomic classification of marine metagenomes. Microbiol Spectr 2023; 11:e0523722. [PMID: 37695074 PMCID: PMC10580933 DOI: 10.1128/spectrum.05237-22] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Accepted: 06/30/2023] [Indexed: 09/12/2023] Open
Abstract
IMPORTANCE Taxonomic profiling of microbial communities is essential to model microbial interactions and inform habitat conservation. This work develops approaches in constructing training/testing data sets from publicly available marine metagenomes and evaluates the performance of machine learning (ML) approaches in read-based taxonomic classification of marine metagenomes. Predictions from two models are used to test accuracy in metagenomic classification and to guide improvements in ML approaches. Our study provides insights on the methods, results, and challenges of deep learning on marine microbial metagenomic data sets. Future machine learning approaches can be improved by rectifying genome coverage and class imbalance in the training data sets, developing alternative models, and increasing the accessibility of computational resources for model training and refinement.
Affiliation(s)
- Helen Park
- Center for Synthetic and Systems Biology, School of Life Sciences, Tsinghua-Peking Center for Life Sciences, Tsinghua University, Beijing, China
- EPSRC/BBSRC Future Biomanufacturing Research Hub, EPSRC Synthetic Biology Research Centre SYNBIOCHEM Manchester Institute of Biotechnology and School of Chemistry, The University of Manchester, Manchester, United Kingdom
| | - Shen Jean Lim
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- College of Marine Science, University of South Florida, St Petersburg, Florida, USA
| | | | - Kyle O'Connell
- Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
- Department of Vertebrate Zoology, National Museum of Natural History, Smithsonian Institution, Northwest, Washington, DC, USA
| | - Jilla Sandeep
- Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
| | - Felimon Gayanilo
- Harte Research Institute, Texas A&M University-Corpus Christi, Corpus Christi, Texas, USA
| | - George R. Cutter Jr.
- Southwest Fisheries Science Center, Antarctic Ecosystem Research Division, National Oceanic and Atmospheric Administration, La Jolla, California, USA
| | - Enrique Montes
- Cooperative Institute for Marine and Atmospheric Studies, Rosenstiel School of Marine, Atmospheric, and Earth Science, University of Miami, Miami, Florida, USA
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
| | - Chotinan Nitikitpaiboon
- Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, The University of Tokyo, Tokyo, Japan
| | - Sam Fisher
- Deloitte Consulting LLP, Biomedical Data Science Team, Arlington, Virginia, USA
| | - Hassan Moustahfid
- NOAA/US Integrated Ocean Observing System (IOOS), Silver Spring, Maryland, USA
| | - Luke R. Thompson
- Ocean Chemistry and Ecosystems Division, Atlantic Oceanographic and Meteorological Laboratory, National Oceanic and Atmospheric Administration, Miami, Florida, USA
- Northern Gulf Institute, Mississippi State University, Mississippi, USA
|
35
|
Trinh P, Clausen DS, Willis AD. happi: a hierarchical approach to pangenomics inference. Genome Biol 2023; 24:214. [PMID: 37773075 PMCID: PMC10540326 DOI: 10.1186/s13059-023-03040-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2022] [Accepted: 08/16/2023] [Indexed: 09/30/2023] Open
Abstract
Recovering metagenome-assembled genomes (MAGs) from shotgun sequencing data is an increasingly common task in microbiome studies, as MAGs provide deeper insight into the functional potential of both culturable and non-culturable microorganisms. However, metagenome-assembled genomes vary in quality and may contain omissions and contamination. These errors present challenges for detecting genes and comparing gene enrichment across sample types. To address this, we propose happi, an approach to testing hypotheses about gene enrichment that accounts for genome quality. We illustrate the advantages of happi over existing approaches using published Saccharibacteria MAGs, Streptococcus thermophilus MAGs, and via simulation.
Affiliation(s)
- Pauline Trinh
- Department of Environmental & Occupational Health Sciences, University of Washington, Seattle, WA, USA
| | - David S Clausen
- Department of Biostatistics, University of Washington, Seattle, WA, USA
| | - Amy D Willis
- Department of Biostatistics, University of Washington, Seattle, WA, USA.
|
36
|
Price C, Russell JA. AMAnD: an automated metagenome anomaly detection methodology utilizing DeepSVDD neural networks. Front Public Health 2023; 11:1181911. [PMID: 37497030 PMCID: PMC10368493 DOI: 10.3389/fpubh.2023.1181911] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/08/2023] [Accepted: 06/12/2023] [Indexed: 07/28/2023] Open
Abstract
The composition of metagenomic communities within the human body often reflects localized medical conditions such as upper respiratory diseases and gastrointestinal diseases. Fast and accurate computational tools to flag anomalous metagenomic samples from typical samples are desirable to understand different phenotypes, especially in contexts where repeated, long-duration temporal sampling is done. Here, we present Automated Metagenome Anomaly Detection (AMAnD), which utilizes two types of Deep Support Vector Data Description (DeepSVDD) models; one trained on taxonomic feature space output by the Pan-Genomics for Infectious Agents (PanGIA) taxonomy classifier and one trained on kmer frequency counts. AMAnD's semi-supervised one-class approach makes no assumptions about what an anomaly may look like, allowing the flagging of potentially novel anomaly types. Three diverse datasets are profiled. The first dataset is hosted on the National Center for Biotechnology Information's (NCBI) Sequence Read Archive (SRA) and contains nasopharyngeal swabs from healthy and COVID-19-positive patients. The second dataset is also hosted on SRA and contains gut microbiome samples from normal controls and from patients with slow transit constipation (STC). AMAnD can learn a typical healthy nasopharyngeal or gut microbiome profile and reliably flag the anomalous COVID+ or STC samples in both feature spaces. The final dataset is a synthetic metagenome created by the Critical Assessment of Metagenome Annotation Simulator (CAMISIM). A control dataset of 50 well-characterized organisms was submitted to CAMISIM to generate 100 synthetic control class samples. The experimental conditions included 12 different spiked-in contaminants that are taxonomically similar to organisms present in the laboratory blank sample ranging from one strain tree branch taxonomic distance away to one family tree branch taxonomic distance away. 
This experiment was repeated in triplicate at three different coverage levels to probe the dependence on sample coverage. AMAnD was again able to flag the contaminant inserts as anomalous. AMAnD's assumption-free flagging of metagenomic anomalies, the real-time model training update potential of the deep learning approach, and the strong performance even with lightweight models of low sample cardinality would make AMAnD well-suited to a wide array of applied metagenomics biosurveillance use-cases, from environmental to clinical utility.
|
37
|
Zhou B, Li H. STEMSIM: a simulator of within-strain short-term evolutionary mutations for longitudinal metagenomic data. Bioinformatics 2023; 39:btad302. [PMID: 37154701 PMCID: PMC10188296 DOI: 10.1093/bioinformatics/btad302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/21/2022] [Revised: 03/29/2023] [Accepted: 04/29/2023] [Indexed: 05/10/2023] Open
Abstract
MOTIVATION As the resolution of metagenomic analysis increases, the evolution of microbial genomes in longitudinal metagenomic data has become a research focus. Some software has been developed for the simulation of complex microbial communities at the strain level. However, the tool for simulating within-strain evolutionary signals in longitudinal samples is still lacking. RESULTS In this study, we introduce STEMSIM, a user-friendly command-line simulator of short-term evolutionary mutations for longitudinal metagenomic data. The input is simulated longitudinal raw sequencing reads of microbial communities or single species. The output is the modified reads with within-strain evolutionary mutations and the relevant information of these mutations. STEMSIM will be of great use for the evaluation of analytic tools that detect short-term evolutionary mutations in metagenomic data. AVAILABILITY AND IMPLEMENTATION STEMSIM and its tutorial are freely available online at https://github.com/BoyanZhou/STEMSim.
Affiliation(s)
- Boyan Zhou
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016, USA
| | - Huilin Li
- Division of Biostatistics, Department of Population Health, New York University School of Medicine, New York, NY 10016, USA
|
38
|
Mineeva O, Danciu D, Schölkopf B, Ley RE, Rätsch G, Youngblut ND. ResMiCo: Increasing the quality of metagenome-assembled genomes with deep learning. PLoS Comput Biol 2023; 19:e1011001. [PMID: 37126495 PMCID: PMC10174551 DOI: 10.1371/journal.pcbi.1011001] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/15/2022] [Revised: 05/11/2023] [Accepted: 03/06/2023] [Indexed: 05/02/2023] Open
Abstract
The number of published metagenome assemblies is rapidly growing due to advances in sequencing technologies. However, sequencing errors, variable coverage, repetitive genomic regions, and other factors can produce misassemblies, which are challenging to detect for taxonomically novel genomic data. Assembly errors can affect all downstream analyses of the assemblies. Accuracy for the state of the art in reference-free misassembly prediction does not exceed an AUPRC of 0.57, and it is not clear how well these models generalize to real-world data. Here, we present the Residual neural network for Misassembled Contig identification (ResMiCo), a deep learning approach for reference-free identification of misassembled contigs. To develop ResMiCo, we first generated a training dataset of unprecedented size and complexity that can be used for further benchmarking and developments in the field. Through rigorous validation, we show that ResMiCo is substantially more accurate than the state of the art, and the model is robust to novel taxonomic diversity and varying assembly methods. ResMiCo estimated 7% misassembled contigs per metagenome across multiple real-world datasets. We demonstrate how ResMiCo can be used to optimize metagenome assembly hyperparameters to improve accuracy, instead of optimizing solely for contiguity. The accuracy, robustness, and ease-of-use of ResMiCo make the tool suitable for general quality control of metagenome assemblies and assembly methodology optimization.
Affiliation(s)
- Olga Mineeva
- Department of Computer Science, ETH Zürich, Zürich, Switzerland
- Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany
- Swiss Institute for Bioinformatics, Lausanne, Switzerland
| | - Daniel Danciu
- Department of Computer Science, ETH Zürich, Zürich, Switzerland
| | - Bernhard Schölkopf
- Department of Computer Science, ETH Zürich, Zürich, Switzerland
- Department of Empirical Inference, Max Planck Institute for Intelligent Systems, Tübingen, Germany
- ETH AI center, ETH Zürich, Zürich, Switzerland
| | - Ruth E Ley
- Department of Microbiome Science, Max Planck Institute for Biology, Tübingen, Germany
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zürich, Zürich, Switzerland
- Swiss Institute for Bioinformatics, Lausanne, Switzerland
- ETH AI center, ETH Zürich, Zürich, Switzerland
- Department of Biology, ETH Zürich, Zürich, Switzerland
- Medical Informatics Unit, Zürich University Hospital, Zürich, Switzerland
| | - Nicholas D Youngblut
- Department of Microbiome Science, Max Planck Institute for Biology, Tübingen, Germany
|
39
|
García Mendez D, Sanabria J, Wist J, Holmes E. Effect of Operational Parameters on the Cultivation of the Gut Microbiome in Continuous Bioreactors Inoculated with Feces: A Systematic Review. J Agric Food Chem 2023; 71:6213-6225. [PMID: 37070710 PMCID: PMC10143624 DOI: 10.1021/acs.jafc.2c08146] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 11/20/2022] [Revised: 01/27/2023] [Accepted: 01/27/2023] [Indexed: 05/03/2023]
Abstract
Since the early 1980s, multiple researchers have contributed to the development of in vitro models of the human gastrointestinal system for the mechanistic interrogation of the gut microbiome ecology. Using a bioreactor for simulating all the features and conditions of the gastrointestinal system is a massive challenge. Some conditions, such as temperature and pH, are readily controlled, but a more challenging feature to simulate is that both may vary in different regions of the gastrointestinal tract. Promising solutions have been developed for simulating other functionalities, such as dialysis capabilities, peristaltic movements, and biofilm growth. This research field is under constant development, and further efforts are needed to drive these models closer to in vivo conditions, thereby increasing their usefulness for studying the gut microbiome impact on human health. Therefore, understanding the influence of key operational parameters is fundamental for the refinement of the current bioreactors and for guiding the development of more complex models. In this review, we performed a systematic search for operational parameters in 229 papers that used continuous bioreactors seeded with human feces. Despite the reporting of operational parameters for the various bioreactor models being variable, as a result of a lack of standardization, the impact of specific operational parameters on gut microbial ecology is discussed, highlighting the advantages and limitations of the current bioreactor systems.
Affiliation(s)
- David Felipe García Mendez
- Australian National Phenome Centre and Computational and Systems Medicine, Health Futures Institute, Murdoch University, Harry Perkins Building, Perth, Australia WA6150
| | - Janeth Sanabria
- Australian National Phenome Centre and Computational and Systems Medicine, Health Futures Institute, Murdoch University, Harry Perkins Building, Perth, Australia WA6150
- Environmental Microbiology and Biotechnology Laboratory, Engineering School of Environmental & Natural Resources, Engineering Faculty, Universidad del Valle, Sede Meléndez, Cali, Colombia 76001
| | - Julien Wist
- Australian National Phenome Centre and Computational and Systems Medicine, Health Futures Institute, Murdoch University, Harry Perkins Building, Perth, Australia WA6150
- Chemistry Department, Universidad del Valle, 76001, Cali, Colombia
| | - Elaine Holmes
- Australian National Phenome Centre and Computational and Systems Medicine, Health Futures Institute, Murdoch University, Harry Perkins Building, Perth, Australia WA6150
|
40
|
Yang C, Lo T, Nip KM, Hafezqorani S, Warren RL, Birol I. Characterization and simulation of metagenomic nanopore sequencing data with Meta-NanoSim. Gigascience 2023; 12:giad013. [PMID: 36939007 PMCID: PMC10025935 DOI: 10.1093/gigascience/giad013] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 10/06/2022] [Revised: 01/19/2023] [Accepted: 02/17/2023] [Indexed: 03/21/2023] Open
Abstract
BACKGROUND Nanopore sequencing is crucial to metagenomic studies as its kilobase-long reads can contribute to resolving genomic structural differences among microbes. However, sequencing platform-specific challenges, including high base-call error rate, nonuniform read lengths, and the presence of chimeric artifacts, necessitate specifically designed analytical algorithms. The use of simulated datasets with characteristics that are true to the sequencing platform under evaluation is a cost-effective way to assess the performance of bioinformatics tools with the ground truth in a controlled environment. RESULTS Here, we present Meta-NanoSim, a fast and versatile utility that characterizes and simulates the unique properties of nanopore metagenomic reads. It improves upon state-of-the-art methods on microbial abundance estimation through a base-level quantification algorithm. Meta-NanoSim can simulate complex microbial communities composed of both linear and circular genomes and can stream reference genomes from online servers directly. Simulated datasets showed high congruence with experimental data in terms of read length, error profiles, and abundance levels. We demonstrate that Meta-NanoSim simulated data can facilitate the development of metagenomic algorithms and guide experimental design through a metagenome assembly benchmarking task. CONCLUSIONS The Meta-NanoSim characterization module investigates read features, including chimeric information and abundance levels, while the simulation module simulates large and complex multisample microbial communities with different abundance profiles. All trained models and the software are freely accessible at GitHub: https://github.com/bcgsc/NanoSim.
Affiliation(s)
- Chen Yang
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Theodora Lo
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Ka Ming Nip
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- Saber Hafezqorani
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Bioinformatics Graduate Program, University of British Columbia, Genome Sciences Centre, BCCA 100-570 West 7th Avenue, Vancouver, BC, V5Z 4S6, Canada
- René L Warren
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
- Inanc Birol
  - Canada's Michael Smith Genome Sciences Centre, BC Cancer, Vancouver, BC, V5Z 4S6, Canada
  - Department of Medical Genetics, University of British Columbia, Life Sciences Centre Room 1364 – 2350 Health Science Mall Vancouver, BC V6T 1Z3, Canada
|
41
|
Gabrielli M, Dai Z, Delafont V, Timmers PHA, van der Wielen PWJJ, Antonelli M, Pinto AJ. Identifying Eukaryotes and Factors Influencing Their Biogeography in Drinking Water Metagenomes. Environmental Science & Technology 2023; 57:3645-3660. [PMID: 36827617 PMCID: PMC9996835 DOI: 10.1021/acs.est.2c09010] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 11/29/2022] [Revised: 02/13/2023] [Accepted: 02/13/2023] [Indexed: 06/18/2023]
Abstract
The biogeography of eukaryotes in drinking water systems is poorly understood relative to that of prokaryotes or viruses, limiting the understanding of their role and management. A challenge with studying complex eukaryotic communities is that metagenomic analysis workflows are currently not as mature as those that focus on prokaryotes or viruses. In this study, we benchmarked different strategies to recover eukaryotic sequences and genomes from metagenomic data and applied the best-performing workflow to explore the factors affecting the relative abundance and diversity of eukaryotic communities in drinking water distribution systems (DWDSs). We developed an ensemble approach exploiting k-mer- and reference-based strategies to improve eukaryotic sequence identification and identified MetaBAT2 as the best-performing binning approach for their clustering. Applying this workflow to the DWDS metagenomes showed that eukaryotic sequences typically constituted small proportions (i.e., <1%) of the overall metagenomic data with higher relative abundances in surface water-fed or chlorinated systems with high residuals. The α and β diversities of eukaryotes were correlated with those of prokaryotic and viral communities, highlighting the common role of environmental/management factors. Finally, a co-occurrence analysis highlighted clusters of eukaryotes whose members' presence and abundance in DWDSs were affected by disinfection strategies, climate conditions, and source water types.
Affiliation(s)
- Marco Gabrielli
  - Dipartimento di Ingegneria Civile e Ambientale - Sezione Ambientale, Politecnico di Milano, Milan 20133, Italy
- Zihan Dai
  - Research Center for Eco-Environmental Sciences, Chinese Academy of Sciences, Beijing 100085, China
- Vincent Delafont
  - Laboratoire Ecologie et Biologie des Interactions (EBI), Equipe Microorganismes, Hôtes, Environnements, Université de Poitiers, Poitiers 86073, France
- Peer H. A. Timmers
  - KWR Watercycle Research Institute, 3433 PE Nieuwegein, The Netherlands
  - Department of Microbiology, Radboud University, Heyendaalseweg 135, 6525 AJ Nijmegen, The Netherlands
- Paul W. J. J. van der Wielen
  - KWR Watercycle Research Institute, 3433 PE Nieuwegein, The Netherlands
  - Laboratory of Microbiology, Wageningen University, 6700 HB Wageningen, The Netherlands
- Manuela Antonelli
  - Dipartimento di Ingegneria Civile e Ambientale - Sezione Ambientale, Politecnico di Milano, Milan 20133, Italy
- Ameet J. Pinto
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, Georgia 30332, United States
|
42
|
Jurado-Rueda F, Alonso-Guirado L, Perea-Chamblee TE, Elliott OT, Filip I, Rabadán R, Malats N. Benchmarking of microbiome detection tools on RNA-seq synthetic databases according to diverse conditions. Bioinformatics Advances 2023; 3:vbad014. [PMID: 36874954 PMCID: PMC9976984 DOI: 10.1093/bioadv/vbad014] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/15/2022] [Revised: 11/15/2022] [Accepted: 02/03/2023] [Indexed: 02/24/2023]
Abstract
Motivation: Here, we performed a benchmarking analysis of five tools for microbe sequence detection using transcriptomics data (Kraken2, MetaPhlAn2, PathSeq, DRAC and Pandora). We built a synthetic database mimicking real-world structure with tuned conditions accounting for microbe species prevalence, base calling quality and sequence length. Sensitivity and positive predictive value (PPV), as well as computational requirements, were used for tool ranking. Results: GATK PathSeq showed the highest sensitivity on average and across all scenarios considered. However, the main drawback of this tool was its slowness. Kraken2 was the fastest tool and displayed the second-best sensitivity, though with large variance depending on the species to be classified. There was no significant difference in sensitivity among the other three algorithms. The sensitivity of MetaPhlAn2 and Pandora was affected by sequence number, and that of DRAC by sequence quality and length. Results from this study support the use of Kraken2 for routine microbiome profiling based on its competitive sensitivity and runtime performance. Nonetheless, we strongly recommend complementing it with MetaPhlAn2 for thorough taxonomic analyses. Availability and implementation: https://github.com/fjuradorueda/MIME/ and https://github.com/lola4/DRAC/. Supplementary information: Supplementary data are available at Bioinformatics Advances online.
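The two ranking metrics named in this abstract have standard definitions: sensitivity is TP / (TP + FN) and PPV is TP / (TP + FP). The following is a minimal Python sketch of how they could be computed for one sample when the true species set is known, as in a synthetic benchmark. The function name and the toy species lists are illustrative assumptions, not taken from the study's code or data.

```python
def sensitivity_and_ppv(truth, detected):
    """Return (sensitivity, PPV) for detected taxa versus a known truth set.

    Hypothetical helper for illustration; not part of any benchmarked tool.
    """
    truth, detected = set(truth), set(detected)
    tp = len(truth & detected)   # taxa correctly reported
    fn = len(truth - detected)   # taxa present but missed
    fp = len(detected - truth)   # taxa reported but absent
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    ppv = tp / (tp + fp) if (tp + fp) else float("nan")
    return sensitivity, ppv


# Toy example (made-up data): 3 of 4 true species found, 1 false positive.
truth = {"Escherichia coli", "Bacteroides fragilis",
         "Fusobacterium nucleatum", "Helicobacter pylori"}
detected = {"Escherichia coli", "Bacteroides fragilis",
            "Fusobacterium nucleatum", "Cutibacterium acnes"}
sens, ppv = sensitivity_and_ppv(truth, detected)
print(f"sensitivity={sens:.2f}, PPV={ppv:.2f}")
```

Treating taxa as sets ignores abundance; a read-level evaluation would count classified reads instead of species.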
Affiliation(s)
- Francisco Jurado-Rueda
  - Genetic & Molecular Epidemiology Group, Spanish National Cancer Research Centre and CIBERONC, Madrid 28029, Spain
- Lola Alonso-Guirado
  - Genetic & Molecular Epidemiology Group, Spanish National Cancer Research Centre and CIBERONC, Madrid 28029, Spain
- Tomin E Perea-Chamblee
  - Program for Mathematical Genomics and Department of Systems Biology, Columbia University, New York, NY 10027, USA
- Oliver T Elliott
  - Program for Mathematical Genomics and Department of Systems Biology, Columbia University, New York, NY 10027, USA
- Ioan Filip
  - Program for Mathematical Genomics and Department of Systems Biology, Columbia University, New York, NY 10027, USA
- Raúl Rabadán
  - Program for Mathematical Genomics and Department of Systems Biology, Columbia University, New York, NY 10027, USA
- Núria Malats
  - Genetic & Molecular Epidemiology Group, Spanish National Cancer Research Centre and CIBERONC, Madrid 28029, Spain
|
43
|
Metagenomic Antimicrobial Susceptibility Testing from Simulated Native Patient Samples. Antibiotics (Basel) 2023; 12:antibiotics12020366. [PMID: 36830277 PMCID: PMC9952719 DOI: 10.3390/antibiotics12020366] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/17/2023] [Revised: 02/06/2023] [Accepted: 02/08/2023] [Indexed: 02/12/2023]
Abstract
Genomic antimicrobial susceptibility testing (AST) has been shown to be accurate for many pathogens and antimicrobials. However, these methods have not been systematically evaluated for clinical metagenomic data. We investigate the performance of in silico AST from clinical metagenomes (MG-AST). Using isolate sequencing data from a multi-center study on antimicrobial resistance (AMR) as well as shotgun-sequenced septic urine samples, we simulate over 2000 complicated urinary tract infection (cUTI) metagenomes with known resistance phenotypes to 5 antimicrobials. Applying rule-based and machine learning-based genomic AST classifiers, we explore the impact of sequencing depth and technology, metagenome complexity, and bioinformatics processing approaches on AST accuracy. By using an optimized metagenomics assembly and binning workflow, MG-AST achieved balanced accuracy within 5.1% of isolate-derived genomic AST. For poly-microbial infections, the taxonomic complexity of the sample and the relatedness of the taxa it contains are key factors influencing metagenomic binning and downstream MG-AST accuracy. We show that the reassignment of putative plasmid contigs by their predicted host range and investigation of whole resistome capabilities improved MG-AST performance on poly-microbial samples. We further demonstrate that machine learning-based methods enable MG-AST with superior accuracy compared to rule-based approaches on simulated native patient samples.
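Balanced accuracy, the metric reported in this abstract, is conventionally the mean of sensitivity and specificity. That makes it more informative than plain accuracy when resistant and susceptible samples are unevenly represented. The sketch below uses that standard binary definition. The function name, the "R"/"S" labels, and the toy predictions are assumptions for illustration and are not drawn from the study.

```python
def balanced_accuracy(y_true, y_pred, positive="R"):
    """Mean of sensitivity and specificity for binary resistance calls.

    Hypothetical helper; "R" (resistant) is treated as the positive class.
    """
    tp = tn = fp = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one sample of each class")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


# Toy example (made-up data): 2 resistant and 8 susceptible samples.
y_true = ["R", "R"] + ["S"] * 8
y_pred = ["R", "S"] + ["S"] * 8   # one resistant sample is missed
plain = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"accuracy={plain:.2f}, balanced accuracy={balanced_accuracy(y_true, y_pred):.2f}")
```

In this toy case, plain accuracy is 0.90 while balanced accuracy is 0.75, because half of the resistant samples were missed.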
|
44
|
Martin S, Ayling M, Patrono L, Caccamo M, Murcia P, Leggett RM. Capturing variation in metagenomic assembly graphs with MetaCortex. Bioinformatics 2023; 39:6986127. [PMID: 36722204 PMCID: PMC9889960 DOI: 10.1093/bioinformatics/btad020] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 06/20/2022] [Revised: 11/10/2022] [Accepted: 01/11/2023] [Indexed: 01/13/2023]
Abstract
MOTIVATION: The assembly of contiguous sequence from metagenomic samples presents a particular challenge, due to the presence of multiple species, often closely related, at varying levels of abundance. Capturing diversity within species, for example, viral haplotypes or bacterial strain-level diversity, is even more challenging. RESULTS: We present MetaCortex, a metagenome assembler that captures intra-species diversity by searching for signatures of local variation along assembled sequences in the underlying assembly graph and outputting these sequences in sequence graph format. We show that MetaCortex produces accurate assemblies with higher genome coverage and contiguity than other popular metagenomic assemblers on mock viral communities with high levels of strain-level diversity and on simulated communities containing simulated strains. AVAILABILITY AND IMPLEMENTATION: Source code is freely available to download from https://github.com/SR-Martin/metacortex. MetaCortex is implemented in C and supported on macOS and Linux. The version used for the results presented in this article is available at doi.org/10.5281/zenodo.7273627. SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Pablo Murcia
  - MRC-University of Glasgow Centre for Virus Research, Glasgow G61 1QH, UK
|
45
|
Salazar VW, Shaban B, Quiroga MDM, Turnbull R, Tescari E, Rossetto Marcelino V, Verbruggen H, Lê Cao KA. Metaphor: A workflow for streamlined assembly and binning of metagenomes. Gigascience 2022; 12:giad055. [PMID: 37522759 PMCID: PMC10388702 DOI: 10.1093/gigascience/giad055] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 03/20/2023] [Revised: 06/05/2023] [Accepted: 07/04/2023] [Indexed: 08/01/2023]
Abstract
Recent advances in bioinformatics and high-throughput sequencing have enabled the large-scale recovery of genomes from metagenomes. This has the potential to bring important insights as researchers can bypass cultivation and analyze genomes sourced directly from environmental samples. There are, however, technical challenges associated with this process, most notably the complexity of computational workflows required to process metagenomic data, which include dozens of bioinformatics software tools, each with its own set of customizable parameters that affect the final output of the workflow. At the core of these workflows are the processes of assembly (combining the short input reads into longer, contiguous fragments, or contigs) and binning (clustering these contigs into individual genome bins). The limitations of assembly and binning algorithms also pose different challenges depending on the strategy selected to execute them. Both of these processes can be done for each sample separately or by pooling together multiple samples to leverage information from a combination of samples. Here we present Metaphor, a fully automated workflow for genome-resolved metagenomics (GRM). Metaphor differs from existing GRM workflows by offering flexible approaches for the assembly and binning of the input data and by combining multiple binning algorithms with a bin refinement step to achieve high-quality genome bins. Moreover, Metaphor generates reports to evaluate the performance of the workflow. We showcase the functionality of Metaphor on different synthetic datasets and the impact of available assembly and binning strategies on the final results.
Affiliation(s)
- Vinícius W Salazar
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
- Babak Shaban
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Maria del Mar Quiroga
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Robert Turnbull
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Edoardo Tescari
  - Melbourne Data Analytics Platform (MDAP), University of Melbourne, Carlton, VIC 3053, Victoria, Australia
- Vanessa Rossetto Marcelino
  - Department of Molecular and Translational Sciences, Monash University, Clayton, VIC 3168, Victoria, Australia
  - Centre for Innate Immunity and Infectious Diseases, Hudson Institute of Medical Research, Clayton, VIC 3168, Victoria, Australia
  - School of BioSciences, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
  - Department of Microbiology and Immunology, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Parkville, VIC 3052, Victoria, Australia
- Heroen Verbruggen
  - School of BioSciences, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
- Kim-Anh Lê Cao
  - Melbourne Integrative Genomics, School of Mathematics & Statistics, University of Melbourne, Parkville, VIC 3052, Victoria, Australia
|
46
|
Mendes CI, Vila-Cerqueira P, Motro Y, Moran-Gilad J, Carriço JA, Ramirez M. LMAS: evaluating metagenomic short de novo assembly methods through defined communities. Gigascience 2022; 12:giac122. [PMID: 36576131 PMCID: PMC9795473 DOI: 10.1093/gigascience/giac122] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 05/04/2022] [Revised: 09/26/2022] [Accepted: 11/16/2022] [Indexed: 12/29/2022]
Abstract
BACKGROUND: The de novo assembly of raw sequence data is key in metagenomic analysis. It allows recovering draft genomes from a pool of mixed raw reads, yielding longer sequences that offer contextual information and provide a more complete picture of the microbial community. FINDINGS: To better compare de novo assemblers for metagenomic analysis, LMAS (Last Metagenomic Assembler Standing) was developed as a flexible platform allowing users to evaluate assembler performance given known standard communities. Overall, in our test datasets, k-mer De Bruijn graph assemblers outperformed the alternative approaches but came with a greater computational cost. Furthermore, assemblers branded as metagenomic specific did not consistently outperform other genomic assemblers in metagenomic samples. Some assemblers still in use, such as ABySS, MetaHipmer2, minia, and VelvetOptimiser, perform relatively poorly and should be used with caution when assembling complex samples. Meaningful strain resolution at the single-nucleotide polymorphism level was not achieved, even by the best assemblers tested. CONCLUSIONS: The choice of a de novo assembler depends on the computational resources available, the replicon of interest, and the major goals of the analysis. No single assembler appeared to be an ideal choice for short-read metagenomic prokaryote replicon assembly, with each showing specific strengths. The choice of metagenomic assembler should be guided by user requirements and characteristics of the sample of interest, and LMAS provides an interactive evaluation platform for this purpose. LMAS is open source, and the workflow and its documentation are available at https://github.com/B-UMMI/LMAS and https://lmas.readthedocs.io/, respectively.
Affiliation(s)
- Catarina Inês Mendes
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Pedro Vila-Cerqueira
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Yair Motro
  - Faculty of Health Sciences, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel
- Jacob Moran-Gilad
  - Faculty of Health Sciences, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel
- João André Carriço
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
- Mário Ramirez
  - Instituto de Microbiologia, Instituto de Medicina Molecular, Faculdade de Medicina, Universidade de Lisboa, 1649-028 Lisboa, Portugal
|
47
|
Zhu Y, Shang J, Peng C, Sun Y. Phage family classification under Caudoviricetes: A review of current tools using the latest ICTV classification framework. Front Microbiol 2022; 13:1032186. [PMID: 36590402 PMCID: PMC9800612 DOI: 10.3389/fmicb.2022.1032186] [Citation(s) in RCA: 47] [Impact Index Per Article: 15.7] [Received: 08/30/2022] [Accepted: 11/29/2022] [Indexed: 12/23/2022]
Abstract
Bacteriophages, which are viruses infecting bacteria, are the most ubiquitous and diverse entities in the biosphere. There is accumulating evidence revealing their important roles in shaping the structure of various microbiomes. Thanks to (viral) metagenomic sequencing, a large number of new bacteriophages have been discovered. However, in the absence of a standard and automatic virus classification pipeline, the taxonomic characterization of new viruses seriously lags behind the sequencing efforts. In particular, according to the latest version of ICTV, several large phage families in the previous classification system have been removed. Therefore, a comprehensive review and comparison of taxonomic classification tools under the new standard are needed to establish the state of the art. In this work, we retrained and tested four recently published tools on newly labeled databases. We demonstrated their utility and tested them on multiple datasets, including RefSeq, short contigs, simulated metagenomic datasets, and low-similarity datasets. This study provides a comprehensive review of phage family classification in different scenarios and practical guidance for choosing appropriate taxonomic classification pipelines. To the best of our knowledge, this is the first review conducted under the new ICTV classification framework. The results show that the new family classification framework overall leads to better conserved groups and thus makes family-level classification more feasible.
|
48
|
Mining of novel secondary metabolite biosynthetic gene clusters from acid mine drainage. Sci Data 2022; 9:760. [PMID: 36494363 PMCID: PMC9734747 DOI: 10.1038/s41597-022-01866-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/19/2021] [Accepted: 11/23/2022] [Indexed: 12/13/2022]
Abstract
Acid mine drainage (AMD) is usually acidic (pH < 4) and contains high concentrations of dissolved metals and metalloids, making AMD a typical representative of extreme environments. Recent studies have shown that microbes play a key role in AMD bioremediation, and secondary metabolite biosynthetic gene clusters (smBGCs) from AMD microbes are important resources for the synthesis of antibacterial and anticancer drugs. Here, 179 samples from 13 mineral types were used to analyze the putative novel microorganisms and secondary metabolites in AMD environments. Among 7,007 qualified metagenome-assembled genomes (MAGs) mined from these datasets, 6,340 MAGs could not be assigned to any GTDB species representative. Overall, 11,856 smBGCs in eight categories were obtained from 7,007 qualified MAGs, and 10,899 smBGCs were identified as putative novel smBGCs. We anticipate that these datasets will accelerate research in the field of AMD bioremediation, aid in the discovery of novel secondary metabolites, and facilitate investigation into gene functions, metabolic pathways, and CNPS cycles in AMD.
|
49
|
Liu Y, Elworth RAL, Jochum MD, Aagaard KM, Treangen TJ. De novo identification of microbial contaminants in low microbial biomass microbiomes with Squeegee. Nat Commun 2022; 13:6799. [PMID: 36357382 PMCID: PMC9649624 DOI: 10.1038/s41467-022-34409-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 05/08/2021] [Accepted: 10/25/2022] [Indexed: 11/12/2022]
Abstract
Computational analysis of host-associated microbiomes has opened the door to numerous discoveries relevant to human health and disease. However, contaminant sequences in metagenomic samples can potentially impact the interpretation of findings reported in microbiome studies, especially in low-biomass environments. Contamination from DNA extraction kits or sampling lab environments leaves taxonomic "bread crumbs" across multiple distinct sample types. Here we describe Squeegee, a de novo contamination detection tool that is based upon this principle, allowing the detection of microbial contaminants when negative controls are unavailable. On the low-biomass samples, we compare Squeegee predictions to experimental negative control data and show that Squeegee accurately recovers putative contaminants. We analyze samples of varying biomass from the Human Microbiome Project and identify likely, previously unreported kit contamination. Collectively, our results highlight that Squeegee can identify microbial contaminants with high precision and thus represents a computational approach for contaminant detection when negative controls are unavailable.
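The principle stated in this abstract, that contaminants leave traces across multiple distinct sample types, can be illustrated with a deliberately simplified sketch. This is not Squeegee's algorithm, which the abstract does not describe in detail. It only flags taxa seen in every sample type of a toy dataset. The function name, the `min_type_fraction` parameter, and all sample data are hypothetical.

```python
from collections import defaultdict


def candidate_contaminants(samples, min_type_fraction=1.0):
    """Flag taxa observed in at least `min_type_fraction` of the sample types.

    `samples` is an iterable of (sample_type, set_of_taxa) pairs.
    Simplified illustration only; a real tool would add further evidence.
    """
    types_by_taxon = defaultdict(set)
    all_types = set()
    for sample_type, taxa in samples:
        all_types.add(sample_type)
        for taxon in taxa:
            types_by_taxon[taxon].add(sample_type)
    return sorted(
        taxon
        for taxon, types in types_by_taxon.items()
        if len(types) / len(all_types) >= min_type_fraction
    )


# Toy data (invented): one genus appears in every, otherwise distinct, sample type.
samples = [
    ("stool", {"Bacteroides", "Ralstonia"}),
    ("skin", {"Cutibacterium", "Ralstonia"}),
    ("oral", {"Streptococcus", "Ralstonia"}),
]
print(candidate_contaminants(samples))
```

A taxon shared across body sites is only a candidate. Genuinely widespread organisms would also be flagged by this naive rule, which is why experimental negative controls remain the reference for validation.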
Affiliation(s)
- Yunxi Liu
  - Rice University, Department of Computer Science, Houston, TX, 77005, USA
- R A Leo Elworth
  - Rice University, Department of Computer Science, Houston, TX, 77005, USA
- Michael D Jochum
  - Department of Obstetrics and Gynecology, Division of Maternal-Fetal Medicine, Baylor College of Medicine and Texas Children's Hospital, Houston, TX, 77030, USA
- Kjersti M Aagaard
  - Department of Obstetrics and Gynecology, Division of Maternal-Fetal Medicine, Baylor College of Medicine and Texas Children's Hospital, Houston, TX, 77030, USA
- Todd J Treangen
  - Rice University, Department of Computer Science, Houston, TX, 77005, USA
|
50
|
VeChat: correcting errors in long reads using variation graphs. Nat Commun 2022; 13:6657. [PMID: 36333324 PMCID: PMC9636371 DOI: 10.1038/s41467-022-34381-8] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Received: 03/04/2022] [Accepted: 10/24/2022] [Indexed: 11/06/2022]
Abstract
Error correction is the canonical first step in long-read sequencing data analysis. Current self-correction methods, however, are affected by consensus-sequence-induced biases that mask true variants in lower-frequency haplotypes in mixed samples. Unlike consensus sequence templates, graph-based reference systems are not affected by such biases and so do not mistakenly mask true variants as errors. We present VeChat, an approach that implements this idea: VeChat is based on variation graphs, a popular type of data structure for pangenome reference systems. Extensive benchmarking experiments demonstrate that long reads corrected by VeChat contain 4 to 15 times (Pacific Biosciences) and 1 to 10 times (Oxford Nanopore Technologies) fewer errors than reads corrected by state-of-the-art approaches. Further, using VeChat prior to long-read assembly significantly improves the haplotype awareness of the assemblies. VeChat is an easy-to-use open-source tool and is publicly available at https://github.com/HaploKit/vechat.
|