1. Rahman Hera M, Koslicki D. Estimating similarity and distance using FracMinHash. Algorithms Mol Biol 2025; 20:8. [PMID: 40375084 DOI: 10.1186/s13015-025-00276-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/25/2024] [Accepted: 03/30/2025] [Indexed: 05/18/2025]
Abstract
MOTIVATION The increasing number and volume of genomic and metagenomic data necessitate scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics are still lacking. THEORETICAL CONTRIBUTIONS In this paper, we present a theoretical framework for estimating similarity/distance metrics by using FracMinHash sketches, when the metric is expressible in a certain form. We establish conditions under which such an estimation is sound and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings. PRACTICAL CONTRIBUTIONS We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. frac-kmc is also the first parallel tool for this task, allowing sketch generation to be sped up using multiple CPU cores, an option lacking in existing serialized tools. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
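As a rough illustration of the idea summarized in this abstract, the following Python sketch builds a FracMinHash-style sketch and estimates cosine similarity from two sketches. It is not frac-kmc's code or the paper's estimator: the hash function (BLAKE2b), the 64-bit hash space, the function names, and the convention that `s` is the fraction of the hash space retained are all assumptions made for this example, and canonical (strand-independent) k-mers are omitted for brevity.

```python
import hashlib
from collections import Counter
from math import sqrt

HASH_SPACE = 2**64  # assumed 64-bit hash range for this illustration


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to a 64-bit integer (BLAKE2b is an illustrative choice)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(seq: str, k: int, s: float) -> Counter:
    """Keep every k-mer whose hash falls below s * HASH_SPACE, with its count."""
    threshold = int(s * HASH_SPACE)
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity of two count vectors indexed by retained hashes."""
    dot = sum(count * b[h] for h, count in a.items() if h in b)
    norm_a = sqrt(sum(c * c for c in a.values()))
    norm_b = sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty sketch carries no information
    return dot / (norm_a * norm_b)
```

Because the same threshold is applied to every sequence, a k-mer is either kept in all sketches or in none, which is what lets similarity computed on sketches stand in for similarity on the full k-mer sets. Very small values of `s` can leave a sketch empty, which is one informal way to see why a minimum scale factor matters.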
Affiliation(s)
- Mahmudur Rahman Hera
  - School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, USA
- David Koslicki
  - School of Electrical Engineering and Computer Science, Pennsylvania State University, University Park, USA
  - Huck Institutes of the Life Sciences, Pennsylvania State University, University Park, USA
  - Department of Biology, Pennsylvania State University, University Park, USA

2. Zielezinski A, Gudyś A, Barylski J, Siminski K, Rozwalak P, Dutilh BE, Deorowicz S. Ultrafast and accurate sequence alignment and clustering of viral genomes. Nat Methods 2025:10.1038/s41592-025-02701-7. [PMID: 40374946 DOI: 10.1038/s41592-025-02701-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/09/2024] [Accepted: 04/14/2025] [Indexed: 05/18/2025]
Abstract
Viromics produces millions of viral genomes and fragments annually, overwhelming traditional sequence comparison methods. Here we introduce Vclust, an approach that determines average nucleotide identity by Lempel-Ziv parsing and clusters viral genomes with thresholds endorsed by authoritative viral genomics and taxonomy consortia. Vclust demonstrates superior accuracy and efficiency compared to existing tools, clustering millions of genomes in a few hours on a mid-range workstation.
Affiliation(s)
- Andrzej Zielezinski
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
- Adam Gudyś
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Jakub Barylski
  - Department of Molecular Virology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
- Krzysztof Siminski
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland
- Piotr Rozwalak
  - Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University, Poznan, Poland
  - Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
- Bas E Dutilh
  - Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, Jena, Germany
  - Theoretical Biology and Bioinformatics, Science4Life, Utrecht University, Utrecht, the Netherlands
- Sebastian Deorowicz
  - Faculty of Automatic Control, Electronics and Computer Science, Silesian University of Technology, Gliwice, Poland

3. Langwig MV, Koester F, Martin C, Zhou Z, Joye SB, Reysenbach AL, Anantharaman K. Endemism shapes viral ecology and evolution in globally distributed hydrothermal vent ecosystems. Nat Commun 2025; 16:4076. [PMID: 40307239 PMCID: PMC12043954 DOI: 10.1038/s41467-025-59154-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2024] [Accepted: 04/10/2025] [Indexed: 05/02/2025]
Abstract
Viruses are ubiquitous in deep-sea hydrothermal vents, where they influence microbial communities and biogeochemistry. Yet, viral ecology and evolution remain understudied in these environments. Here, we identify 49,962 viruses from 52 globally distributed hydrothermal vent samples (10 plume, 40 deposit, and 2 diffuse flow metagenomes), and reconstruct 5708 viral metagenome-assembled genomes, the majority of which were bacteriophages. Hydrothermal viruses were largely endemic; however, some viruses were shared between geographically separated vents, predominantly between the Lau Basin and Brothers Volcano in the Pacific Ocean. Geographically distant viruses shared proteins related to core functions such as structural proteins, and rarely, proteins of auxiliary functions involved in processes such as fermentation and cobalamin biosynthesis. Common microbial hosts of viruses included members of Campylobacterota, Alpha-, and Gammaproteobacteria in deposits, and Gammaproteobacteria in plumes. Campylobacterota- and Gammaproteobacteria-infecting viruses reflected variations in hydrothermal chemistry and functional redundancy in their predicted microbial hosts, suggesting that hydrothermal geology is a driver of viral ecology and coevolution of viruses and hosts. Our results indicate that viral ecology and evolution in globally distributed hydrothermal vents are shaped by endemism and thus may have increased susceptibility to the negative impacts of deep-sea mining and anthropogenic change in ocean ecosystems.
Affiliation(s)
- Marguerite V Langwig
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
  - Freshwater and Marine Sciences Program, University of Wisconsin-Madison, Madison, WI, USA
- Faith Koester
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Cody Martin
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
  - Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, USA
- Zhichao Zhou
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
- Samantha B Joye
  - Department of Marine Sciences, University of Georgia, Athens, GA, USA
- Karthik Anantharaman
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Integrative Biology, University of Wisconsin-Madison, Madison, WI, USA
  - Department of Data Science and AI, Wadhwani School of Data Science and AI, Indian Institute of Technology Madras, Chennai, TN, India

4. Gerhardt K, Ruiz-Perez C, Rodriguez-R L, Jain C, Tiedje J, Cole J, Konstantinidis K. FastAAI: efficient estimation of genome average amino acid identity and phylum-level relationships using tetramers of universal proteins. Nucleic Acids Res 2025; 53:gkaf348. [PMID: 40287826 PMCID: PMC12034039 DOI: 10.1093/nar/gkaf348] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Revised: 03/28/2025] [Accepted: 04/16/2025] [Indexed: 04/29/2025]
Abstract
Estimation of whole-genome relatedness and taxonomic identification are two important bioinformatics tasks in describing environmental or clinical microbiomes. The genome-aggregate Average Nucleotide Identity is routinely used to derive the relatedness of closely related (species level) microbial and viral genomes, but it is not appropriate for more divergent genomes. Average Amino-acid Identity (AAI) can be used in the latter cases, but no current AAI implementation can efficiently compare thousands of genomes. Here we present FastAAI, a tool that estimates whole-genome pairwise relatedness using shared tetramers of universal proteins in a matter of microseconds, providing a speedup of up to 5 orders of magnitude when compared with current methods for calculating AAI or alternative whole-genome metrics. Further, FastAAI resolves distantly related genomes related at the phylum level with comparable accuracy to the phylogeny of ribosomal RNA genes, substantially improving on a known limitation of current AAI implementations. Our analysis of the resulting AAI matrices also indicated that bacterial lineages predominantly evolve gradually, rather than showing bursts of diversification, and that AAI thresholds to define classes, orders, and families are generally elusive. Therefore, FastAAI uniquely expands the toolbox for microbiome analysis and allows it to scale to millions of genomes.
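To make "shared tetramers of universal proteins" concrete, here is a minimal Python sketch that compares two genomes by the Jaccard index of amino-acid 4-mers, averaged over the proteins both genomes carry. This is an illustration under stated assumptions, not FastAAI's implementation: the abstract does not say how tetramer sharing is converted into an AAI value, so that step is left out, and the function names and the protein labels used in the usage note are hypothetical.

```python
def tetramers(protein: str) -> set:
    """Return the set of overlapping amino-acid 4-mers in a protein sequence."""
    return {protein[i:i + 4] for i in range(len(protein) - 3)}


def jaccard(a: set, b: set) -> float:
    """Jaccard index |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    union = a | b
    if not union:
        return 0.0
    return len(a & b) / len(union)


def mean_shared_tetramer_jaccard(genome_a: dict, genome_b: dict) -> float:
    """Average the per-protein tetramer Jaccard over proteins present in both genomes.

    Each genome is a dict mapping a protein label to its amino-acid sequence.
    """
    shared = genome_a.keys() & genome_b.keys()
    if not shared:
        return 0.0
    total = sum(
        jaccard(tetramers(genome_a[name]), tetramers(genome_b[name]))
        for name in shared
    )
    return total / len(shared)
```

For example, `mean_shared_tetramer_jaccard({"rpsB": "MKVLA"}, {"rpsB": "KVLAG"})` compares the sets {MKVL, KVLA} and {KVLA, VLAG}, which share one of three distinct tetramers. Set operations of this kind are cheap, which fits the abstract's emphasis on speed, though the reported timings depend on details not shown here.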
Affiliation(s)
- Kenji Gerhardt
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Carlos A Ruiz-Perez
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Luis M Rodriguez-R
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
  - Department of Microbiology and Digital Science Center (DiSC), University of Innsbruck, Innsbruck 6020, Austria
- Chirag Jain
  - Department of Computational and Data Sciences, Indian Institute of Science, Bengaluru, KA 560012, India
- James M Tiedje
  - Center for Microbial Ecology, Michigan State University, East Lansing MI 48824, United States
- James R Cole
  - Center for Microbial Ecology, Michigan State University, East Lansing MI 48824, United States
- Konstantinos T Konstantinidis
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA 30332, United States
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States

5. Aroney STN, Newell RJP, Nissen JN, Camargo AP, Tyson GW, Woodcroft BJ. CoverM: read alignment statistics for metagenomics. Bioinformatics 2025; 41:btaf147. [PMID: 40193404 PMCID: PMC11993303 DOI: 10.1093/bioinformatics/btaf147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/23/2025] [Revised: 03/26/2025] [Accepted: 04/03/2025] [Indexed: 04/09/2025]
Abstract
SUMMARY Genome-centric analysis of metagenomic samples is a powerful method for understanding the function of microbial communities. Calculating read coverage is a central part of analysis, enabling differential coverage binning for recovery of genomes and estimation of microbial community composition. Coverage is determined by processing read alignments to reference sequences of either contigs or genomes. Per-reference coverage is typically calculated in an ad-hoc manner, with each software package providing its own implementation and specific definition of coverage. Here we present a unified software package CoverM which calculates several coverage statistics for contigs and genomes in an ergonomic and flexible manner. It uses "Mosdepth arrays" for computational efficiency and avoids unnecessary I/O overhead by calculating coverage statistics from streamed read alignment results. AVAILABILITY AND IMPLEMENTATION CoverM is free software available at https://github.com/wwood/coverm. CoverM is implemented in Rust, with Python (https://github.com/apcamargo/pycoverm) and Julia (https://github.com/JuliaBinaryWrappers/CoverM_jll.jl) interfaces.
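The kind of calculation described in this abstract can be sketched in a few lines of Python. The example below derives per-base depth from alignment intervals using a difference array (increment at each alignment start, decrement at each end, then take a running sum), which is the general idea usually associated with Mosdepth. It is an assumption that this matches what the abstract calls "Mosdepth arrays"; the function names and the 0-based, half-open interval convention are choices made for this illustration, and CoverM itself is implemented in Rust with its own definitions of each statistic.

```python
def depth_from_alignments(ref_len: int, alignments: list) -> list:
    """Per-base read depth from (start, end) intervals, 0-based and half-open."""
    delta = [0] * (ref_len + 1)
    for start, end in alignments:
        delta[start] += 1  # coverage rises where an alignment begins
        delta[end] -= 1    # and falls just past where it ends
    depth = []
    running = 0
    for change in delta[:ref_len]:
        running += change
        depth.append(running)
    return depth


def mean_coverage(depth: list) -> float:
    """Average depth across all reference positions."""
    return sum(depth) / len(depth)


def covered_fraction(depth: list) -> float:
    """Fraction of reference positions covered by at least one alignment."""
    return sum(1 for d in depth if d > 0) / len(depth)
```

Each alignment costs two array updates regardless of its length, and the depth vector is produced in a single pass, so alignments can be consumed as they stream in rather than being stored first.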
Affiliation(s)
- Samuel T N Aroney
  - Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba 4102, Australia
- Rhys J P Newell
  - Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba 4102, Australia
- Jakob N Nissen
  - The Novo Nordisk Foundation Center for Basic Metabolic Research, University of Copenhagen, Copenhagen 2200, Denmark
- Antonio Pedro Camargo
  - Departamento de Genética e Evolução, Instituto de Biologia, Universidade Estadual de Campinas, Campinas, São Paulo 13083-970, Brazil
  - Lawrence Berkeley National Laboratory, DOE Joint Genome Institute, Berkeley, CA 94720, United States
- Gene W Tyson
  - Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba 4102, Australia
- Ben J Woodcroft
  - Centre for Microbiome Research, School of Biomedical Sciences, Queensland University of Technology (QUT), Translational Research Institute, Woolloongabba 4102, Australia

6. He H, Yi K, Yang L, Jing Y, Kang L, Gao Z, Xiang D, Tan G, Wang Y, Liu Q, Xie L, Jiang S, Liu T, Chen W. Development of a lytic Ralstonia phage cocktail and evaluation of its control efficacy against tobacco bacterial wilt. Front Plant Sci 2025; 16:1554992. [PMID: 40182540 PMCID: PMC11966396 DOI: 10.3389/fpls.2025.1554992] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/03/2025] [Accepted: 02/19/2025] [Indexed: 04/05/2025]
Abstract
Introduction Bacterial wilt (BW) caused by Ralstonia pseudosolanacearum is a devastating soil-borne disease. Bacteriophages are important biocontrol resources that rapidly and specifically lyse host bacteria, showing good application potential in agricultural production. Methods This study isolated nine phages (YL1 to YL9) and, using host range and pot experiments, identified two broader host range phages (YL1 and YL4) and two higher control efficacy phages (YL2 and YL3), which were combined to obtain five cocktails (BPC-1 to BPC-5). Results Pot experiments showed that BPC-1 (YL3 and YL4) had the highest control efficacy (99.25%). Biological characterization revealed that these four phages had substantial thermal stability and pH tolerance. Whole genome sequencing and analysis showed that YL1, YL2, YL3, and YL4 belonged to the genus Gervaisevirus. AlphaFold 3 predictions of tail fiber protein II structures showed that YL1 differed significantly from the other phages. Amino acid sequence alignment revealed that the "tip domain" of ORF66 (YL1) contained a higher proportion of aromatic and positively charged amino acids. However, the surface of the ORF69 (YL4) "tip domain" exhibited more positively charged residues than ORF66 (YL2) and ORF70 (YL3). These characteristics are hypothesized to confer a broader host range to YL1 and YL4. Discussion This study demonstrates that phages assembling a broad host range and high control efficacy have better biocontrol potential, providing high-quality resources for the biological control of BW.
Affiliation(s)
- Haoxin He
  - College of Plant Protection, Hunan Agricultural University, Changsha, China
- Ke Yi
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Lei Yang
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Yongfeng Jing
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Lifu Kang
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Zhihao Gao
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Dong Xiang
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Ge Tan
  - Tobacco Leaf Raw Material Procurement Center, China Tobacco Hunan Industrial Co., Ltd, Changsha, China
- Yunsheng Wang
  - College of Plant Protection, Hunan Agricultural University, Changsha, China
- Qian Liu
  - College of Plant Protection, Hunan Agricultural University, Changsha, China
- Lin Xie
  - College of Plant Protection, Hunan Agricultural University, Changsha, China
- Shiya Jiang
  - College of Plant Protection, Hunan Agricultural University, Changsha, China
- Tianbo Liu
  - Plant Protection Research Center, Hunan Tobacco Science Research Institute, Changsha, China
- Wu Chen
  - College of Plant Protection, Hunan Agricultural University, Changsha, China

7. Lalanne C, Silar P. FungANI, a BLAST-based program for analyzing average nucleotide identity (ANI) between two fungal genomes, enables easy fungal species delimitation. Fungal Genet Biol 2025; 177:103969. [PMID: 39894199 DOI: 10.1016/j.fgb.2025.103969] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Revised: 12/18/2024] [Accepted: 01/25/2025] [Indexed: 02/04/2025]
Abstract
Fungal species delimitation and phylogeny will likely rely in the future upon whole genome sequence comparison, as the costs of such sequences are rapidly decreasing. Average Nucleotide Identity (ANI) between genomes is a convenient metric that can be rapidly calculated for species delimitation. However, there is presently no easy-to-use program that calculates the ANI between two fungal genomes and provides easy-to-interpret results that can help mycologists with limited access to bioinformatic facilities. Here, we present FungANI, a customizable BLAST-based program that calculates ANI between genomes. The program primarily targets Linux workstations or servers but it can be run on the latest Windows, macOS and Linux 64-bit operating systems as a standalone desktop application. It was tested with various publicly available genomes from species belonging to the Sordariales order. It proved efficient at differentiating closely related species and retracing their possible phylogenetic relationships. However, FungANI did not perform well for phylogenetic reconstruction on a broader evolutionary scale such as inferring relationships between distant genera. The program is freely available at https://github.com/podo-gec/fungani.
Affiliation(s)
- Christophe Lalanne
  - Univ Paris Paris Cité, Laboratoire Interdisciplinaire des Energies de Demain, 75205 Paris Cité CEDEX 13, France
- Philippe Silar
  - Univ Paris Paris Cité, Laboratoire Interdisciplinaire des Energies de Demain, 75205 Paris Cité CEDEX 13, France

8. Majidian S, Hwang S, Zakeri M, Langmead B. EvANI benchmarking workflow for evolutionary distance estimation. bioRxiv 2025:2025.02.23.639716. [PMID: 40027788 PMCID: PMC11870633 DOI: 10.1101/2025.02.23.639716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These make it possible to compare genome sequences across the Tree of Life, deepening our understanding of evolutionary relationships. Average nucleotide identity (ANI) is a distance measure that has been applied to species delineation, building of guide trees, and searching large sequence databases. Since computing ANI is computationally expensive, the field has increasingly turned to sketch-based approaches that use assumptions and heuristics to speed this up. We propose a suite of simulated and real benchmark datasets, together with a rank-correlation-based metric, to study how these assumptions and heuristics impact distance estimates. We call this evaluation framework EvANI. With EvANI, we show that ANIb is the ANI estimation algorithm that best captures tree distance, though it is also the least efficient. We show that k-mer based approaches are extremely efficient and have consistently strong accuracy. We also show that some clades have inter-sequence distances that are best computed using multiple values of k, e.g. k = 10 and k = 19 for Chlamydiales. Finally, we highlight that approaches based on maximal exact matches may represent an advantageous compromise, achieving an intermediate level of computational efficiency while avoiding over-reliance on a single fixed k-mer length.
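A rank-correlation comparison of the kind mentioned in this abstract can be illustrated with Spearman's coefficient between ANI estimates and tree distances for the same genome pairs. The abstract does not name the specific statistic EvANI uses, so Spearman's rho is an assumption here, as are the function names; the sketch is plain Python with average ranks for ties.

```python
from math import sqrt


def average_ranks(values: list) -> list:
    """Rank values from 1..n, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared_rank = (i + j) / 2 + 1  # mean of 1-based positions i+1 .. j+1
        for t in range(i, j + 1):
            ranks[order[t]] = shared_rank
        i = j + 1
    return ranks


def pearson(x: list, y: list) -> float:
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def spearman(x: list, y: list) -> float:
    """Spearman's rho: the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))
```

Since ANI is a similarity and tree distance is a distance, a good estimator should give a strongly negative coefficient: pairs that are far apart on the tree should rank low in ANI. Working on ranks means the comparison does not assume a linear relationship between the two quantities.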
Affiliation(s)
- Sina Majidian
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Stephen Hwang
  - XDBio Program, Johns Hopkins University, Baltimore, USA
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA

9. Ndovie W, Havránek J, Leconte J, Koszucki J, Chindelevitch L, Adriaenssens EM, Mostowy RJ. Exploration of the genetic landscape of bacterial dsDNA viruses reveals an ANI gap amid extensive mosaicism. mSystems 2025; 10:e0166124. [PMID: 39878503 PMCID: PMC11834439 DOI: 10.1128/msystems.01661-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2024] [Accepted: 01/06/2025] [Indexed: 01/31/2025]
Abstract
Average nucleotide identity (ANI) is a widely used metric to estimate genetic relatedness, especially in microbial species delineation. While ANI calculation has been well optimized for bacteria and closely related viral genomes, accurate estimation of ANI below 80%, particularly in large reference data sets, has been challenging due to a lack of accurate and scalable methods. To bridge this gap, we introduce MANIAC, an efficient computational pipeline optimized for estimating ANI and alignment fraction (AF) in viral genomes with divergence around ANI of 70%. Using a rigorous simulation framework, we demonstrate MANIAC's accuracy and scalability compared to existing approaches, even to data sets of hundreds of thousands of viral genomes. Applying MANIAC to a curated data set of complete bacterial dsDNA viruses revealed a multimodal ANI distribution, with a distinct gap around 80%, akin to the bacterial ANI gap (~90%) but shifted, likely due to viral-specific evolutionary processes such as recombination dynamics and mosaicism. We then evaluated ANI and AF as predictors of genus-level taxonomy using a logistic regression model. We found that this model has strong predictive power (PR-AUC = 0.981), but that it works much better for virulent (PR-AUC = 0.997) than temperate (PR-AUC = 0.847) bacterial viruses. This highlights the complexity of taxonomic classification in temperate phages, known for their extensive mosaicism, and cautions against over-reliance on ANI in such cases. MANIAC can be accessed at https://github.com/bioinf-mcb/MANIAC.
IMPORTANCE We introduce a novel computational pipeline called MANIAC, designed to accurately assess average nucleotide identity (ANI) and alignment fraction (AF) between diverse viral genomes, scalable to data sets of over 100k genomes. Using computer simulations and real data analyses, we show that MANIAC could accurately estimate genetic relatedness between pairs of viral genomes of around 60%-70% ANI. We applied MANIAC to investigate the question of ANI discontinuity in bacterial dsDNA viruses, finding evidence for an ANI gap, akin to the one seen in bacteria but around ANI of 80%. We then assessed the ability of ANI and AF to predict taxonomic genus boundaries, finding its strong predictive power in virulent, but not in temperate phages. Our results suggest that bacterial dsDNA viruses may exhibit an ANI threshold (on average around 80%) above which recombination helps maintain population cohesiveness, as previously argued in bacteria.
Affiliation(s)
- Wanangwa Ndovie
  - Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  - Doctoral School of Exact and Natural Sciences, Jagiellonian University, Kraków, Poland
- Jan Havránek
  - Faculty of Biochemistry, Biophysics and Biotechnology, Jagiellonian University, Kraków, Poland
- Jade Leconte
  - Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
- Janusz Koszucki
  - Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland
  - Doctoral School of Exact and Natural Sciences, Jagiellonian University, Kraków, Poland
- Leonid Chindelevitch
  - Department of Infectious Disease Epidemiology, School of Public Health, Imperial College London, London, United Kingdom
- Rafal J. Mostowy
  - Malopolska Centre of Biotechnology, Jagiellonian University, Kraków, Poland

10. Salamzade R, Tran P, Martin C, Manson A, Gilmore M, Earl A, Anantharaman K, Kalan L. zol and fai: large-scale targeted detection and evolutionary investigation of gene clusters. Nucleic Acids Res 2025; 53:gkaf045. [PMID: 39907107 PMCID: PMC11795205 DOI: 10.1093/nar/gkaf045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2024] [Revised: 12/06/2024] [Accepted: 01/24/2025] [Indexed: 02/06/2025]
Abstract
Many universally and conditionally important genes are genomically aggregated within clusters. Here, we introduce fai and zol, which together enable large-scale comparative analysis of different types of gene clusters and mobile-genetic elements, such as biosynthetic gene clusters (BGCs) or viruses. Fundamentally, they overcome a current bottleneck to reliably perform comprehensive orthology inference at large scale across broad taxonomic contexts and thousands of genomes. First, fai allows the identification of orthologous instances of a query gene cluster of interest amongst a database of target genomes. Subsequently, zol enables reliable, context-specific inference of ortholog groups for individual protein-encoding genes across gene cluster instances. In addition, zol performs functional annotation and computes a variety of evolutionary statistics for each inferred ortholog group. Importantly, in comparison to tools for visual exploration of homologous relationships between gene clusters, zol can scale to handle thousands of gene cluster instances and produce detailed reports that are easy to digest. To showcase fai and zol, we apply them for: (i) longitudinal tracking of a virus in metagenomes, (ii) performing population genetic investigations of BGCs for a fungal species, and (iii) uncovering evolutionary trends for a virulence-associated gene cluster across thousands of genomes from a diverse bacterial genus.
Affiliation(s)
- Rauf Salamzade
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, 53706, United States
  - Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, 53706, United States
- Patricia Q Tran
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, United States
  - Freshwater and Marine Science Doctoral Program, University of Wisconsin-Madison, Madison, WI, 53706, United States
- Cody Martin
  - Microbiology Doctoral Training Program, University of Wisconsin-Madison, Madison, WI, 53706, United States
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, United States
- Abigail L Manson
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States
- Michael S Gilmore
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States
  - Department of Ophthalmology, Harvard Medical School and Massachusetts Eye and Ear, Boston, MA, 02114, United States
  - Department of Microbiology, Harvard Medical School and Massachusetts Eye and Ear, Boston, MA, 02115, United States
- Ashlee M Earl
  - Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, United States
- Karthik Anantharaman
  - Department of Bacteriology, University of Wisconsin-Madison, Madison, WI, 53706, United States
- Lindsay R Kalan
  - Department of Medical Microbiology and Immunology, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, 53706, United States
  - Department of Medicine, Division of Infectious Disease, School of Medicine and Public Health, University of Wisconsin-Madison, Madison, WI, 53705, United States
  - M.G. DeGroote Institute for Infectious Disease Research, David Braley Centre for Antibiotic Discovery, McMaster University, Hamilton, Ontario, L8S 4L8, Canada
  - Department of Biochemistry and Biomedical Sciences, McMaster University, Hamilton, Ontario, L8S 4K1, Canada

11. Elmanzalawi M, Fujisawa T, Mori H, Nakamura Y, Tanizawa Y. DFAST_QC: quality assessment and taxonomic identification tool for prokaryotic Genomes. BMC Bioinformatics 2025; 26:3. [PMID: 39773409 PMCID: PMC11705978 DOI: 10.1186/s12859-024-06030-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/23/2024] [Accepted: 12/27/2024] [Indexed: 01/11/2025]
Abstract
BACKGROUND Accurate taxonomic classification in genome databases is essential for reliable biological research and effective data sharing. Mislabeling or inaccuracies in genome annotations can lead to incorrect scientific conclusions and hinder the reproducibility of research findings. Despite advances in genome analysis techniques, challenges persist in ensuring precise and reliable taxonomic assignments. Existing tools for genome verification often involve extensive computational resources or lengthy processing times, which can limit their accessibility and scalability for large-scale projects. There is a need for more efficient, user-friendly solutions that can handle diverse datasets and provide accurate results with minimal computational demands. This work aimed to address these challenges by introducing a novel tool that enhances taxonomic accuracy, offers a user-friendly interface, and supports large-scale analyses. RESULTS We introduce a novel tool for the quality control and taxonomic classification of prokaryotic genomes, called DFAST_QC, which is available as both a command-line tool and a web service. DFAST_QC can quickly identify species based on NCBI and GTDB taxonomies by combining genome-distance calculations using MASH with ANI calculations using Skani. We evaluated DFAST_QC's performance in species identification and found it to be highly consistent with existing taxonomic standards, successfully identifying species across diverse datasets. In several cases, DFAST_QC identified potential mislabeling of species names in public databases and highlighted discrepancies in current classifications, demonstrating its capability to uncover errors and enhance taxonomic accuracy. Additionally, the tool's efficient design allows it to operate smoothly on local machines with minimal computational requirements, making it a practical choice for large-scale genome projects.
CONCLUSIONS DFAST_QC is a reliable and efficient tool for accurate taxonomic identification and genome quality control, well-suited for large-scale genomic studies. Its compatibility with limited-resource environments, combined with its user-friendly design, ensures seamless integration into existing workflows. DFAST_QC's ability to refine species assignments in public databases highlights its value as a complementary tool for maintaining and enhancing the accuracy of taxonomic data in genomic research. The web version is available at https://dfast.ddbj.nig.ac.jp/dqc/submit/, and the source code for local use can be found at https://github.com/nigyta/dfast_qc.
Affiliation(s)
- Mohamed Elmanzalawi
  - Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima, 411-8540, Japan
- Takatomo Fujisawa
  - Department of Informatics, National Institute of Genetics, Mishima, 411-8540, Japan
- Hiroshi Mori
  - Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima, 411-8540, Japan
  - Department of Informatics, National Institute of Genetics, Mishima, 411-8540, Japan
- Yasukazu Nakamura
  - Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima, 411-8540, Japan
  - Department of Informatics, National Institute of Genetics, Mishima, 411-8540, Japan
- Yasuhiro Tanizawa
  - Department of Genetics, School of Life Science, The Graduate University for Advanced Studies (SOKENDAI), Mishima, 411-8540, Japan
  - Department of Informatics, National Institute of Genetics, Mishima, 411-8540, Japan
|
12
|
Dmitrijeva M, Ruscheweyh HJ, Feer L, Li K, Miravet-Verde S, Sintsova A, Mende DR, Zeller G, Sunagawa S. The mOTUs online database provides web-accessible genomic context to taxonomic profiling of microbial communities. Nucleic Acids Res 2025; 53:D797-D805. [PMID: 39526369 PMCID: PMC11701688 DOI: 10.1093/nar/gkae1004] [Received: 08/15/2024] [Revised: 10/03/2024] [Accepted: 10/22/2024] [Indexed: 11/16/2024] Open
Abstract
Determining the taxonomic composition (taxonomic profiling) is a fundamental task in studying environmental and host-associated microbial communities. However, genome-resolved microbial diversity on Earth remains undersampled, and accessing the genomic context of taxa detected during taxonomic profiling remains a challenging task. Here, we present the mOTUs online database (mOTUs-db), which is consistent with and interfaces with the mOTUs taxonomic profiling tool. It comprises 2.83 million metagenome-assembled genomes (MAGs) and 919 090 single-cell and isolate genomes from 124 295 species-level taxonomic units. In addition to being one of the largest prokaryotic genome resources to date, all MAGs in the mOTUs-db were reconstructed de novo in 117 902 individual samples by abundance correlation of scaffolds across multiple samples for improved quality metrics. The database complements the Genome Taxonomy Database, with over 50% of its species-level taxonomic groups being unique. It also offers interactive querying, enabling users to explore and download genomes at various taxonomic levels. The mOTUs-db is accessible at https://motus-db.org.
Affiliation(s)
- Marija Dmitrijeva
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Hans-Joachim Ruscheweyh
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Lilith Feer
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Kang Li
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Samuel Miravet-Verde
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Anna Sintsova
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
- Daniel R Mende
  - Medical Microbiology and Infection Prevention (MMI), Amsterdam University Medical Center, 1105AZ Amsterdam, The Netherlands
- Georg Zeller
  - Structural and Computational Biology Unit, European Molecular Biology Laboratory, 69117 Heidelberg, Germany
  - Leiden University Center for Infectious Diseases (LUCID), Leiden University Medical Center, 2333ZA Leiden, The Netherlands
  - Center for Microbiome Analyses and Therapeutics, Leiden University Medical Center, 2333ZA Leiden, The Netherlands
- Shinichi Sunagawa
  - Department of Biology, Institute of Microbiology and Swiss Institute of Bioinformatics, ETH Zürich, 8093 Zürich, Switzerland
|
13
|
Michoud G, Peter H, Busi SB, Bourquin M, Kohler TJ, Geers A, Ezzat L, Battin TJ. Mapping the metagenomic diversity of the multi-kingdom glacier-fed stream microbiome. Nat Microbiol 2025; 10:217-230. [PMID: 39747693 DOI: 10.1038/s41564-024-01874-9] [Received: 02/09/2024] [Accepted: 10/29/2024] [Indexed: 01/04/2025]
Abstract
Glacier-fed streams (GFS) feature among Earth's most extreme aquatic ecosystems marked by pronounced oligotrophy and environmental fluctuations. Microorganisms mainly organize in biofilms within them, but how they cope with such conditions is unknown. Here, leveraging 156 metagenomes from the Vanishing Glaciers project obtained from sediment samples in GFS from 9 mountain ranges, we report thousands of metagenome-assembled genomes (MAGs) encompassing prokaryotes, algae, fungi and viruses, that shed light on biotic interactions within glacier-fed stream biofilms. A total of 2,855 bacterial MAGs were characterized by diverse strategies to exploit inorganic and organic energy sources, in part via functional redundancy and mixotrophy. We show that biofilms probably become more complex and switch from chemoautotrophy to heterotrophy as algal biomass increases in GFS owing to glacier shrinkage. Our MAG compendium sheds light on the success of microbial life in GFS and provides a resource for future research on a microbiome potentially impacted by climate change.
Affiliation(s)
- Grégoire Michoud
  - River Ecosystems Laboratory, Alpine and Polar Environmental Research Center, ENAC, Ecole Polytechnique Fédérale de Lausanne, Sion, Switzerland
- Hannes Peter
  - River Ecosystems Laboratory, Alpine and Polar Environmental Research Center, ENAC, Ecole Polytechnique Fédérale de Lausanne, Sion, Switzerland
- Massimo Bourquin
  - River Ecosystems Laboratory, Alpine and Polar Environmental Research Center, ENAC, Ecole Polytechnique Fédérale de Lausanne, Sion, Switzerland
- Tyler J Kohler
  - Department of Ecology, Faculty of Science, Charles University, Prague, Czechia
- Aileen Geers
  - River Ecosystems Laboratory, Alpine and Polar Environmental Research Center, ENAC, Ecole Polytechnique Fédérale de Lausanne, Sion, Switzerland
- Leila Ezzat
  - MARBEC, University of Montpellier, CNRS, Ifremer, IRD, Montpellier, France
- Tom J Battin
  - River Ecosystems Laboratory, Alpine and Polar Environmental Research Center, ENAC, Ecole Polytechnique Fédérale de Lausanne, Sion, Switzerland
|
14
|
Yu ZL, Wang RB. Revised taxonomic classification of the Stenotrophomonas genomes, providing new insights into the genus Stenotrophomonas. Front Microbiol 2024; 15:1488674. [PMID: 39726962 PMCID: PMC11669713 DOI: 10.3389/fmicb.2024.1488674] [Received: 08/30/2024] [Accepted: 11/18/2024] [Indexed: 12/28/2024] Open
Abstract
Background Stenotrophomonas strains are important opportunistic pathogens with great potential applications in industry and agriculture. Their significant genetic and phenotypic diversity has led to several changes in their taxonomic placement and has made them prone to inaccurate species classification by traditional identification methods. Methods All 2,615 genomes of the genus Stenotrophomonas were obtained from the NCBI genome database. Genomic methods, including average nucleotide identity (ANI), were used to evaluate the 31 defined species. After evaluating the ANI thresholds applicable to Stenotrophomonas, the species classification of all submitted genomes was revised. Results Compared to the reference genomes of each species, 41.17% of the submitted Stenotrophomonas genomes had ANI values below 95%, and 8.58% of the genomes were even below 90%. Moreover, 45.3% (705/1555) of the S. maltophilia strains actually belonged to other species within the S. maltophilia complex (Smc), or even to distant relatives outside the Smc. Based on the ANI threshold values of 95% and 90% for species and complexes, respectively, which were confirmed to be applicable to Stenotrophomonas, 2,213 submitted Stenotrophomonas genomes were re-divided into 116 ANI genome species. Conclusion The results confirmed that 16S rRNA gene sequencing has low discriminability for closely related Stenotrophomonas species. The annotated species of a considerable number of strains were indeed incorrect, especially since many S. maltophilia strains did not belong to this representative pathogenic species of Stenotrophomonas. This makes it necessary to reconsider the evolutionary relationship, pathogenicity, and clinical significance of Stenotrophomonas.
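The two-threshold rule in the abstract above (95% ANI for species, 90% ANI for complexes) can be illustrated with a minimal Python sketch. The function names, the returned labels, and the choice to treat a value exactly at a threshold as meeting it are assumptions made for illustration; the abstract does not specify them, and this is not the authors' pipeline.

```python
# Thresholds (in percent) as stated in the abstract.
SPECIES_ANI = 95.0
COMPLEX_ANI = 90.0


def classify_by_ani(ani_percent: float) -> str:
    """Place a genome relative to a reference using its ANI (hypothetical helper).

    Assumption: values exactly at a threshold count as meeting it.
    """
    if not 0.0 <= ani_percent <= 100.0:
        raise ValueError("ANI must be a percentage between 0 and 100")
    if ani_percent >= SPECIES_ANI:
        return "same species"
    if ani_percent >= COMPLEX_ANI:
        return "same complex, different species"
    return "outside complex"


def percent(part: int, total: int) -> float:
    """Share of `part` in `total`, as a percentage rounded to one decimal."""
    return round(100.0 * part / total, 1)


if __name__ == "__main__":
    for ani in (97.3, 92.0, 85.5):
        print(ani, "->", classify_by_ani(ani))
    # Reproduces the proportion quoted in the abstract (705 of 1555 strains).
    print(percent(705, 1555))
```

The last line recomputes the reported share of reassigned S. maltophilia strains from the counts given in the abstract.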
Affiliation(s)
- Rui-Bai Wang
  - National Key Laboratory of Intelligent Tracking and Forecasting for Infectious Disease, National Institute for Communicable Disease Control and Prevention, Chinese Center for Disease Control and Prevention, Beijing, China
|
15
|
Marçais G, Elder CS, Kingsford C. k-nonical space: sketching with reverse complements. Bioinformatics 2024; 40:btae629. [PMID: 39432565 PMCID: PMC11549021 DOI: 10.1093/bioinformatics/btae629] [Received: 03/21/2024] [Revised: 10/01/2024] [Accepted: 10/17/2024] [Indexed: 10/23/2024] Open
Abstract
MOTIVATION Sequences equivalent to their reverse complements (i.e. double-stranded DNA) have no analogue in text analysis and non-biological string algorithms. Despite this striking difference, algorithms designed for computational biology (e.g. sketching algorithms) are designed and tested in the same way as classical string algorithms. Then, as a post-processing step, these algorithms are adapted to work with genomic sequences by folding a k-mer and its reverse complement into a single sequence: the canonical representation (k-nonical space). RESULTS The effect of using the canonical representation with sketching methods is understudied and poorly understood. As a first step, we use context-free sketching methods to illustrate the potentially detrimental effects of using canonical k-mers with string algorithms not designed to accommodate them. In particular, we show that large stretches of the genome ("sketching deserts") are undersampled or entirely skipped by context-free sketching methods, effectively making these genomic regions invisible to subsequent algorithms using these sketches. We provide empirical data showing these effects and develop a theoretical framework explaining the appearance of sketching deserts. Finally, we propose two schemes to accommodate these effects: (i) a new procedure that adapts existing sketching methods to k-nonical space and (ii) an optimization procedure to directly design new sketching methods for k-nonical space. AVAILABILITY AND IMPLEMENTATION The code used in this analysis is available under a permissive license at https://github.com/Kingsford-Group/mdsscope.
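The folding step described in the abstract above, in which a k-mer and its reverse complement are mapped to one canonical representative, can be sketched in Python as follows. Taking the lexicographically smaller of the two strings is a common convention and is an assumption here; the function names are illustrative and are not taken from the mdsscope code.

```python
# Translation table for complementing DNA bases (uppercase ACGT only).
_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def canonical(kmer: str) -> str:
    """Fold a k-mer and its reverse complement into one representative.

    Assumption: the lexicographically smaller string is the canonical form.
    """
    return min(kmer, reverse_complement(kmer))


def canonical_kmers(seq: str, k: int) -> list[str]:
    """List the canonical form of every k-mer in `seq`, in order."""
    return [canonical(seq[i:i + k]) for i in range(len(seq) - k + 1)]


if __name__ == "__main__":
    forward = "AAAT"
    reverse = reverse_complement(forward)
    # Both strands yield the same multiset of canonical k-mers.
    print(sorted(canonical_kmers(forward, 3)))
    print(sorted(canonical_kmers(reverse, 3)))
```

Because both strands produce the same canonical k-mers, any sketch built on them is strand-independent, which is the property the abstract says is usually bolted on as a post-processing step.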
Affiliation(s)
- Guillaume Marçais
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- C S Elder
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Carl Kingsford
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, United States
|
16
|
Kazantseva E, Donmez A, Frolova M, Pop M, Kolmogorov M. Strainy: phasing and assembly of strain haplotypes from long-read metagenome sequencing. Nat Methods 2024; 21:2034-2043. [PMID: 39327484 DOI: 10.1038/s41592-024-02424-1] [Received: 11/22/2023] [Accepted: 08/22/2024] [Indexed: 09/28/2024]
Abstract
Bacterial species in microbial communities are often represented by mixtures of strains, distinguished by small variations in their genomes. Short-read approaches can be used to detect small-scale variation between strains but fail to phase these variants into contiguous haplotypes. Long-read metagenome assemblers can generate contiguous bacterial chromosomes but often suppress strain-level variation in favor of species-level consensus. Here we present Strainy, an algorithm for strain-level metagenome assembly and phasing from Nanopore and PacBio reads. Strainy takes a de novo metagenomic assembly as input and identifies strain variants, which are then phased and assembled into contiguous haplotypes. Using simulated and mock Nanopore and PacBio metagenome data, we show that Strainy assembles accurate and complete strain haplotypes, outperforming current Nanopore-based methods and comparable with PacBio-based algorithms in completeness and accuracy. We then use Strainy to assemble strain haplotypes of a complex environmental metagenome, revealing distinct strain distribution and mutational patterns in bacterial species.
Affiliation(s)
- Ekaterina Kazantseva
  - Bioinformatics and Systems Biology Program, ITMO University, St. Petersburg, Russia
- Ataberk Donmez
  - Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Maria Frolova
  - Functional Genomics of Prokaryotes Laboratory, Institute of Cell Biophysics, RAS, Pushchino, Russia
- Mihai Pop
  - Department of Computer Science, University of Maryland, College Park, MD, USA
- Mikhail Kolmogorov
  - Cancer Data Science Laboratory, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
|
17
|
Hall MB, Wick RR, Judd LM, Nguyen AN, Steinig EJ, Xie O, Davies M, Seemann T, Stinear TP, Coin L. Benchmarking reveals superiority of deep learning variant callers on bacterial nanopore sequence data. eLife 2024; 13:RP98300. [PMID: 39388235 PMCID: PMC11466455 DOI: 10.7554/elife.98300] [Indexed: 10/12/2024] Open
Abstract
Variant calling is fundamental in bacterial genomics, underpinning the identification of disease transmission clusters, the construction of phylogenetic trees, and antimicrobial resistance detection. This study presents a comprehensive benchmarking of variant calling accuracy in bacterial genomes using Oxford Nanopore Technologies (ONT) sequencing data. We evaluated three ONT basecalling models and both simplex (single-strand) and duplex (dual-strand) read types across 14 diverse bacterial species. Our findings reveal that deep learning-based variant callers, particularly Clair3 and DeepVariant, significantly outperform traditional methods and even exceed the accuracy of Illumina sequencing, especially when applied to ONT's super-high accuracy model. ONT's superior performance is attributed to its ability to overcome Illumina's errors, which often arise from difficulties in aligning reads in repetitive and variant-dense genomic regions. Moreover, the use of high-performing variant callers with ONT's super-high accuracy data mitigates ONT's traditional errors in homopolymers. We also investigated the impact of read depth on variant calling, demonstrating that 10× depth of ONT super-accuracy data can achieve precision and recall comparable to, or better than, full-depth Illumina sequencing. These results underscore the potential of ONT sequencing, combined with advanced variant calling algorithms, to replace traditional short-read sequencing methods in bacterial genomics, particularly in resource-limited settings.
Affiliation(s)
- Michael B Hall
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Ryan R Wick
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Centre for Pathogen Genomics, The University of Melbourne, Melbourne, Australia
- Louise M Judd
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Centre for Pathogen Genomics, The University of Melbourne, Melbourne, Australia
- An N Nguyen
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Eike J Steinig
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Ouli Xie
  - Department of Infectious Diseases, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Monash Infectious Diseases, Monash Health, Melbourne, Australia
- Mark Davies
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Torsten Seemann
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Centre for Pathogen Genomics, The University of Melbourne, Melbourne, Australia
- Timothy P Stinear
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
  - Centre for Pathogen Genomics, The University of Melbourne, Melbourne, Australia
- Lachlan Coin
  - Department of Microbiology and Immunology, The University of Melbourne, at the Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
|
18
|
Shaw J, Yu YW. Rapid species-level metagenome profiling and containment estimation with sylph. Nat Biotechnol 2024:10.1038/s41587-024-02412-y. [PMID: 39379646 DOI: 10.1038/s41587-024-02412-y] [Received: 01/23/2024] [Accepted: 08/28/2024] [Indexed: 10/10/2024]
Abstract
Profiling metagenomes against databases allows for the detection and quantification of microorganisms, even at low abundances where assembly is not possible. We introduce sylph, a species-level metagenome profiler that estimates genome-to-metagenome containment average nucleotide identity (ANI) through zero-inflated Poisson k-mer statistics, enabling ANI-based taxa detection. On the Critical Assessment of Metagenome Interpretation II (CAMI2) Marine dataset, sylph was the most accurate profiling method of seven tested. For multisample profiling, sylph took >10-fold less central processing unit time compared to Kraken2 and used 30-fold less memory. Sylph's ANI estimates provided an orthogonal signal to abundance, allowing for an ANI-based metagenome-wide association study for Parkinson disease (PD) against 289,232 genomes while confirming known butyrate-PD associations at the strain level. Sylph took <1 min and 16 GB of random-access memory to profile metagenomes against 85,205 prokaryotic and 2,917,516 viral genomes, detecting 30-fold more viral sequences in the human gut compared to RefSeq. Sylph offers precise, efficient profiling with accurate containment ANI estimation even for low-coverage genomes.
Affiliation(s)
- Jim Shaw
  - Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
  - Department of Data Science, Dana-Farber Cancer Institute, Boston, MA, USA
  - Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, USA
|
19
|
Rey Redondo E, Leung SKK, Yung CCM. Genomic and biogeographic characterisation of the novel prasinovirus Mantoniella tinhauana virus 1. Environ Microbiol Rep 2024; 16:e70020. [PMID: 39392286 PMCID: PMC11467894 DOI: 10.1111/1758-2229.70020] [Received: 05/04/2024] [Accepted: 09/10/2024] [Indexed: 10/12/2024]
Abstract
Mamiellophyceae are a ubiquitous class of unicellular green algae in the global ocean. Their ecological importance is highlighted in studies focused on the prominent genera Micromonas, Ostreococcus, and Bathycoccus. Mamiellophyceae are susceptible to prasinoviruses, double-stranded DNA viruses belonging to the nucleocytoplasmic large DNA virus group. Our study represents the first isolation of a prasinovirus in the South China Sea and the only one to infect the globally distributed genus Mantoniella. We conducted a comparative analysis with previously identified viral relatives, encompassing morphological characteristics, host specificity, marker-based phylogenetic placement, and whole-genome sequence comparisons. Although it shares morphological and genetic similarities with established prasinoviruses, this novel virus showed distinct genetic traits, confining its infection to the species Mantoniella tinhauana. We also explored the global biogeography of this prasinovirus and its host by mapping metagenomic data and analysing their relationship with various environmental parameters. Our results demonstrate a pronounced link between the virus and its host, both found predominantly in higher latitudes in the surface ocean. By gaining an increased understanding of the relationships between viruses, hosts, and environments, researchers can better make predictions and potentially implement measures to mitigate the consequences of climate change on oceanic processes.
Affiliation(s)
- Elvira Rey Redondo
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong SAR
- Shara Ka Kiu Leung
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong SAR
- Charmaine Cheuk Man Yung
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong, Hong Kong SAR
|
20
|
Butel-Simoes G, Steinig E, Savic I, Zhanduisenov M, Papadakis G, Tran T, Moselen J, Caly L, Williamson DA, Lim CK. Optimising nucleic acid recovery from rapid antigen tests for whole genome sequencing of respiratory viruses. J Clin Virol 2024; 174:105714. [PMID: 39038394 DOI: 10.1016/j.jcv.2024.105714] [Received: 05/18/2024] [Revised: 07/09/2024] [Accepted: 07/11/2024] [Indexed: 07/24/2024]
Abstract
BACKGROUND Whole genome sequencing (WGS) of respiratory viruses from rapid antigen tests (RAT-WGS) is a novel approach to expanding genomic surveillance of respiratory infections. To date, however, there are limited data on the genomic stability of these viruses on RATs. In this study, we investigated the effect of storage conditions and nucleic acid preservatives on the stability and recovery of respiratory virus genomes from RATs. METHODS A mixture of common respiratory viruses was used to inoculate RATs at different environmental temperatures (4 °C, 20 °C and 36 °C), with two preservative reagents (RNALater and DNA/RNA Shield). Nucleic acid was extracted from RATs at two different timepoints (72 h and seven days) and subjected to real-time multiplex respiratory PCR to detect a range of respiratory viruses. WGS was performed using target-enrichment with the TWIST Comprehensive Viral Research Panel. Defined metrics from an automated in-house bioinformatic pipeline were used to assess and compare viral genome recovery under different conditions. RESULTS Nucleic acid degradation (indicated by relative change in PCR cycle threshold and WGS-based metrics) was most notable at 20 °C and 36 °C. Storage in either RNALater or DNA/RNA Shield improved genome recovery for respiratory viruses across all temperature conditions, although this was most pronounced for RNALater. Subtyping of influenza viruses demonstrated the applicability of RAT-WGS in downstream genomic epidemiological surveillance. CONCLUSIONS Under simulated conditions, RAT-WGS demonstrated that (i) viral genomes were generally stable at 4 °C at 72 h and 1 week, (ii) RNALater preserved nucleic acids more effectively than DNA/RNA Shield and (iii) genome recovery can be achieved using a sequencing depth of 500,000 reads per sample in RNALater, across all respiratory viruses and conditions.
Affiliation(s)
- G Butel-Simoes
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- E Steinig
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia; Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- I Savic
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- M Zhanduisenov
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- G Papadakis
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- T Tran
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- J Moselen
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- L Caly
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia; Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- D A Williamson
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia; Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
- C K Lim
  - Victorian Infectious Diseases Reference Laboratory, The Royal Melbourne Hospital at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia; Department of Infectious Diseases, The University of Melbourne at the Peter Doherty Institute for Infection and Immunity, Melbourne, Victoria, Australia
|
21
|
Ishak S, Rondeau-Leclaire J, Faticov M, Roy S, Laforest-Lapointe I. Boreal moss-microbe interactions are revealed through metagenome assembly of novel bacterial species. Sci Rep 2024; 14:22168. [PMID: 39333734 PMCID: PMC11437008 DOI: 10.1038/s41598-024-73045-z] [Received: 06/17/2024] [Accepted: 09/12/2024] [Indexed: 09/30/2024] Open
Abstract
Moss-microbe interactions contribute to ecosystem processes in boreal forests. Yet, how host-specific characteristics and the environment drive the composition and metabolic potential of moss microbiomes is still poorly understood. In this study, we use shotgun metagenomics to identify the taxonomy and metabolic potential of the bacteria of four moss species of the boreal forests of Northern Québec, Canada. To characterize moss bacterial community composition and diversity, we assembled the genomes of 110 potentially novel bacterial species. Our results highlight that moss genus, species, gametophyte section, and to a lesser extent soil pH and soil temperature, drive moss-associated bacterial community composition and diversity. In the brown gametophyte section, two Stigonema spp. showed partial pathway completeness for photosynthesis and nitrogen fixation, while all brown-associated Hyphomicrobiales had complete assimilatory nitrate reduction pathways and many nearly complete carbon fixation pathways. Several brown-associated species showed partial to complete pathways for coenzyme M and F420 biosynthesis, important for methane metabolism. In addition, green-associated Hyphomicrobiales (Methylobacteria spp.) displayed potential for the anoxygenic photosystem II pathway. Overall, our findings demonstrate how host-specific characteristics and environmental factors shape the composition and metabolic potential of moss bacteria, highlighting their roles in carbon fixation, nitrogen cycling, and methane metabolism in boreal forests.
Affiliation(s)
- Sarah Ishak
  - Département de Biologie, Université de Sherbrooke, Sherbrooke, QC, Canada
  - Centre d'Étude de la Forêt, Université du Québec à Montréal, Montréal, QC, Canada
- Maria Faticov
  - Département de Biologie, Université de Sherbrooke, Sherbrooke, QC, Canada
  - Centre SÈVE, Université de Sherbrooke, Sherbrooke, QC, Canada
  - Centre d'Étude de la Forêt, Université du Québec à Montréal, Montréal, QC, Canada
- Sébastien Roy
  - Département de Biologie, Université de Sherbrooke, Sherbrooke, QC, Canada
  - Centre SÈVE, Université de Sherbrooke, Sherbrooke, QC, Canada
- Isabelle Laforest-Lapointe
  - Département de Biologie, Université de Sherbrooke, Sherbrooke, QC, Canada
  - Centre SÈVE, Université de Sherbrooke, Sherbrooke, QC, Canada
  - Centre d'Étude de la Forêt, Université du Québec à Montréal, Montréal, QC, Canada
|
22
|
Zhao J, Both JP, Rodriguez-R LM, Konstantinidis KT. GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res 2024; 52:e74. [PMID: 39011878 PMCID: PMC11381346 DOI: 10.1093/nar/gkae609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/15/2023] [Revised: 06/20/2024] [Accepted: 06/27/2024] [Indexed: 07/17/2024] Open
Abstract
Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
Affiliation(s)
- Jianshu Zhao
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
| | | | - Luis M Rodriguez-R
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Department of Microbiology, University of Innsbruck, Innsbruck, Austria
- Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria
| | - Konstantinos T Konstantinidis
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| |
|
23
|
Junier P, Cailleau G, Fatton M, Udriet P, Hashmi I, Bregnard D, Corona-Ramirez A, Francesco ED, Kuhn T, Mangia N, Zhioua S, Hunkeler D, Bindschedler S, Sieber S, Gonzalez D. A cohesive Microcoleus strain cluster causes benthic cyanotoxic blooms in rivers worldwide. WATER RESEARCH X 2024; 24:100252. [PMID: 39308956 PMCID: PMC11416633 DOI: 10.1016/j.wroa.2024.100252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/23/2024] [Revised: 08/21/2024] [Accepted: 08/27/2024] [Indexed: 09/25/2024]
Abstract
Over the last two decades, proliferations of benthic cyanobacteria producing derivatives of anatoxin-a have been reported in rivers worldwide. Here, we follow up on such a toxigenic event happening in the Areuse river in Switzerland and investigate the diversity and genomics of major bloom-forming riverine benthic cyanobacteria. We show, using 16S rRNA-based community profiling, that benthic communities are dominated by Oscillatoriales. We correlate the detection of one Microcoleus sequence variant matching the Microcoleus anatoxicus species with the presence of anatoxin-a derivatives and use long-read metagenomics to assemble complete circular genomes of the strain. The main dihydro-anatoxin-a-producing strain in the Areuse is distinct from strains isolated in New Zealand, the USA, and Canada, but forms a monophyletic strain cluster with them with average nucleotide identity values close to the species threshold. Compared to the rest of the Microcoleus genus, the toxin-producing strains encode a 15 % smaller genome, lacking genes for the synthesis of some essential vitamins. Toxigenic mats harbor a distinct microbiome dominated by proteobacteria and bacteroidetes, which may support cyanobacterial growth by providing them with essential nutrients. We recommend that strains closely related to M. anatoxicus be monitored internationally in order to help predict and mitigate similar cyanotoxic events.
Affiliation(s)
- Pilar Junier
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | | | - Mathilda Fatton
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Pauline Udriet
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Isha Hashmi
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Danae Bregnard
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | | | - Eva di Francesco
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Thierry Kuhn
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Naïma Mangia
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Sami Zhioua
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| | - Daniel Hunkeler
- Centre for Hydrogeology and Geothermics, University of Neuchâtel, Switzerland
| | | | - Simon Sieber
- Department of Chemistry, University of Zürich, Switzerland
| | - Diego Gonzalez
- Laboratory of Microbiology, University of Neuchâtel, Switzerland
| |
|
24
|
Zhang XB, Oualline G, Shaw J, Yu YW. skandiver: a divergence-based analysis tool for identifying intercellular mobile genetic elements. Bioinformatics 2024; 40:ii155-ii164. [PMID: 39230688 PMCID: PMC11373320 DOI: 10.1093/bioinformatics/btae398] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/05/2024] Open
Abstract
Motivation: Mobile genetic elements (MGEs) are as ubiquitous in nature as they are varied in type, ranging from viral insertions to transposons to incorporated plasmids. Horizontal transfer of MGEs across bacterial species may also pose a significant threat to global health due to their capability to harbor antibiotic resistance genes. However, despite cheap and rapid whole-genome sequencing, the varied nature of MGEs makes it difficult to fully characterize them, and existing methods for detecting MGEs often do not agree on what should count. In this manuscript, we first define and argue in favor of a divergence-based characterization of mobile genetic elements. Results: Using that paradigm, we present skandiver, a tool designed to efficiently detect MGEs from whole-genome assemblies without the need for gene annotation or markers. skandiver determines mobile elements via genome fragmentation, average nucleotide identity (ANI), and divergence time. By building on the scalable skani software for ANI computation, skandiver can query hundreds of complete assemblies against >65 000 representative genomes in a few minutes and 19 GB memory, providing a scalable and efficient method for elucidating mobile element profiles in incomplete, uncharacterized genomic sequences. For isolated and integrated large plasmids (>10 kb), skandiver's recall was 48% and 47%, MobileElementFinder was 59% and 17%, and geNomad was 86% and 32%, respectively. For isolated large plasmids, skandiver's recall (48%) is lower than state-of-the-art reference-based methods geNomad (86%) and MobileElementFinder (59%). However, skandiver achieves higher recall on integrated plasmids and, unlike other methods, without comparing against a curated database, making skandiver suitable for discovery of novel MGEs. AVAILABILITY AND IMPLEMENTATION https://github.com/YoukaiFromAccounting/skandiver.
Affiliation(s)
- Xiaolei Brian Zhang
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, United States
| | - Grace Oualline
- Department of Biological Sciences, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, United States
| | - Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, ON M5S2E4, Canada
| | - Yun William Yu
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA 15213, United States
- Department of Mathematics, University of Toronto, Toronto, ON M5S2E4, Canada
- Department of Computer and Mathematical Sciences, University of Toronto at Scarborough, Toronto, ON M1C1A4, Canada
| |
|
25
|
Hsouna J, Gritli T, Ilahi H, Han JC, Ellouze W, Zhang XX, Mansouri M, Rahi P, El Idrissi MM, Lamrabet M, Courty PE, Wipf D, Bekki A, Tambong JT, Mnasri B. Rhizobium aouanii sp. nov., efficient nodulating rhizobia isolated from Acacia saligna roots in Tunisia. Int J Syst Evol Microbiol 2024; 74. [PMID: 39235833 PMCID: PMC11376454 DOI: 10.1099/ijsem.0.006515] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 09/06/2024] Open
Abstract
Three bacterial strains, 1AS14IT, 1AS12I and 6AS6, isolated from root nodules of Acacia saligna, were characterized using a polyphasic approach. Phylogenetic analysis based on rrs sequences placed all three strains within the Rhizobium leguminosarum complex. Further phylogeny, based on 1 756 bp sequences of four concatenated housekeeping genes (recA, atpD, glnII and gyrB), revealed their distinction from known rhizobia species of the R. leguminosarum complex (Rlc), forming a distinct clade. The closest related species was identified as Rhizobium laguerreae, with a sequence identity of 96.4% based on concatenated recA-atpD-glnII-gyrB sequences. The type strain, 1AS14IT, showed average nucleotide identity (ANI) values of 94.9, 94.3 and 94.1% and DNA-DNA hybridization values of 56.1, 57.4 and 60.0% with the type strains of closest known species: R. laguerreae, Rhizobium acaciae and 'Rhizobium indicum', respectively. Phylogenomic analyses using 81 up-to-date bacteria core genes and the Type (Strain) Genome Server pipeline further supported the uniqueness of strains 1AS14IT, 1AS12I and 6AS6. The relatedness of the novel strains to NCBI unclassified Rhizobium sp. (396 genomes) and metagenome-derived genomes showed ANI values from 76.7 to 94.8% with a species-level cut-off of 96%, suggesting that strains 1AS14I, 1AS12I and 6AS6 are a distinct lineage. Additionally, differentiation of strains 1AS14IT, 1AS12I and 6AS6 from their closest phylogenetic neighbours was achieved using phenotypic, physiological and fatty acid content analyses. Based on the genomic, phenotypic and biochemical data, we propose the establishment of a novel rhizobial species, Rhizobium aouanii sp. nov., with strain 1AS14IT designated as the type strain (=DSM 113914T=LMG 33206T). This study contributes to the understanding of microbial diversity in nitrogen-fixing symbioses, specifically within Acacia saligna ecosystems in Tunisia.
Affiliation(s)
- Jihed Hsouna
- Laboratory of Legumes and Sustainable Agrosystems, Centre of Biotechnology of Borj-Cédria, BP 901 Hammam-lif 2050, Tunisia
- University of Carthage, Faculty of Sciences of Bizerte, Tunis, Tunisia
| | - Takwa Gritli
- Laboratory of Legumes and Sustainable Agrosystems, Centre of Biotechnology of Borj-Cédria, BP 901 Hammam-lif 2050, Tunisia
| | - Houda Ilahi
- Laboratory of Legumes and Sustainable Agrosystems, Centre of Biotechnology of Borj-Cédria, BP 901 Hammam-lif 2050, Tunisia
| | - Jia-Cheng Han
- Agricultural Cultural Collection of China, Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100080, PR China
| | - Walid Ellouze
- Agriculture and Agri-Food Canada, 4902 Victoria Avenue North, Vineland Station, Ontario, L0R 2E0, Canada
| | - Xiao Xia Zhang
- Agricultural Cultural Collection of China, Institute of Agricultural Resources and Regional Planning, Chinese Academy of Agricultural Sciences, Beijing 100080, PR China
| | - Maroua Mansouri
- Laboratory of Legumes and Sustainable Agrosystems, Centre of Biotechnology of Borj-Cédria, BP 901 Hammam-lif 2050, Tunisia
| | - Praveen Rahi
- Institut Pasteur, Université Paris Cité, Biological Resource Center of Institut Pasteur (CRBIP), Paris, France
| | - Mustapha Missbah El Idrissi
- Faculty of Sciences, Centre de Biotechnologies Végétale et Microbienne, Biodiversité et Environnement, Mohammed V University in Rabat, Rabat, Morocco
| | - Mouad Lamrabet
- Faculty of Sciences, Centre de Biotechnologies Végétale et Microbienne, Biodiversité et Environnement, Mohammed V University in Rabat, Rabat, Morocco
| | - Pierre Emmanuel Courty
- Agroécologie, Institut Agro Dijon, CNRS, Univ. Bourgogne, INRAE, Univ. Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Daniel Wipf
- Agroécologie, Institut Agro Dijon, CNRS, Univ. Bourgogne, INRAE, Univ. Bourgogne Franche-Comté, F-21000 Dijon, France
| | - Abdelkader Bekki
- Biotechnology of Rhizobia and Plant Breeding Laboratory, Department of Biotechnology, Faculty of Sciences, University of Oran1, Sénia, Algeria
| | - James T Tambong
- Agriculture and Agri-Food Canada, 960 Carling Avenue, Ottawa, Ontario, K1A 0C6, Canada
| | - Bacem Mnasri
- Laboratory of Legumes and Sustainable Agrosystems, Centre of Biotechnology of Borj-Cédria, BP 901 Hammam-lif 2050, Tunisia
| |
|
26
|
Chen X, Yin X, Shi X, Yan W, Yang Y, Liu L, Zhang T. Melon: metagenomic long-read-based taxonomic identification and quantification using marker genes. Genome Biol 2024; 25:226. [PMID: 39160564 PMCID: PMC11331721 DOI: 10.1186/s13059-024-03363-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/07/2023] [Accepted: 07/30/2024] [Indexed: 08/21/2024] Open
Abstract
Long-read sequencing holds great potential for characterizing complex microbial communities, yet taxonomic profiling tools designed specifically for long reads remain lacking. We introduce Melon, a novel marker-based taxonomic profiler that capitalizes on the unique attributes of long reads. Melon employs a two-stage classification scheme to reduce computational time and is equipped with an expectation-maximization-based post-correction module to handle ambiguous reads. Melon achieves superior performance compared to existing tools in both mock and simulated samples. Using wastewater metagenomic samples, we demonstrate the applicability of Melon by showing it provides reliable estimates of overall genome copies, and species-level taxonomic profiles.
Affiliation(s)
- Xi Chen
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Xiaole Yin
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Xianghui Shi
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Weifu Yan
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Yu Yang
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Lei Liu
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China
| | - Tong Zhang
- Environmental Microbiome Engineering and Biotechnology Lab, Department of Civil Engineering, The University of Hong Kong, Pokfulam Road, Hong Kong, China.
| |
|
27
|
Shaw J, Yu YW. Fairy: fast approximate coverage for multi-sample metagenomic binning. MICROBIOME 2024; 12:151. [PMID: 39143609 PMCID: PMC11323348 DOI: 10.1186/s40168-024-01861-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 06/20/2024] [Indexed: 08/16/2024]
Abstract
BACKGROUND Metagenomic binning, the clustering of assembled contigs that belong to the same genome, is a crucial step for recovering metagenome-assembled genomes (MAGs). Contigs are linked by exploiting consistent signatures along a genome, such as read coverage patterns. Using coverage from multiple samples leads to higher-quality MAGs; however, standard pipelines require all-to-all read alignments for multiple samples to compute coverage, becoming a key computational bottleneck. RESULTS We present fairy (https://github.com/bluenote-1577/fairy), an approximate coverage calculation method for metagenomic binning. Fairy is a fast k-mer-based alignment-free method. For multi-sample binning, fairy can be >250× faster than read alignment and accurate enough for binning. Fairy is compatible with several existing binners on host and non-host-associated datasets. Using MetaBAT2, fairy recovers 98.5% of MAGs with >50% completeness and <5% contamination relative to alignment with BWA. Notably, multi-sample binning with fairy is always better than single-sample binning using BWA (>1.5× more >50% complete MAGs on average) while still being faster. For a public sediment metagenome project, we demonstrate that multi-sample binning recovers higher quality Asgard archaea MAGs than single-sample binning and that fairy's results are indistinguishable from read alignment. CONCLUSIONS Fairy is a new tool for approximately and quickly calculating multi-sample coverage for binning, resolving a computational bottleneck for metagenomics. Video Abstract.
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Canada.
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Canada.
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, USA.
| |
|
28
|
Motro Y, Temper V, Strahilevitz J, Moran-Gilad J. Invasive infections caused by the recently described species Enterococcus innesii. Eur J Clin Microbiol Infect Dis 2024; 43:1645-1650. [PMID: 38811483 DOI: 10.1007/s10096-024-04864-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/10/2024] [Accepted: 05/24/2024] [Indexed: 05/31/2024]
Abstract
E. innesii is a recently described Enterococcus species which may be difficult to differentiate from the more common E. casseliflavus. We present the first clinical report of invasive E. innesii infection, featuring two cases of biliary sepsis. Whole genome sequencing confirmed the taxonomic assignment and the presence of vanC-4. Analysis of public genomes identified 13 deposited E. innesii and 13 deposited E. casseliflavus/E. gallinarum genomes which could be reassigned as E. innesii. Improved laboratory diagnosis of E. innesii is expected to generate additional data concerning its clinical relevance and support the future diagnosis and treatment of this uncommon pathogen.
Affiliation(s)
- Yair Motro
- Department of Health Policy and Management, School of Public Health, Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel
| | - Violeta Temper
- Department of Clinical Microbiology and Infectious Diseases, Hadassah-Hebrew University, Jerusalem, Israel
| | - Jacob Strahilevitz
- Department of Clinical Microbiology and Infectious Diseases, Hadassah-Hebrew University, Jerusalem, Israel
| | - Jacob Moran-Gilad
- Department of Health Policy and Management, School of Public Health, Faculty of Health Sciences, Ben Gurion University of the Negev, Beer Sheva, Israel.
- Department of Clinical Microbiology and Infectious Diseases, Hadassah-Hebrew University, Jerusalem, Israel.
| |
|
29
|
Ramanan V, Sarkar IN. Augmenting bacterial similarity measures using a graph-based genome representation. mSystems 2024; 9:e0049724. [PMID: 38940518 PMCID: PMC11265277 DOI: 10.1128/msystems.00497-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2024] [Accepted: 06/05/2024] [Indexed: 06/29/2024] Open
Abstract
Relationships between bacterial taxa are traditionally defined using 16S rRNA nucleotide similarity or average nucleotide identity. Improvements in sequencing technology provide additional pairwise information on genome sequences, which may provide valuable information on genomic relationships. Mapping orthologous gene locations between genome pairs, known as synteny, is typically implemented in the discovery of new species and has not been systematically applied to bacterial genomes. Using a data set of 378 bacterial genomes, we developed and tested a new measure of synteny similarity between a pair of genomes, which was scaled onto 16S rRNA distance using covariance matrices. Based on the input gene functions used (i.e., core, antibiotic resistance, and virulence), we observed varying topological arrangements of bacterial relationship networks by applying (i) complete linkage hierarchical clustering and (ii) K-nearest neighbor graph structures to synteny-scaled 16S data. Our metric improved clustering quality compared to state-of-the-art average nucleotide identity metrics while preserving clustering assignments for the highest similarity relationships. Our findings indicate that syntenic relationships provide more granular and interpretable relationships for within-genera taxa compared to pairwise similarity measures, particularly in functional contexts. IMPORTANCE Given the prevalence and necessity of the 16S rRNA measure in bacterial identification and analysis, this additional analysis adds a functional and synteny-based layer to the identification of relatives and clustering of bacteria genomes. It is also of computational interest to model the bacterial genome as a graph structure, which presents new avenues of genomic analysis for bacteria and their closely related strains and species.
Affiliation(s)
- Vivek Ramanan
- Center of Computational Molecular Biology, Brown University, Providence, Rhode Island, USA
- Center for Biomedical Informatics, Brown University, Providence, Rhode Island, USA
| | - Indra Neil Sarkar
- Center of Computational Molecular Biology, Brown University, Providence, Rhode Island, USA
- Center for Biomedical Informatics, Brown University, Providence, Rhode Island, USA
- Rhode Island Quality Institute, Providence, Rhode Island, USA
| |
|
30
|
Xu W, Hsu PK, Moshiri N, Yu S, Rosing T. HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors. Bioinformatics 2024; 40:btae452. [PMID: 39012512 PMCID: PMC11281827 DOI: 10.1093/bioinformatics/btae452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Revised: 07/09/2024] [Accepted: 07/12/2024] [Indexed: 07/17/2024] Open
Abstract
MOTIVATION Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen that improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables the efficient sketching and ANI estimation for massive genome collections. RESULTS We evaluate HyperGen's sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy. AVAILABILITY A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.
Affiliation(s)
- Weihong Xu
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
| | - Po-Kai Hsu
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
| | - Niema Moshiri
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
| | - Shimeng Yu
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
| | - Tajana Rosing
- Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
| |
|
31
|
Shaw J, Gounot JS, Chen H, Nagarajan N, Yu YW. Floria: fast and accurate strain haplotyping in metagenomes. Bioinformatics 2024; 40:i30-i38. [PMID: 38940183 PMCID: PMC11211831 DOI: 10.1093/bioinformatics/btae252] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open
Abstract
SUMMARY Shotgun metagenomics allows for direct analysis of microbial community genetics, but scalable computational methods for the recovery of bacterial strain genomes from microbiomes remain a key challenge. We introduce Floria, a novel method designed for rapid and accurate recovery of strain haplotypes from short and long-read metagenome sequencing data, based on minimum error correction (MEC) read clustering and a strain-preserving network flow model. Floria can function as a standalone haplotyping method, outputting alleles and reads that co-occur on the same strain, as well as an end-to-end read-to-assembly pipeline (Floria-PL) for strain-level assembly. Benchmarking evaluations on synthetic metagenomes show that Floria is >3× faster and recovers 21% more strain content than base-level assembly methods (Strainberry) while being over an order of magnitude faster when only phasing is required. Applying Floria to a set of 109 deeply sequenced nanopore metagenomes took <20 min on average per sample and identified several species that have consistent strain heterogeneity. Applying Floria's short-read haplotyping to a longitudinal gut metagenomics dataset revealed a dynamic multi-strain Anaerostipes hadrus community with frequent strain loss and emergence events over 636 days. With Floria, accurate haplotyping of metagenomic datasets takes mere minutes on standard workstations, paving the way for extensive strain-level metagenomic analyses. AVAILABILITY AND IMPLEMENTATION Floria is available at https://github.com/bluenote-1577/floria, and the Floria-PL pipeline is available at https://github.com/jsgounot/Floria_analysis_workflow along with code for reproducing the benchmarks.
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada
| | - Jean-Sebastien Gounot
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
| | - Hanrong Chen
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
| | - Niranjan Nagarajan
- Genome Institute of Singapore (GIS), Agency for Science, Technology and Research (A*STAR), 60 Biopolis Street, Singapore, 138672, Republic of Singapore
- Yong Loo Lin School of Medicine, National University of Singapore, Singapore, 117597, Republic of Singapore
| | - Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, Ontario, M5S 2E4, Canada
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA, 15213, United States
| |
|
32
|
Hera MR, Koslicki D. Cosine Similarity Estimation Using FracMinHash: Theoretical Analysis, Safety Conditions, and Implementation. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.24.595805. [PMID: 38854044 PMCID: PMC11160586 DOI: 10.1101/2024.05.24.595805] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2024]
Abstract
Motivation The increasing number and volume of genomic and metagenomic data necessitates scalable and robust computational models for precise analysis. Sketching techniques utilizing k-mers from a biological sample have proven to be useful for large-scale analyses. In recent years, FracMinHash has emerged as a popular sketching technique and has been used in several useful applications. Recent studies on FracMinHash proved unbiased estimators for the containment and Jaccard indices. However, theoretical investigations for other metrics, such as the cosine similarity, are still lacking. Theoretical contributions In this paper, we present a theoretical framework for estimating cosine similarity from FracMinHash sketches. We establish conditions under which this estimation is sound, and recommend a minimum scale factor s for accurate results. Experimental evidence supports our theoretical findings. Practical contributions We also present frac-kmc, a fast and efficient FracMinHash sketch generator program. frac-kmc is the fastest known FracMinHash sketch generator, delivering accurate and precise results for cosine similarity estimation on real data. We show that by computing FracMinHash sketches using frac-kmc, we can estimate pairwise cosine similarity speedily and accurately on real data. frac-kmc is freely available here: https://github.com/KoslickiLab/frac-kmc/.
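As a rough illustration of the idea summarized in this abstract, the following Python sketch builds an abundance-weighted FracMinHash sketch and computes the cosine similarity of two sketches as a plug-in estimate. It is a minimal sketch under stated assumptions, not the frac-kmc implementation: the function names (`kmer_hash`, `frac_minhash_sketch`, `cosine_from_sketches`), the choice of BLAKE2b as the hash, and the omission of canonical k-mers are all illustrative choices that do not come from the abstract.

```python
import hashlib
import math
from collections import Counter

MAX_HASH = 2 ** 64  # size of the hash space for the 64-bit hash below


def kmer_hash(kmer: str) -> int:
    """Map a k-mer to an integer in [0, 2**64). BLAKE2b is an arbitrary choice here."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")


def frac_minhash_sketch(seq: str, k: int, scale: float) -> Counter:
    """Keep every k-mer whose hash falls below scale * MAX_HASH, with its abundance.

    `scale` plays the role of the scale factor s in (0, 1]; about a fraction s
    of the distinct k-mers is expected to be retained.
    """
    threshold = scale * MAX_HASH
    sketch = Counter()
    for i in range(len(seq) - k + 1):
        h = kmer_hash(seq[i:i + k])
        if h < threshold:
            sketch[h] += 1
    return sketch


def cosine_from_sketches(a, b) -> float:
    """Cosine similarity of two abundance sketches (hash -> count mappings)."""
    dot = sum(count * b.get(h, 0) for h, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty sketch carries no information
    return dot / (norm_a * norm_b)
```

Because both samples are filtered with the same hash function and threshold, a k-mer is either kept in both sketches or dropped from both, which is why sketches can be compared directly. With a very small scale factor the sketches may be empty or nearly so, which is one intuition for why a minimum scale factor is recommended.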
Affiliation(s)
- Mahmudur Rahman Hera
- School of Electrical Engineering and Computer Science, Pennsylvania State University, USA
| | - David Koslicki
- School of Electrical Engineering and Computer Science, Pennsylvania State University, USA
- Huck Institutes of the Life Sciences, Pennsylvania State University, USA
- Department of Biology, Pennsylvania State University, USA
| |
Collapse
|
33
|
Mussig AJ, Chaumeil PA, Chuvochina M, Rinke C, Parks DH, Hugenholtz P. Putative genome contamination has minimal impact on the GTDB taxonomy. Microb Genom 2024; 10:001256. [PMID: 38809778 PMCID: PMC11261887 DOI: 10.1099/mgen.0.001256] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/15/2024] [Accepted: 05/10/2024] [Indexed: 05/31/2024]
Abstract
The Genome Taxonomy Database (GTDB) provides a species to domain classification of publicly available genomes based on average nucleotide identity (ANI) (for species) and a concatenated gene phylogeny normalized by evolutionary rates (for genus to phylum), which has been widely adopted by the scientific community. Here, we use the Genome UNClutterer (GUNC) software to identify putatively contaminated genomes in GTDB release 07-RS207. We found that GUNC reported 35,723 genomes as putatively contaminated, comprising 11.25% of the 317,542 genomes in GTDB release 07-RS207. To assess the impact of this high level of inferred contamination on the delineation of taxa, we created 'clean' versions of the 34,846 putatively contaminated bacterial genomes by removing the most contaminated half. For each clean half, we re-calculated the ANI and concatenated gene phylogeny and found that only 77 (0.22%) of the genomes were not consistent with their original classification. We conclude that the delineation of taxa in GTDB is robust to the putative contamination detected by GUNC.
Affiliation(s)
- Aaron J. Mussig
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia
- Pierre-Alain Chaumeil
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia
- Maria Chuvochina
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia
- Christian Rinke
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia
- Donovan H. Parks
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia
- Philip Hugenholtz
  - The University of Queensland, School of Chemistry and Molecular Biosciences, Australian Centre for Ecogenomics, St Lucia, QLD, Australia
34
Yu YW. On Minimizers and Convolutional Filters: Theoretical Connections and Applications to Genome Analysis. J Comput Biol 2024; 31:381-395. [PMID: 38687333 DOI: 10.1089/cmb.2024.0483] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/02/2024]
Abstract
Minimizers and convolutional neural networks (CNNs) are two quite distinct popular techniques that have both been employed to analyze categorical biological sequences. At face value, the methods seem entirely dissimilar. Minimizers use min-wise hashing on a rolling window to extract a single important k-mer feature per window. CNNs start with a wide array of randomly initialized convolutional filters, paired with a pooling operation, and then multiple additional neural layers to learn both the filters themselves and how they can be used to classify the sequence. In this study, our main result is a careful mathematical analysis of hash function properties showing that for sequences over a categorical alphabet, random Gaussian initialization of convolutional filters with max-pooling is equivalent to choosing a minimizer ordering such that selected k-mers are (in Hamming distance) far from the k-mers within the sequence but close to other minimizers. In empirical experiments, we find that this property manifests as decreased density in repetitive regions, both in simulation and on real human telomeres. We additionally train from scratch a CNN embedding of synthetic short-reads from the SARS-CoV-2 genome into 3D Euclidean space that locally recapitulates the linear sequence distance of the read origins, a modest step toward building a deep learning assembler, although it is at present too slow to be practical. In total, this article provides a partial explanation for the effectiveness of CNNs in categorical sequence analysis.
Affiliation(s)
- Yun William Yu
  - Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
  - Department of Ray and Stephanie Lane Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
35
Zheng A, Shaw J, Yu YW. Mora: abundance aware metagenomic read re-assignment for disentangling similar strains. BMC Bioinformatics 2024; 25:161. [PMID: 38649836 PMCID: PMC11035124 DOI: 10.1186/s12859-024-05768-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Accepted: 04/05/2024] [Indexed: 04/25/2024]
Abstract
BACKGROUND: Taxonomic classification of reads obtained by metagenomic sequencing is often a first step for understanding a microbial community, but correctly assigning sequencing reads to the strain or sub-species level has remained a challenging computational problem. RESULTS: We introduce Mora, a MetagenOmic read Re-Assignment algorithm capable of assigning short and long metagenomic reads with high precision, even at the strain level. Mora is able to accurately re-assign reads by first estimating abundances through an expectation-maximization algorithm and then utilizing abundance information to re-assign query reads. The key idea behind Mora is to maximize read re-assignment qualities while simultaneously minimizing the difference from estimated abundance levels, allowing Mora to avoid over-assigning reads to the same genomes. On simulated diverse reads, this allows Mora to achieve F1 scores comparable to other algorithms with lower runtime. However, Mora significantly outshines other algorithms on very similar reads. We show that the high penalty of over-assigning reads to a common reference genome allows Mora to accurately infer correct strains for real data in the form of E. coli reads. CONCLUSIONS: Mora is a fast and accurate read re-assignment algorithm that is modularized, allowing it to be incorporated into general metagenomics and genomics workflows. It is freely available at https://github.com/AfZheng126/MORA.
Affiliation(s)
- Andrew Zheng
  - Mathematics, University of Toronto, 27 King's College Circle, Toronto, Ontario, M3R 0A3, Canada
- Jim Shaw
  - Mathematics, University of Toronto, 27 King's College Circle, Toronto, Ontario, M3R 0A3, Canada
- Yun William Yu
  - Mathematics, University of Toronto, 27 King's College Circle, Toronto, Ontario, M3R 0A3, Canada
  - Computer and Mathematical Sciences, University of Toronto at Scarborough, 1265 Military Trail, Toronto, Ontario, M1C 1A4, Canada
  - Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, Pennsylvania, 15213, USA
36
Ridley RS, Conrad RE, Lindner BG, Woo S, Konstantinidis KT. Potential routes of plastics biotransformation involving novel plastizymes revealed by global multi-omic analysis of plastic associated microbes. Sci Rep 2024; 14:8798. [PMID: 38627476 PMCID: PMC11021508 DOI: 10.1038/s41598-024-59279-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2023] [Accepted: 04/09/2024] [Indexed: 04/19/2024]
Abstract
Despite increasing efforts across various disciplines, the fate, transport, and impact of synthetic plastics on the environment and public health remain poorly understood. To better elucidate the microbial ecology of plastic waste and its potential for biotransformation, we conducted a large-scale analysis of all publicly available meta-omic studies investigating plastics (n = 27) in the environment. Notably, we observed low prevalence of known plastic degraders throughout most environments, except for substantial enrichment in riverine systems. This indicates rivers may be a highly promising environment for discovery of novel plastic bioremediation products. Ocean samples associated with degrading plastics showed clear differentiation from non-degrading polymers, showing enrichment of novel putative biodegrading taxa in the degraded samples. Regarding plastisphere pathogenicity, we observed significant enrichment of antimicrobial resistance genes on plastics but not of virulence factors. Additionally, we report a co-occurrence network analysis of 10+ million proteins associated with the plastisphere. This analysis revealed a localized sub-region enriched with known and putative plastizymes; these may be useful for deeper investigation of nature's ability to biodegrade man-made plastics. Finally, the combined data from our meta-analysis was used to construct a publicly available database, the Plastics Meta-omic Database (PMDB), accessible at plasticmdb.org. These data should aid in the integrated exploration of the microbial plastisphere and facilitate research efforts investigating the fate and bioremediation potential of environmental plastic waste.
Affiliation(s)
- Rodney S Ridley
  - School of Chemical and Biomolecular Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Roth E Conrad
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, 30332, USA
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Blake G Lindner
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Seongwook Woo
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, 30332, USA
- Konstantinos T Konstantinidis
  - School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, 30332, USA
  - School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, 30332, USA
37
Thomy J, Sanchez F, Prioux C, Yau S, Xu Y, Mak J, Sun R, Piganeau G, Yung CCM. Unveiling Prasinovirus diversity and host specificity through targeted enrichment in the South China Sea. ISME Communications 2024; 4:ycae109. [PMID: 39296779 PMCID: PMC11408933 DOI: 10.1093/ismeco/ycae109] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/04/2024] [Revised: 07/16/2024] [Accepted: 08/27/2024] [Indexed: 09/21/2024]
Abstract
Unicellular green picophytoplankton from the Mamiellales order are pervasive in marine ecosystems and susceptible to infections by prasinoviruses, large double-stranded DNA viruses within the Nucleocytoviricota phylum. We developed a double-stranded DNA virus enrichment and shotgun sequencing method, and successfully assembled 80 prasinovirus genomes from 43 samples in the South China Sea. Our research delivered the first direct estimation of 94% accuracy in correlating genome similarity to host range. Strikingly, our analyses uncovered unexpected host-switching across diverse algal lineages, challenging the existing paradigms of host-virus co-speciation and revealing the dynamic nature of viral evolution. We also detected six instances of horizontal gene transfer between prasinoviruses and their hosts, including a novel alternative oxidase. Additionally, diversifying selection on a major capsid protein suggests an ongoing co-evolutionary arms race. These insights not only expand our understanding of prasinovirus genomic diversity but also highlight the intricate evolutionary mechanisms driving their ecological success and shaping broader virus-host interactions in marine environments.
Affiliation(s)
- Julie Thomy
  - Sorbonne Université, CNRS, Laboratoire de Biodiversité et Biotechnologies Microbiennes (LBBM), Observatoire Océanologique, F-66650 Banyuls/Mer, France
  - Department of Oceanography, School of Ocean and Earth Science and Technology (SOEST), University of Hawai'i at Mānoa, Honolulu, HI 96822, United States
- Frederic Sanchez
  - Sorbonne Université, CNRS, Biologie Intégrative des Organismes Marins (BIOM), UMR 7232, Observatoire Océanologique, F-66650 Banyuls/Mer, France
- Camille Prioux
  - Sorbonne Université, CNRS, Laboratoire de Biodiversité et Biotechnologies Microbiennes (LBBM), Observatoire Océanologique, F-66650 Banyuls/Mer, France
  - Centre Scientifique de Monaco, 8 Quai Antoine 1er, Monaco, MC 98000, Principality of Monaco
- Sheree Yau
  - Sorbonne Université, CNRS, Laboratoire de Biodiversité et Biotechnologies Microbiennes (LBBM), Observatoire Océanologique, F-66650 Banyuls/Mer, France
- Yangbing Xu
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Julian Mak
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Ruixian Sun
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Gwenael Piganeau
  - Sorbonne Université, CNRS, Laboratoire de Biodiversité et Biotechnologies Microbiennes (LBBM), Observatoire Océanologique, F-66650 Banyuls/Mer, France
  - Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), The Hong Kong University of Science and Technology, Hong Kong SAR, China
- Charmaine C M Yung
  - Department of Ocean Science, The Hong Kong University of Science and Technology, Hong Kong SAR, China
  - Southern Marine Science and Engineering Guangdong Laboratory (Guangzhou), The Hong Kong University of Science and Technology, Hong Kong SAR, China