1
Tomasella F, Pizzi C. MetaComBin: combining abundances and overlaps for binning metagenomics reads. Frontiers in Bioinformatics 2025; 5:1504728. [PMID: 40099113 PMCID: PMC11912761 DOI: 10.3389/fbinf.2025.1504728] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/01/2024] [Accepted: 01/27/2025] [Indexed: 03/19/2025]
Abstract
Introduction: Metagenomics is the discipline that studies heterogeneous microbial samples extracted directly from their natural environment, for example, from soil, water, or the human body. The detection and quantification of species that populate microbial communities have been the subject of many recent studies based on classification and clustering, motivated by being the first step in more complex pipelines (e.g., for functional analysis, de novo assembly, or comparison of metagenomes). Metagenomics has an impact on both environmental studies and precision medicine; thus, it is crucial to improve the quality of species identification through computational tools.

Methods: In this paper, we explore the idea of improving the overall quality of metagenomics binning at the read level by proposing a computational framework that sequentially combines two complementary read-binning approaches: one based on species abundance determination and another one relying on read overlap in order to cluster reads together. We called this approach MetaComBin (metagenomics combined binning).

Results and Discussion: The results of our experiments with the MetaComBin approach showed that the combination of two tools, based on different approaches, can improve the clustering quality in realistic conditions where the number of species is not known beforehand.
Affiliation(s)
- Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padua, Italy
2
Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol 2024; 25:270. [PMID: 39402664 PMCID: PMC11472564 DOI: 10.1186/s13059-024-03414-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 10/01/2024] [Indexed: 10/19/2024]
Abstract
The exponential increase in sequencing data calls for conceptual and computational advances to extract useful biological insights. One such advance, minimizers, allows for reducing the quantity of data handled while maintaining some of its key properties. We provide a basic introduction to minimizers, cover recent methodological developments, and review the diverse applications of minimizers to analyze genomic data, including de novo genome assembly, metagenomics, read alignment, read correction, and pangenomes. We also touch on alternative data sketching techniques including universal hitting sets, syncmers, or strobemers. Minimizers and their alternatives have rapidly become indispensable tools for handling vast amounts of data.
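As a rough illustration of the minimizer idea summarized in this abstract, the sketch below picks, for every window of w consecutive k-mers, the smallest k-mer in that window. The function name `minimizers`, the parameters `k` and `w`, and the use of plain lexicographic ordering are assumptions made for this example; practical tools typically order k-mers by a hash and work on canonical k-mers.

```python
def minimizers(seq, k, w):
    """Return sorted (position, k-mer) minimizers of seq.

    A window covers w consecutive k-mers. The minimizer of a window is
    its lexicographically smallest k-mer (leftmost one on ties).
    Lexicographic ordering is an assumption chosen for simplicity.
    """
    kmers = [seq[i:i + k] for i in range(len(seq) - k + 1)]
    selected = set()
    for start in range(len(kmers) - w + 1):
        window = kmers[start:start + w]
        # min() over indices returns the first index with the smallest k-mer
        offset = min(range(w), key=lambda j: window[j])
        selected.add((start + offset, window[offset]))
    return sorted(selected)


if __name__ == "__main__":
    for pos, kmer in minimizers("ACGTACGT", k=3, w=3):
        print(pos, kmer)
```

Because adjacent windows often share the same minimizer, the selected set is usually much smaller than the full list of k-mers, which is the data reduction the abstract refers to.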
Affiliation(s)
- Malick Ndiaye
- Department of Fundamental Microbiology, UNIL, Lausanne, Switzerland
- Silvia Prieto-Baños
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sergey Oreshkov
- Department of Endocrinology, Diabetology, Metabolism, CHUV, Lausanne, Switzerland
- Christophe Dessimoz
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Natasha Glover
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- Sina Majidian
- Department of Computational Biology, UNIL, Lausanne, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
3
Mallawaarachchi V, Wickramarachchi A, Xue H, Papudeshi B, Grigson SR, Bouras G, Prahl RE, Kaphle A, Verich A, Talamantes-Becerra B, Dinsdale EA, Edwards RA. Solving genomic puzzles: computational methods for metagenomic binning. Brief Bioinform 2024; 25:bbae372. [PMID: 39082646 PMCID: PMC11289683 DOI: 10.1093/bib/bbae372] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/18/2024] [Revised: 06/05/2024] [Accepted: 07/15/2024] [Indexed: 08/03/2024]
Abstract
Metagenomics involves the study of genetic material obtained directly from communities of microorganisms living in natural environments. The field of metagenomics has provided valuable insights into the structure, diversity and ecology of microbial communities. Once an environmental sample is sequenced and processed, metagenomic binning clusters the sequences into bins representing different taxonomic groups such as species, genera, or higher levels. Several computational tools have been developed to automate the process of metagenomic binning. These tools have enabled the recovery of novel draft genomes of microorganisms allowing us to study their behaviors and functions within microbial communities. This review classifies and analyzes different approaches of metagenomic binning and different refinement, visualization, and evaluation techniques used by these methods. Furthermore, the review highlights the current challenges and areas of improvement present within the field of research.
Affiliation(s)
- Vijini Mallawaarachchi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Anuradha Wickramarachchi
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Hansheng Xue
- School of Computing, National University of Singapore, Singapore 119077, Singapore
- Bhavya Papudeshi
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Susanna R Grigson
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- George Bouras
- Adelaide Medical School, Faculty of Health and Medical Sciences, The University of Adelaide, Adelaide, SA 5005, Australia
- The Department of Surgery—Otolaryngology Head and Neck Surgery, University of Adelaide and the Basil Hetzel Institute for Translational Health Research, Central Adelaide Local Health Network, Adelaide, SA 5011, Australia
- Rosa E Prahl
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Anubhav Kaphle
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Andrey Verich
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- The Kirby Institute, The University of New South Wales, Randwick, Sydney, NSW 2052, Australia
- Berenice Talamantes-Becerra
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO), Westmead, NSW 2145, Australia
- Elizabeth A Dinsdale
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
- Robert A Edwards
- Flinders Accelerator for Microbiome Exploration, College of Science and Engineering, Flinders University, Adelaide, SA 5042, Australia
4
Rossignolo E, Comin M. Enhanced Compression of k-Mer Sets with Counters via de Bruijn Graphs. J Comput Biol 2024; 31:524-538. [PMID: 38820168 DOI: 10.1089/cmb.2024.0530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/02/2024]
Abstract
An essential task in computational genomics involves transforming input sequences into their constituent k-mers. The quest for an efficient representation of k-mer sets is crucial for enhancing the scalability of bioinformatic analyses. One widely used method involves converting the k-mer set into a de Bruijn graph (dBG), followed by seeking a compact graph representation via the smallest path cover. This study introduces USTAR* (Unitig STitch Advanced constRuction), a tool designed to compress both a set of k-mers and their associated counts. USTAR leverages the connectivity and density of dBGs, enabling a more efficient path selection for constructing the path cover. The efficacy of USTAR is demonstrated through its application in compressing real read data sets. USTAR improves the compression achieved by UST (Unitig STitch), the best algorithm, by percentages ranging from 2.3% to 26.4%, depending on the k-mer size, and it is up to 7× faster.
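The first step mentioned in this abstract, turning input sequences into k-mers with associated counts, can be sketched as follows. This is only an illustration of the k-mer counting input that such compressors start from, not of USTAR itself; the function name `count_kmers` is hypothetical, and the sketch ignores reverse complements and canonical k-mers.

```python
from collections import Counter


def count_kmers(reads, k):
    """Count every k-mer (substring of length k) across a list of reads."""
    counts = Counter()
    for read in reads:
        # A read of length n contains n - k + 1 overlapping k-mers.
        for i in range(len(read) - k + 1):
            counts[read[i:i + k]] += 1
    return counts


if __name__ == "__main__":
    for kmer, count in sorted(count_kmers(["ACGTAC", "CGTA"], k=3).items()):
        print(kmer, count)
```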
Affiliation(s)
- Enrico Rossignolo
- Department of Information Engineering, University of Padua, Padua, Italy
- Matteo Comin
- Department of Information Engineering, University of Padua, Padua, Italy
5
Cavattoni M, Comin M. ClassGraph: Improving Metagenomic Read Classification with Overlap Graphs. J Comput Biol 2023. [PMID: 37023405 DOI: 10.1089/cmb.2022.0208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/08/2023]
Abstract
Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. Most methods that are currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of sensitivity (the actual number of classified reads), the performance is often poor. One reason is that the reads in a sample can be very different from the corresponding reference genomes; for example, viral genomes are usually highly mutated. To address this issue, in this article, we propose ClassGraph, a new taxonomic classification method that makes use of the read overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets with several taxonomic classification tools, and the results showed an improved sensitivity and F-measure, while maintaining high precision. ClassGraph is capable of improving the classification accuracy, especially in difficult cases such as virus and real datasets, where traditional tools can classify <40% of reads.
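The general idea of propagating labels over a read overlap graph can be sketched as below. This is a generic majority-vote propagation written for illustration, not the algorithm implemented in ClassGraph; the function name `propagate_labels`, the dictionary-based graph, and the `max_rounds` parameter are all assumptions.

```python
from collections import Counter


def propagate_labels(overlaps, labels, max_rounds=10):
    """Give unclassified reads the majority label of their overlap neighbours.

    overlaps: dict mapping read id -> list of overlapping read ids
    labels:   dict mapping read id -> taxon label (already classified reads)
    """
    result = dict(labels)
    for _ in range(max_rounds):
        updates = {}
        for read, neighbours in overlaps.items():
            if read in result:
                continue  # keep labels assigned by the upstream classifier
            votes = Counter(result[n] for n in neighbours if n in result)
            if votes:
                # On ties, most_common returns the label seen first.
                updates[read] = votes.most_common(1)[0][0]
        if not updates:
            break  # nothing changed, so further rounds cannot help
        result.update(updates)
    return result


if __name__ == "__main__":
    graph = {"r1": ["r2"], "r2": ["r1", "r3"], "r3": ["r2", "r4"], "r4": ["r3"]}
    print(propagate_labels(graph, {"r1": "taxon_A"}))
```

Reads with no labelled neighbour in any round simply stay unclassified, which keeps the sketch conservative about precision.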
Affiliation(s)
- Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
6
Wickramarachchi A, Lin Y. Binning long reads in metagenomics datasets using composition and coverage information. Algorithms Mol Biol 2022; 17:14. [PMID: 35821155 PMCID: PMC9277797 DOI: 10.1186/s13015-022-00221-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.7] [Received: 12/31/2021] [Accepted: 06/26/2022] [Indexed: 11/21/2022]
Abstract
Background: Advancements in metagenomics sequencing allow the study of microbial communities directly from their environments. Metagenomics binning is a key step in the species characterisation of microbial communities. Next-generation sequencing reads are usually assembled into contigs for metagenomics binning mainly due to the limited information within short reads. Third-generation sequencing provides much longer reads that have lengths similar to the contigs assembled from short reads. However, existing contig-binning tools cannot be directly applied on long reads due to the absence of coverage information and the presence of high error rates. The few existing long-read binning tools either use only composition or use composition and coverage information separately. This may ignore bins that correspond to low-abundance species or erroneously split bins that correspond to species with non-uniform coverages. Here we present a reference-free binning approach, LRBinner, that combines composition and coverage information of complete long-read datasets. LRBinner also uses a distance-histogram-based clustering algorithm to extract clusters with varying sizes.

Results: The experimental results on both simulated and real datasets show that LRBinner achieves the best binning accuracy in most cases while handling the complete datasets without any sampling. Moreover, we show that binning reads using LRBinner prior to assembly reduces computational resources required for assembly while attaining satisfactory assembly qualities.

Conclusion: LRBinner shows that deep-learning techniques can be used for effective feature aggregation to support the metagenomics binning of long reads. Furthermore, accurate binning of long reads supports improvements in metagenomics assembly, especially in complex datasets. Binning also helps to reduce the resources required for assembly. Source code for LRBinner is freely available at https://github.com/anuradhawick/LRBinner.

Supplementary Information: The online version contains supplementary material available at 10.1186/s13015-022-00221-z.
Affiliation(s)
- Yu Lin
- School of Computing, Australian National University, Canberra, Australia