1
Levallois V, Andreace F, Le Gal B, Dufresne Y, Peterlongo P. The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance. iScience 2024; 27:111435. [PMID: 39720533 PMCID: PMC11667073 DOI: 10.1016/j.isci.2024.111435] [Received: 07/23/2024] [Revised: 08/28/2024] [Accepted: 11/18/2024] [Indexed: 12/26/2024] Open
Abstract
Genomic data sequencing is crucial for understanding biological systems. As genomic databases like the European Nucleotide Archive expand exponentially, efficient data manipulation is essential. A key challenge is querying these databases to determine the presence or absence of specific sequences and their abundance within datasets. This paper presents the Backpack Quotient Filter (BQF), a data structure for indexing k-mers (substrings of length k), which offers greater space efficiency than the Counting Quotient Filter (CQF). The BQF maintains essential features such as abundance information and dynamicity, with an extremely low false positive rate of less than 10⁻⁵%. Our method redefines abundance information handling and implements an independent strategy for space efficiency. The BQF uses four times less space than the CQF on complex datasets such as sea-water metagenomics sequences. Additionally, its space efficiency improves with larger datasets, addressing the need for scalable data solutions.
Affiliation(s)
- Victor Levallois
  - University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France
- Francesco Andreace
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
  - Sorbonne Université, Collège doctoral, 75005 Paris, France
- Bertrand Le Gal
  - University Rennes, Inria, CNRS, IRISA - Taran team, ENSSAT, Lannion, France
- Yoann Dufresne
  - Department of Computational Biology, Institut Pasteur, Université Paris Cité, 75015 Paris, France
  - Bioinformatics and Biostatistics Hub, Institut Pasteur, Université de Paris, 75015 Paris, France
- Pierre Peterlongo
  - University Rennes, Inria, CNRS, IRISA - UMR 6074, 35000 Rennes, France
2
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
3
Lemane T, Lezzoche N, Lecubin J, Pelletier E, Lescot M, Chikhi R, Peterlongo P. Indexing and real-time user-friendly queries in terabyte-sized complex genomic datasets with kmindex and ORA. Nature Computational Science 2024; 4:104-109. [PMID: 38413777 DOI: 10.1038/s43588-024-00596-6] [Received: 07/12/2023] [Accepted: 01/16/2024] [Indexed: 02/29/2024]
Abstract
Public sequencing databases contain vast amounts of biological information, yet they are largely underutilized as it is challenging to efficiently search them for any sequence(s) of interest. We present kmindex, an approach that can index thousands of metagenomes and perform sequence searches in a fraction of a second. The index construction is an order of magnitude faster than previous methods, while search times are two orders of magnitude faster. With negligible false positive rates below 0.01%, kmindex outperforms the precision of existing approaches by four orders of magnitude. Here we demonstrate the scalability of kmindex by successfully indexing 1,393 marine seawater metagenome samples from the Tara Oceans project. Additionally, we introduce the publicly accessible web server Ocean Read Atlas, which enables real-time queries on the Tara Oceans dataset.
Affiliation(s)
- Téo Lemane
  - Univ. Rennes, Inria, CNRS, IRISA - UMR 6074, Rennes, France
  - Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
- Nolan Lezzoche
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
- Eric Pelletier
  - Génomique Métabolique, Genoscope, Institut de Biologie François Jacob, CEA, CNRS, Univ. Evry, Université Paris-Saclay, Evry, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
- Magali Lescot
  - Aix-Marseille Université, Université de Toulon, IRD, CNRS, Mediterranean Institute of Oceanography (MIO), UM 110, Marseille, France
  - Research Federation for the Study of Global Ocean Systems Ecology and Evolution, FR2022/Tara Oceans GO-SEE, CNRS, Paris, France
- Rayan Chikhi
  - Institut Pasteur, Université Paris Cité, G5 Sequence Bioinformatics, Paris, France