1
|
Schwab RM, Gottlieb SG, Reinert K. TetRex: a novel algorithm for index-accelerated search of highly conserved motifs. NAR Genom Bioinform 2025; 7:lqaf039. [PMID: 40248489 PMCID: PMC12004226 DOI: 10.1093/nargab/lqaf039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/09/2024] [Revised: 03/13/2025] [Accepted: 03/24/2025] [Indexed: 04/19/2025] Open
Abstract
The scale of modern datasets has necessitated innovations to solve even the most traditional and fundamental of computational problems. Set membership and set cardinality are both examples of simple queries that, for large enough datasets, quickly become challenging using traditional approaches. Interestingly, we find a need for these innovations within the field of biology. Despite the vast differences among living organisms, there exist functions so critical to life that they are conserved in the genomes and proteomes across seemingly unrelated species. Regular expressions (regexes) can serve as a convenient way to represent these conserved sequences compactly. However, despite the strong theoretical foundation and maturity of tools available, the state-of-the-art regex search falls short of what is necessary to quickly scan an enormous database of biological sequences. In this work, we describe a novel algorithm for regex search that reduces the search space by leveraging a recently developed compact data structure for set membership, the hierarchical interleaved bloom filter. We show that the runtime of our method combined with a traditional search outperforms state-of-the-art tools.
Collapse
Affiliation(s)
- Remy M Schwab
- Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63, 14195 Berlin, Germany
- Algorithmic Bioinformatics, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Simon Gene Gottlieb
- Algorithmic Bioinformatics, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany
| | - Knut Reinert
- Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63, 14195 Berlin, Germany
- Algorithmic Bioinformatics, Freie Universität Berlin, Takustrasse 9, 14195 Berlin, Germany
| |
Collapse
|
2
|
Hauswedell H, Hetzel S, Gottlieb SG, Kretzmer H, Meissner A, Reinert K. Lambda3: homology search for protein, nucleotide, and bisulfite-converted sequences. Bioinformatics 2024; 40:btae097. [PMID: 38485699 PMCID: PMC10955267 DOI: 10.1093/bioinformatics/btae097] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/19/2023] [Revised: 12/22/2023] [Accepted: 03/13/2024] [Indexed: 03/22/2024] Open
Abstract
MOTIVATION Local alignments of query sequences in large databases represent a core part of metagenomic studies and facilitate homology search. Following the development of NCBI Blast, many applications aimed to provide faster and equally sensitive local alignment frameworks. Most applications focus on protein alignments, while only few also facilitate DNA-based searches. None of the established programs allow searching DNA sequences from bisulfite sequencing experiments commonly used for DNA methylation profiling, for which specific alignment strategies need to be implemented. RESULTS Here, we introduce Lambda3, a new version of the local alignment application Lambda. Lambda3 is the first solution that enables the search of protein, nucleotide as well as bisulfite-converted nucleotide query sequences. Its protein mode achieves comparable performance to that of the highly optimized protein alignment application Diamond, while the nucleotide mode consistently outperforms established local nucleotide aligners. Combined, Lambda3 presents a universal local alignment framework that enables fast and sensitive homology searches for a wide range of use-cases. AVAILABILITY AND IMPLEMENTATION Lambda3 is free and open-source software publicly available at https://github.com/seqan/lambda/.
Collapse
Affiliation(s)
| | - Sara Hetzel
- Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
| | - Simon G Gottlieb
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin 14195, Germany
- Institute for Bio- and Geosciences, Forschungszentrum Jülich GmbH, Jülich 52428, Germany
| | - Helene Kretzmer
- Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
| | - Alexander Meissner
- Department of Genome Regulation, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
- Department of Biology, Chemistry and Pharmacy, Freie Universität Berlin, Berlin 14195, Germany
| | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Berlin 14195, Germany
- Efficient Algorithms for Omics Data Group, Max Planck Institute for Molecular Genetics, Berlin 14195, Germany
| |
Collapse
|
3
|
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
Collapse
Affiliation(s)
- Hongyu Zheng
- Computer Science Department, Princeton University, Princeton, New Jersey, USA
| | - Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
Collapse
|
4
|
Mehringer S, Seiler E, Droop F, Darvish M, Rahn R, Vingron M, Reinert K. Hierarchical Interleaved Bloom Filter: enabling ultrafast, approximate sequence queries. Genome Biol 2023; 24:131. [PMID: 37259161 DOI: 10.1186/s13059-023-02971-4] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/30/2022] [Accepted: 05/11/2023] [Indexed: 06/02/2023] Open
Abstract
We present a novel data structure for searching sequences in large databases: the Hierarchical Interleaved Bloom Filter (HIBF). It is extremely fast and space efficient, yet so general that it could serve as the underlying engine for many applications. We show that the HIBF is superior in build time, index size, and search time while achieving a comparable or better accuracy compared to other state-of-the-art tools. The HIBF builds an index up to 211 times faster, using up to 14 times less space, and can answer approximate membership queries faster by a factor of up to 129.
Collapse
Affiliation(s)
- Svenja Mehringer
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany.
| | - Enrico Seiler
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
| | - Felix Droop
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
| | - Mitra Darvish
- MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
| | - René Rahn
- MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
| | - Martin Vingron
- MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
| | - Knut Reinert
- Department of Mathematics and Computer Science, Freie Universität Berlin, Takustr. 9, 14195, Berlin, Germany
- MPI for Molecular Genetics, Ihnestr. 63, 14195, Berlin, Germany
| |
Collapse
|
5
|
Darvish M, Seiler E, Mehringer S, Rahn R, Reinert K. Needle: a fast and space-efficient prefilter for estimating the quantification of very large collections of expression experiments. Bioinformatics 2022; 38:4100-4108. [PMID: 35801930 PMCID: PMC9438961 DOI: 10.1093/bioinformatics/btac492] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/12/2022] [Revised: 05/23/2022] [Accepted: 07/07/2022] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The ever-growing size of sequencing data is a major bottleneck in bioinformatics as the advances of hardware development cannot keep up with the data growth. Therefore, an enormous amount of data is collected but rarely ever reused, because it is nearly impossible to find meaningful experiments in the stream of raw data. RESULTS As a solution, we propose Needle, a fast and space-efficient index which can be built for thousands of experiments in <2 h and can estimate the quantification of a transcript in these experiments in seconds, thereby outperforming its competitors. The basic idea of the Needle index is to create multiple interleaved Bloom filters that each store a set of representative k-mers depending on their multiplicity in the raw data. This is then used to quantify the query. AVAILABILITY AND IMPLEMENTATION https://github.com/seqan/needle. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Enrico Seiler
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany,Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, 14195 Berlin, Germany
| | - Svenja Mehringer
- Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, 14195 Berlin, Germany
| | - René Rahn
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany
| | - Knut Reinert
- Efficient Algorithms for Omics Data, Max Planck Institute for Molecular Genetics, Berlin, Germany,Algorithmic Bioinformatics, Institute for Bioinformatics, FU Berlin, 14195 Berlin, Germany
| |
Collapse
|