1
|
Depuydt L, Renders L, Van de Vyver S, Veys L, Gagie T, Fostier J. b-move: faster bidirectional character extensions in a run-length compressed index. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.30.596587. [PMID: 38854079 PMCID: PMC11160816 DOI: 10.1101/2024.05.30.596587] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2024]
Abstract
Due to the increasing availability of high-quality genome sequences, pan-genomes are gradually replacing single consensus reference genomes in many bioinformatics pipelines to better capture genetic diversity. Traditional bioinformatics tools using the FM-index face memory limitations with such large genome collections. Recent advancements in run-length compressed indices like Gagie et al.'s r-index and Nishimoto and Tabei's move structure, alleviate memory constraints but focus primarily on backward search for MEM-finding. Arakawa et al.'s br-index initiates complete approximate pattern matching using bidirectional search in run-length compressed space, but with significant computational overhead due to complex memory access patterns. We introduce b-move, a novel bidirectional extension of the move structure, enabling fast, cache-efficient bidirectional character extensions in run-length compressed space. It achieves bidirectional character extensions up to 8 times faster than the br-index, closing the performance gap with FM-index-based alternatives, while maintaining the br-index's favorable memory characteristics. For example, all available complete E. coli genomes on NCBI's RefSeq collection can be compiled into a b-move index that fits into the RAM of a typical laptop. Thus, b-move proves practical and scalable for pan-genome indexing and querying. We provide a C++ implementation of b-move, supporting efficient lossless approximate pattern matching including locate functionality, available at https://github.com/biointec/b-move under the AGPL-3.0 license. Funding Lore Depuydt : PhD Fellowship FR (1117322N), Research Foundation - Flanders (FWO) Luca Renders : PhD Fellowship SB (1SE7822N), Research Foundation - Flanders (FWO) Travis Gagie : NSERC Discovery Grant RGPIN-07185-2020 to Travis Gagie and NIH grant R01HG011392 to Ben Langmead.
Collapse
|
2
|
Hwang S, Brown NK, Ahmed OY, Jenike KM, Kovaka S, Schatz MC, Langmead B. MEM-based pangenome indexing for k-mer queries. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.05.20.595044. [PMID: 38826299 PMCID: PMC11142109 DOI: 10.1101/2024.05.20.595044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/04/2024]
Abstract
Pangenomes are growing in number and size, thanks to the prevalence of high-quality long-read assemblies. However, current methods for studying sequence composition and conservation within pangenomes have limitations. Methods based on graph pangenomes require a computationally expensive multiple-alignment step, which can leave out some variation. Indexes based on k-mers and de Bruijn graphs are limited to answering questions at a specific substring length k. We present Maximal Exact Match Ordered (MEMO), a pangenome indexing method based on maximal exact matches (MEMs) between sequences. A single MEMO index can handle arbitrary-length queries over pangenomic windows. MEMO enables both queries that test k-mer presence/absence (membership queries) and that count the number of genomes containing k-mers in a window (conservation queries). MEMO's index for a pangenome of 89 human autosomal haplotypes fits in 2.04 GB, 8.8× smaller than a comparable KMC3 index and 11.4× smaller than a PanKmer index. MEMO indexes can be made smaller by sacrificing some counting resolution, with our decile-resolution HPRC index reaching 0.67 GB. MEMO can conduct a conservation query for 31-mers over the human leukocyte antigen locus in 13.89 seconds, 2.5x faster than other approaches. MEMO's small index size, lack of k-mer length dependence, and efficient queries make it a flexible tool for studying and visualizing substring conservation in pangenomes.
Collapse
Affiliation(s)
- Stephen Hwang
- XDBio Program, Johns Hopkins University, Baltimore MD, USA
| | - Nathaniel K. Brown
- Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
| | - Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
| | | | - Sam Kovaka
- Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
| | - Michael C. Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore MD, USA
| |
Collapse
|
3
|
Zakeri M, Brown NK, Ahmed OY, Gagie T, Langmead B. Movi: a fast and cache-efficient full-text pangenome index. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2023.11.04.565615. [PMID: 37961660 PMCID: PMC10635132 DOI: 10.1101/2023.11.04.565615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/15/2023]
Abstract
Efficient pangenome indexes are promising tools for many applications, including rapid classification of nanopore sequencing reads. Recently, a compressed-index data structure called the "move structure" was proposed as an alternative to other BWT-based indexes like the FM index and r-index. The move structure uniquely achieves both O(r) space and O(1)-time queries, where r is the number of runs in the pangenome BWT. We implemented Movi, an efficient tool for building and querying move-structure pangenome indexes. While the size of the Movi's index is larger than the r-index, it scales at a smaller rate for pangenome references, as its size is exactly proportional to r, the number of runs in the BWT of the reference. Movi can compute sophisticated matching queries needed for classification - such as pseudo-matching lengths and backward search - at least ten times faster than the fastest available methods, and in some cases more than 30-fold faster. Movi achieves this speed by leveraging the move structure's strong locality of reference, incurring close to the minimum possible number of cache misses for queries against large pangenomes. We achieve still further speed improvements by using memory prefetching to attain a degree of latency hiding that would be difficult with other index structures like the r-index. Movi's fast constant-time query loop makes it well suited to real-time applications like adaptive sampling for nanopore sequencing, where decisions must be made in a small and predictable time interval.
Collapse
Affiliation(s)
- Mohsen Zakeri
- Department of Computer Science, Johns Hopkins University
| | | | - Omar Y Ahmed
- Department of Computer Science, Johns Hopkins University
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University
| |
Collapse
|
4
|
Cozzi D, Rossi M, Rubinacci S, Gagie T, Köppl D, Boucher C, Bonizzoni P. μ- PBWT: a lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics 2023; 39:btad552. [PMID: 37688560 PMCID: PMC10502237 DOI: 10.1093/bioinformatics/btad552] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/05/2023] [Revised: 07/07/2023] [Accepted: 09/07/2023] [Indexed: 09/11/2023] Open
Abstract
MOTIVATION The Positional Burrows-Wheeler Transform (PBWT) is a data structure that indexes haplotype sequences in a manner that enables finding maximal haplotype matches in h sequences containing w variation sites in O(hw) time. This represents a significant improvement over classical quadratic-time approaches. However, the original PBWT data structure does not allow for queries over Biobank panels that consist of several millions of haplotypes, if an index of the haplotypes must be kept entirely in memory. RESULTS In this article, we leverage the notion of r-index proposed for the BWT to present a memory-efficient method for constructing and storing the run-length encoded PBWT, and computing set maximal matches (SMEMs) queries in haplotype sequences. We implement our method, which we refer to as μ-PBWT, and evaluate it on datasets of 1000 Genome Project and UK Biobank data. Our experiments demonstrate that the μ-PBWT reduces the memory usage up to a factor of 20% compared to the best current PBWT-based indexing. In particular, μ-PBWT produces an index that stores high-coverage whole genome sequencing data of chromosome 20 in about a third of the space of its BCF file. μ-PBWT is an adaptation of techniques for the run-length compressed BWT for the PBWT (RLPBWT) and it is based on keeping in memory only a succinct representation of the RLPBWT that still allows the efficient computation of set maximal matches (SMEMs) over the original panel. AVAILABILITY AND IMPLEMENTATION Our implementation is open source and available at https://github.com/dlcgold/muPBWT. The binary is available at https://bioconda.github.io/recipes/mupbwt/README.html.
Collapse
Affiliation(s)
- Davide Cozzi
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan 20126, Italy
| | - Massimiliano Rossi
- Department of Computer & Information Science & Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, United States
| | - Simone Rubinacci
- Department of Computational Biology, University of Lausanne, Lausanne 1015, Switzerland
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax B3H 4R2, Canada
| | - Dominik Köppl
- M&D Data Science Center, Tokyo Medical and Dental University, Tokyo 113-8510, Japan
- Department of Computer Science, University of Muenster, Muenster 48149, Germany
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, Herbert-Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, United States
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan 20126, Italy
| |
Collapse
|
5
|
Shivakumar VS, Ahmed OY, Kovaka S, Zakeri M, Langmead B. Sigmoni: classification of nanopore signal with a compressed pangenome index. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.08.15.553308. [PMID: 37645873 PMCID: PMC10462034 DOI: 10.1101/2023.08.15.553308] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/31/2023]
Abstract
Improvements in nanopore sequencing necessitate efficient classification methods, including pre-filtering and adaptive sampling algorithms that enrich for reads of interest. Signal-based approaches circumvent the computational bottleneck of basecalling. But past methods for signal-based classification do not scale efficiently to large, repetitive references like pangenomes, limiting their utility to partial references or individual genomes. We introduce Sigmoni: a rapid, multiclass classification method based on the r-index that scales to references of hundreds of Gbps. Sigmoni quantizes nanopore signal into a discrete alphabet of picoamp ranges. It performs rapid, approximate matching using matching statistics, classifying reads based on distributions of picoamp matching statistics and co-linearity statistics. Sigmoni is 10-100× faster than previous methods for adaptive sampling in host depletion experiments with improved accuracy, and can query reads against large microbial or human pangenomes.
Collapse
Affiliation(s)
| | - Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University
| | - Sam Kovaka
- Department of Computer Science, Johns Hopkins University
| | - Mohsen Zakeri
- Department of Computer Science, Johns Hopkins University
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University
| |
Collapse
|
6
|
Ma J, Cáceres M, Salmela L, Mäkinen V, Tomescu AI. Chaining for accurate alignment of erroneous long reads to acyclic variation graphs. Bioinformatics 2023; 39:btad460. [PMID: 37494467 PMCID: PMC10423031 DOI: 10.1093/bioinformatics/btad460] [Citation(s) in RCA: 4] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2022] [Revised: 06/08/2023] [Accepted: 07/25/2023] [Indexed: 07/28/2023] Open
Abstract
MOTIVATION Aligning reads to a variation graph is a standard task in pangenomics, with downstream applications such as improving variant calling. While the vg toolkit [Garrison et al. (Variation graph toolkit improves read mapping by representing genetic variation in the reference. Nat Biotechnol 2018;36:875-9)] is a popular aligner of short reads, GraphAligner [Rautiainen and Marschall (GraphAligner: rapid and versatile sequence-to-graph alignment. Genome Biol 2020;21:253-28)] is the state-of-the-art aligner of erroneous long reads. GraphAligner works by finding candidate read occurrences based on individually extending the best seeds of the read in the variation graph. However, a more principled approach recognized in the community is to co-linearly chain multiple seeds. RESULTS We present a new algorithm to co-linearly chain a set of seeds in a string labeled acyclic graph, together with the first efficient implementation of such a co-linear chaining algorithm into a new aligner of erroneous long reads to acyclic variation graphs, GraphChainer. We run experiments aligning real and simulated PacBio CLR reads with average error rates 15% and 5%. Compared to GraphAligner, GraphChainer aligns 12-17% more reads, and 21-28% more total read length, on real PacBio CLR reads from human chromosomes 1, 22, and the whole human pangenome. On both simulated and real data, GraphChainer aligns between 95% and 99% of all reads, and of total read length. We also show that minigraph [Li et al. (The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265-19.)] and minichain [Chandra and Jain (Sequence to graph alignment using gap-sensitive co-linear chaining. In: Proceedings of the 27th Annual International Conference on Research in Computational Molecular Biology (RECOMB 2023). Springer, 2023, 58-73.)] obtain an accuracy of <60% on this setting. AVAILABILITY AND IMPLEMENTATION GraphChainer is freely available at https://github.com/algbio/GraphChainer. The datasets and evaluation pipeline can be reached from the previous address.
Collapse
Affiliation(s)
- Jun Ma
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Manuel Cáceres
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Leena Salmela
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Veli Mäkinen
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| | - Alexandru I Tomescu
- Department of Computer Science, University of Helsinki, 00014 Helsinki, Finland
| |
Collapse
|
7
|
Ahmed O, Rossi M, Boucher C, Langmead B. Efficient taxa identification using a pangenome index. Genome Res 2023; 33:1069-1077. [PMID: 37258301 PMCID: PMC10538492 DOI: 10.1101/gr.277642.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/04/2023] [Accepted: 05/22/2023] [Indexed: 06/02/2023]
Abstract
Tools that classify sequencing reads against a database of reference sequences require efficient index data-structures. The r-index is a compressed full-text index that answers substring presence/absence, count, and locate queries in space proportional to the amount of distinct sequence in the database: [Formula: see text] space, where r is the number of Burrows-Wheeler runs. To date, the r-index has lacked the ability to quickly classify matches according to which reference sequences (or sequence groupings, i.e., taxa) a match overlaps. We present new algorithms and methods for solving this problem. Specifically, given a collection D of d documents, [Formula: see text] over an alphabet of size σ, we extend the r-index with [Formula: see text] additional words to support document listing queries for a pattern [Formula: see text] that occurs in [Formula: see text] documents in D in [Formula: see text] time and [Formula: see text] space, where w is the machine word size. Applied in a bacterial mock community experiment, our method is up to three times faster than a comparable method that uses the standard r-index locate queries. We show that our method classifies both simulated and real nanopore reads at the strain level with higher accuracy compared with other approaches. Finally, we present strategies for compacting this structure in applications in which read lengths or match lengths can be bounded.
Collapse
Affiliation(s)
- Omar Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA;
| | - Massimiliano Rossi
- Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, Herbert Wertheim College of Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland 21218, USA
| |
Collapse
|
8
|
Leonard AS, Crysnanto D, Mapel XM, Bhati M, Pausch H. Graph construction method impacts variation representation and analyses in a bovine super-pangenome. Genome Biol 2023; 24:124. [PMID: 37217946 PMCID: PMC10204317 DOI: 10.1186/s13059-023-02969-y] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2022] [Accepted: 05/10/2023] [Indexed: 05/24/2023] Open
Abstract
BACKGROUND Several models and algorithms have been proposed to build pangenomes from multiple input assemblies, but their impact on variant representation, and consequently downstream analyses, is largely unknown. RESULTS We create multi-species super-pangenomes using pggb, cactus, and minigraph with the Bos taurus taurus reference sequence and eleven haplotype-resolved assemblies from taurine and indicine cattle, bison, yak, and gaur. We recover 221 k nonredundant structural variations (SVs) from the pangenomes, of which 135 k (61%) are common to all three. SVs derived from assembly-based calling show high agreement with the consensus calls from the pangenomes (96%), but validate only a small proportion of variations private to each graph. Pggb and cactus, which also incorporate base-level variation, have approximately 95% exact matches with assembly-derived small variant calls, which significantly improves the edit rate when realigning assemblies compared to minigraph. We use the three pangenomes to investigate 9566 variable number tandem repeats (VNTRs), finding 63% have identical predicted repeat counts in the three graphs, while minigraph can over or underestimate the count given its approximate coordinate system. We examine a highly variable VNTR locus and show that repeat unit copy number impacts the expression of proximal genes and non-coding RNA. CONCLUSIONS Our findings indicate good consensus between the three pangenome methods but also show their individual strengths and weaknesses that need to be considered when analysing different types of variants from multiple input assemblies.
Collapse
Affiliation(s)
- Alexander S Leonard
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland.
| | - Danang Crysnanto
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Xena M Mapel
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Meenu Bhati
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland
| | - Hubert Pausch
- Animal Genomics, ETH Zurich, Universitaetstrasse 2, 8092, Zurich, Switzerland.
| |
Collapse
|
9
|
Ahmed OY, Rossi M, Gagie T, Boucher C, Langmead B. SPUMONI 2: improved classification using a pangenome index of minimizer digests. Genome Biol 2023; 24:122. [PMID: 37202771 PMCID: PMC10197461 DOI: 10.1186/s13059-023-02958-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/12/2022] [Accepted: 05/03/2023] [Indexed: 05/20/2023] Open
Abstract
Genomics analyses use large reference sequence collections, like pangenomes or taxonomic databases. SPUMONI 2 is an efficient tool for sequence classification of both short and long reads. It performs multi-class classification using a novel sampled document array. By incorporating minimizers, SPUMONI 2's index is 65 times smaller than minimap2's for a mock community pangenome. SPUMONI 2 achieves a speed improvement of 3-fold compared to SPUMONI and 15-fold compared to minimap2. We show SPUMONI 2 achieves an advantageous mix of accuracy and efficiency in practical scenarios such as adaptive sampling, contamination detection and multi-class metagenomics classification.
Collapse
Affiliation(s)
- Omar Y. Ahmed
- Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
| | - Massimiliano Rossi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL USA
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, NS Canada
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD USA
| |
Collapse
|
10
|
Mun T, Vaddadi NSK, Langmead B. Pangenomic genotyping with the marker array. Algorithms Mol Biol 2023; 18:2. [PMID: 37147657 PMCID: PMC10161648 DOI: 10.1186/s13015-023-00225-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Accepted: 04/22/2023] [Indexed: 05/07/2023] Open
Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect from large panels like the 1000 Genomes Project while reducing the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods. The method is implemented in the open source software tool rowbowt available at https://github.com/alshai/rowbowt .
Collapse
Affiliation(s)
- Taher Mun
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| | | | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA.
| |
Collapse
|
11
|
Martínez-Guardiola C, Brown NK, Silva-Coira F, Köppl D, Gagie T, Ladra S. Augmented Thresholds for MONI. PROCEEDINGS. DATA COMPRESSION CONFERENCE 2023; 2023:268-277. [PMID: 38818281 PMCID: PMC11138128 DOI: 10.1109/dcc55655.2023.00035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
MONI (Rossi et al., 2022) can store a pangenomic dataset T in small space and later, given a pattern P, quickly find the maximal exact matches (MEMs) of P with respect to T. In this paper we consider its one-pass version (Boucher et al., 2021), whose query times are dominated in our experiments by longest common extension (LCE) queries. We show how a small modification lets us avoid most of these queries which significantly speeds up MONI in practice while only slightly increasing its size.
Collapse
|
12
|
Mun T, Vaddadi NSK, Langmead B. Pangenomic Genotyping with the Marker Array. ALGORITHMS IN BIOINFORMATICS : ... INTERNATIONAL WORKSHOP, WABI ..., PROCEEDINGS. WABI (WORKSHOP) 2022; 242:19. [PMID: 36409181 PMCID: PMC9674407] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/16/2023]
Abstract
We present a new method and software tool called rowbowt that applies a pangenome index to the problem of inferring genotypes from short-read sequencing data. The method uses a novel indexing structure called the marker array. Using the marker array, we can genotype variants with respect from large panels like the 1000 Genomes Project while avoiding the reference bias that results when aligning to a single linear reference. rowbowt can infer accurate genotypes in less time and memory compared to existing graph-based methods.
Collapse
Affiliation(s)
- Taher Mun
- Johns Hopkins University, Baltimore MD, USA; Illumina, San Diego, USA
| | | | | |
Collapse
|
13
|
Rossi M, Oliva M, Bonizzoni P, Langmead B, Gagie T, Boucher C. Finding Maximal Exact Matches Using the r-Index. J Comput Biol 2022; 29:188-194. [PMID: 35041518 PMCID: PMC8902461 DOI: 10.1089/cmb.2021.0445] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/03/2023] Open
Abstract
Efficiently finding maximal exact matches (MEMs) between a sequence read and a database of genomes is a key first step in read alignment. But until recently, it was unknown how to build a data structure in [Formula: see text] space that supports efficient MEM finding, where r is the number of runs in the Burrows-Wheeler Transform. In 2021, Rossi et al. showed how to build a small auxiliary data structure called thresholds in addition to the r-index in [Formula: see text] space. This addition enables efficient MEM finding using the r-index. In this article, we present the tool that implements this solution, which we call MONI. Namely, we give a high-level view of the main components of the data structure and show how the source code can be downloaded, compiled, and used to find MEMs between a set of sequence reads and a set of genomes.
Collapse
Affiliation(s)
- Massimiliano Rossi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA.,Address correspondence to: Dr. Massimiliano Rossi, Department of Computer and Information Science and Engineering, P.O. Box 116120, University of Florida, Gainesville, FL 32611-6550, USA
| | - Marco Oliva
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, Italy
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA
| | - Travis Gagie
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida, USA
| |
Collapse
|