1
|
Guo LL. Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index. J Comput Biol 2025. [PMID: 40314133 DOI: 10.1089/cmb.2024.0925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/03/2025] Open
Abstract
Pattern locating is a crucial step in various biological sequence analysis tasks. As a compressed full-text indexing technology, full-text minute-space index has been introduced for biological pattern locating over ultra-long genomes, with a low memory footprint and retrieving time independent of genome size. However, its locating time is limited by the number of occurrences of the biological pattern in the genome, and it is not efficient enough when dealing with mass-occurrence biological patterns. To solve this problem, we propose an efficient locating algorithm for mass-occurrence biological patterns in genomic sequence, namely Effloc. It is developed on two optimization techniques. One is that rankings with the same Burrows-Wheeler Transform character are organized into a group and calculated together, thereby reducing the number of last-to-first column (LF) mapping operations required to jump forward to find suffix array (SA) sampling points; the other is to design a specific structure to record the jump status, thus avoiding the redundant LF mapping operations that exist in the process of finding SA sampling points for those adjacent patterns that share the same sampling point. Compared with the existing algorithm, Effloc can significantly reduce the number of time-consuming LF mapping operations in mass-occurrence pattern locating. Ablation experiments verified our algorithm's effectiveness, exhibiting faster locating speed compared with five state-of-the-art competing algorithms. The source code and data are released at https://github.com/Lilu-guo/Effloc.
Collapse
Affiliation(s)
- Li-Lu Guo
- Department of Computer Science, Xidian University, Xi'an, Shaanxi, China
| |
Collapse
|
2
|
Singh NP, Khan J, Patro R. Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.11.27.625771. [PMID: 39677745 PMCID: PMC11642815 DOI: 10.1101/2024.11.27.625771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 12/17/2024]
Abstract
Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into "virtual colors…. These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac . We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC . Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual colorenhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry ) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger . Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, Chip-seq and Hi-C.
Collapse
|
3
|
Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024; 23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open
Abstract
The decrease in sequencing expenses has facilitated the creation of reference genomes and proteomes for an expanding array of organisms. Nevertheless, no established repository that details organism-specific genomic and proteomic sequences of specific lengths, referred to as kmers, exists to our knowledge. In this article, we present kmerDB, a database accessible through an interactive web interface that provides kmer-based information from genomic and proteomic sequences in a systematic way. kmerDB currently contains 202,340,859,107 base pairs and 19,304,903,356 amino acids, spanning 54,039 and 21,865 reference genomes and proteomes, respectively, as well as 6,905,362 and 149,305,183 genomic and proteomic species-specific sequences, termed quasi-primes. Additionally, we provide access to 5,186,757 nucleic and 214,904,089 peptide sequences absent from every genome and proteome, termed primes. kmerDB features a user-friendly interface offering various search options and filters for easy parsing and searching. The service is available at: www.kmerdb.com.
Collapse
Affiliation(s)
- Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
| | - Fotis A. Baltoumas
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Michail Patsakis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
- Department of Statistics, The Pennsylvania State University, University Park, PA, USA
| | - Eleni Aplakidou
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
| | - George C. Georgakopoulos
- National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
| | - Anshuman Das
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Dionysios V. Chartoumpekis
- Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
| | - Jasna Kovac
- Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
| | - Georgios A. Pavlopoulos
- Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
- Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
| | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| |
Collapse
|
4
|
Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs . J Comput Biol 2024; 31:1022-1044. [PMID: 39381838 PMCID: PMC11631793 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/10/2024] Open
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k-mers to their color sets. The color set of a k-mer is the set of all identifiers, or colors, of the references that contain the k-mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes.
Collapse
Affiliation(s)
| | | | - Jason Fan
- Fulcrum Genomics LLC, Somerville, Massachusetts, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland, USA
| |
Collapse
|
5
|
Taylor DJ, Eizenga JM, Li Q, Das A, Jenike KM, Kenny EE, Miga KH, Monlong J, McCoy RC, Paten B, Schatz MC. Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References. Annu Rev Genomics Hum Genet 2024; 25:77-104. [PMID: 38663087 PMCID: PMC11451085 DOI: 10.1146/annurev-genom-021623-081639] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/29/2024]
Abstract
The Human Genome Project was an enormous accomplishment, providing a foundation for countless explorations into the genetics and genomics of the human species. Yet for many years, the human genome reference sequence remained incomplete and lacked representation of human genetic diversity. Recently, two major advances have emerged to address these shortcomings: complete gap-free human genome sequences, such as the one developed by the Telomere-to-Telomere Consortium, and high-quality pangenomes, such as the one developed by the Human Pangenome Reference Consortium. Facilitated by advances in long-read DNA sequencing and genome assembly algorithms, complete human genome sequences resolve regions that have been historically difficult to sequence, including centromeres, telomeres, and segmental duplications. In parallel, pangenomes capture the extensive genetic diversity across populations worldwide. Together, these advances usher in a new era of genomics research, enhancing the accuracy of genomic analysis, paving the path for precision medicine, and contributing to deeper insights into human biology.
Collapse
Affiliation(s)
- Dylan J Taylor
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA; , ,
| | - Jordan M Eizenga
- Genomics Institute, University of California, Santa Cruz, California, USA; , ,
| | - Qiuhui Li
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA; ,
| | - Arun Das
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA; ,
| | - Katharine M Jenike
- Department of Genetic Medicine, Johns Hopkins University School of Medicine, Baltimore, Maryland, USA;
| | - Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA;
| | - Karen H Miga
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA; , ,
| | - Jean Monlong
- Institut de Recherche en Santé Digestive, Université de Toulouse, INSERM, INRA, ENVT, UPS, Toulouse, France;
| | - Rajiv C McCoy
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA; , ,
| | - Benedict Paten
- Department of Biomolecular Engineering, University of California, Santa Cruz, California, USA
- Genomics Institute, University of California, Santa Cruz, California, USA; , ,
| | - Michael C Schatz
- Department of Computer Science, Johns Hopkins University, Baltimore, Maryland, USA; ,
- Department of Biology, Johns Hopkins University, Baltimore, Maryland, USA; , ,
| |
Collapse
|
6
|
Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs ⋆. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/20/2024]
Abstract
We describe lossless compressed data structures for the colored de Bruijn graph (or, c-dBG). Given a collection of reference sequences, a c-dBG can be essentially regarded as a map from k -mers to their color sets . The color set of a k -mer is the set of all identifiers, or colors , of the references that contain the k -mer. While these maps find countless applications in computational biology (e.g., basic query, reading mapping, abundance estimation, etc.), their memory usage represents a serious challenge for large-scale sequence indexing. Our solutions leverage on the intrinsic repetitiveness of the color sets when indexing large collections of related genomes. Hence, the described algorithms factorize the color sets into patterns that repeat across the entire collection and represent these patterns once, instead of redundantly replicating their representation as would happen if the sets were encoded as atomic lists of integers. Experimental results across a range of datasets and query workloads show that these representations substantially improve over the space effectiveness of the best previous solutions (sometimes, even dramatically, yielding indexes that are smaller by an order of magnitude). Despite the space reduction, these indexes only moderately impact the efficiency of the queries compared to the fastest indexes. Software The implementation of the indexes used for all experiments in this work is written in C++17 and is available at https://github.com/jermp/fulgor .
Collapse
|
7
|
Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024; 40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open
Abstract
MOTIVATION Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy. RESULTS We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment. AVAILABILITY AND IMPLEMENTATION The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.
Collapse
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
| | - Nika Mansouri Ghiasi
- Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, 8092, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- ETH AI Center, Zurich, 8092, Switzerland
- Department of Biology, ETH Zurich, Zurich, 8093, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, 8092, Switzerland
- Biomedical Informatics Group, University Hospital Zurich, Zurich, 8091, Switzerland
- Biomedical Informatics, Swiss Institute of Bioinformatics, Zurich, 8092, Switzerland
- The LOOP Zurich—Medical Research Center, Zurich, 8044, Switzerland
| |
Collapse
|
8
|
Shiryev SA, Agarwala R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 2024; 21:994-1002. [PMID: 38755321 PMCID: PMC11166510 DOI: 10.1038/s41592-024-02280-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Accepted: 04/08/2024] [Indexed: 05/18/2024]
Abstract
Searching vast and rapidly growing nucleotide content in resources, such as runs in the Sequence Read Archive and assemblies for whole-genome shotgun sequencing projects in GenBank, is currently impractical for most researchers. Here we present Pebblescout, a tool that navigates such content by providing indexing and search capabilities. Indexing uses dense sampling of the sequences in the resource. Search finds subjects (runs or assemblies) that have short sequence matches to a user query, with well-defined guarantees and ranks them using informativeness of the matches. We illustrate the functionality of Pebblescout by creating eight databases that index over 3.7 petabases. The web service of Pebblescout can be reached at https://pebblescout.ncbi.nlm.nih.gov . We show that for a wide range of query lengths, Pebblescout provides a data-driven way for finding relevant subsets of large nucleotide resources, reducing the effort for downstream analysis substantially. We also show that Pebblescout results compare favorably to MetaGraph and Sourmash.
Collapse
Affiliation(s)
- Sergey A Shiryev
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
| | - Richa Agarwala
- Department of Health and Human Services, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
| |
Collapse
|
9
|
Yu C, Zhao Y, Zhao C, Jin J, Mao K, Wang G. MiniDBG: A Novel and Minimal De Bruijn Graph for Read Mapping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:129-142. [PMID: 38060353 DOI: 10.1109/tcbb.2023.3340251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2024]
Abstract
The De Bruijn graph (DBG) has been widely used in the algorithms for indexing or organizing read and reference sequences in bioinformatics. However, a DBG model that can locate each node, edge and path on sequence has not been proposed so far. Recently, DBG has been used for representing reference sequences in read mapping tasks. In this process, it is not a one-to-one correspondence between the paths of DBG and the substrings of reference sequence. This results in the false path on DBG, which means no substrings of reference producing the path. Moreover, if a candidate path of a read is true, we need to locate it and verify the candidate on sequence. To solve these problems, we proposed a DBG model, called MiniDBG, which stores the position lists of a minimal set of edges. With the position lists, MiniDBG can locate any node, edge and path efficiently. We also proposed algorithms for generating MiniDBG based on an original DBG and algorithms for locating edges or paths on sequence. We designed and ran experiments on real datasets for comparing them with BWT-based and position list-based methods. The experimental results show that MiniDBG can locate the edges and paths efficiently with lower memory costs.
Collapse
|
10
|
Pibiri GE, Fan J, Patro R. Meta-colored compacted de Bruijn graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.21.550101. [PMID: 37546988 PMCID: PMC10401949 DOI: 10.1101/2023.07.21.550101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/08/2023]
Abstract
MOTIVATION The colored compacted de Bruijn graph (c-dBG) has become a fundamental tool used across several areas of genomics and pangenomics. For example, it has been widely adopted by methods that perform read mapping or alignment, abundance estimation, and subsequent downstream analyses. These applications essentially regard the c-dBG as a map from k-mers to the set of references in which they appear. The c-dBG data structure should retrieve this set -- the color of the k-mer -- efficiently for any given k-mer, while using little memory. To aid retrieval, the colors are stored explicitly in the data structure and take considerable space for large reference collections, even when compressed. Reducing the space of the colors is therefore of utmost importance for large-scale sequence indexing. RESULTS We describe the meta-colored compacted de Bruijn graph (Mac-dBG) -- a new colored de Bruijn graph data structure where colors are represented holistically, i.e., taking into account their redundancy across the whole collection being indexed, rather than individually as atomic integer lists. This allows the factorization and compression of common sub-patterns across colors. While optimizing the space of our data structure is NP-hard, we propose a simple heuristic algorithm that yields practically good solutions. Results show that the Mac-dBG data structure improves substantially over the best previous space/time trade-off, by providing remarkably better compression effectiveness for the same (or better) query efficiency. This improved space/time trade-off is robust across different datasets and query workloads. Code availability: A C++17 implementation of the Mac-dBG is publicly available on GitHub at: https://github.com/jermp/fulgor.
Collapse
|
11
|
Depuydt L, Renders L, Abeel T, Fostier J. Pan-genome de Bruijn graph using the bidirectional FM-index. BMC Bioinformatics 2023; 24:400. [PMID: 37884897 PMCID: PMC10605969 DOI: 10.1186/s12859-023-05531-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023] Open
Abstract
BACKGROUND Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. RESULTS We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. CONCLUSIONS We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
Collapse
Affiliation(s)
- Lore Depuydt
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium.
| | - Luca Renders
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
| | - Thomas Abeel
- Delft Bioinformatics Lab, Delft University of Technology, 2628 XE, Delft, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Jan Fostier
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium.
| |
Collapse
|
12
|
He D, Patro R. simpleaf: a simple, flexible, and scalable framework for single-cell data processing using alevin-fry. Bioinformatics 2023; 39:btad614. [PMID: 37802884 PMCID: PMC10580267 DOI: 10.1093/bioinformatics/btad614] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Revised: 09/02/2023] [Accepted: 10/05/2023] [Indexed: 10/08/2023] Open
Abstract
SUMMARY The alevin-fry ecosystem provides a robust and growing suite of programs for single-cell data processing. However, as new single-cell technologies are introduced, as the community continues to adjust best practices for data processing, and as the alevin-fry ecosystem itself expands and grows, it is becoming increasingly important to manage the complexity of alevin-fry's single-cell preprocessing workflows while retaining the performance and flexibility that make these tools enticing. We introduce simpleaf, a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools. AVAILABILITY AND IMPLEMENTATION Simpleaf is written in Rust and released under a BSD 3-Clause license. It is freely available from its GitHub repository https://github.com/COMBINE-lab/simpleaf, and via bioconda. Documentation for simpleaf is available at https://simpleaf.readthedocs.io/en/latest/ and tutorials for simpleaf that have been developed can be accessed at https://combine-lab.github.io/alevin-fry-tutorials.
Collapse
Affiliation(s)
- Dongze He
- Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, United States
| | - Rob Patro
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, 20742, United States
| |
Collapse
|
13
|
Aylward AJ, Petrus S, Mamerto A, Hartwick NT, Michael TP. PanKmer: k-mer-based and reference-free pangenome analysis. Bioinformatics 2023; 39:btad621. [PMID: 37846049 PMCID: PMC10603592 DOI: 10.1093/bioinformatics/btad621] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Revised: 08/29/2023] [Accepted: 10/13/2023] [Indexed: 10/18/2023] Open
Abstract
SUMMARY Pangenomes are replacing single reference genomes as the definitive representation of DNA sequence within a species or clade. Pangenome analysis predominantly leverages graph-based methods that require computationally intensive multiple genome alignments, do not scale to highly complex eukaryotic genomes, limit their scope to identifying structural variants (SVs), or incur bias by relying on a reference genome. Here, we present PanKmer, a toolkit designed for reference-free analysis of pangenome datasets consisting of dozens to thousands of individual genomes. PanKmer decomposes a set of input genomes into a table of observed k-mers and their presence-absence values in each genome. These are stored in an efficient k-mer index data format that encodes SNPs, INDELs, and SVs. It also includes functions for downstream analysis of the k-mer index, such as calculating sequence similarity statistics between individuals at whole-genome or local scales. For example, k-mers can be "anchored" in any individual genome to quantify sequence variability or conservation at a specific locus. This facilitates workflows with various biological applications, e.g. identifying cases of hybridization between plant species. PanKmer provides researchers with a valuable and convenient means to explore the full scope of genetic variation in a population, without reference bias. AVAILABILITY AND IMPLEMENTATION PanKmer is implemented as a Python package with components written in Rust, released under a BSD license. The source code is available from the Python Package Index (PyPI) at https://pypi.org/project/pankmer/ as well as Gitlab at https://gitlab.com/salk-tm/pankmer. Full documentation is available at https://salk-tm.gitlab.io/pankmer/.
Collapse
Affiliation(s)
- Anthony J Aylward
- The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, United States
| | - Semar Petrus
- The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, United States
| | - Allen Mamerto
- The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, United States
| | - Nolan T Hartwick
- The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, United States
| | - Todd P Michael
- The Plant Molecular and Cellular Biology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA 92037, United States
| |
Collapse
|
14
|
Medvedev P. Theoretical Analysis of Sequencing Bioinformatics Algorithms and Beyond. COMMUNICATIONS OF THE ACM 2023; 66:118-125. [PMID: 38736702 PMCID: PMC11087067 DOI: 10.1145/3571723] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/14/2024]
Abstract
A case study reveals the theoretical analysis of algorithms is not always as helpful as standard dogma might suggest.
Collapse
Affiliation(s)
- Paul Medvedev
- Department of Computer Science and Engineering and the Department of Biochemistry and Molecular Biology and the Director of the Center for Computational Biology and Bioinformatics at Pennsylvania State University, University Park, PA, USA
| |
Collapse
|
15
|
Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023; 39:i252-i259. [PMID: 37387170 DOI: 10.1093/bioinformatics/btad225] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making searchable immense public resources accessible is beyond the reach of alignment-based strategies. In recent years, abundant literature tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants while being scalable to collections up to 10 000 eukaryotic samples. Results. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3-6 fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query can need single random access and be performed in constant time in favorable instances. Using limited computation resources, we built PAC for very large collections. They include 32 000 human RNA-seq samples in 5 days, the entire GenBank bacterial genome collection in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed that PAC's ability to query 500 000 transcript sequences in less than an hour. AVAILABILITY AND IMPLEMENTATION PAC's open-source software is available at https://github.com/Malfoy/PAC.
Collapse
Affiliation(s)
- Camille Marchet
- University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
| | - Antoine Limasset
- University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
| |
Collapse
|
16
|
Pibiri GE, Shibuya Y, Limasset A. Locality-preserving minimal perfect hashing of k-mers. Bioinformatics 2023; 39:i534-i543. [PMID: 37387137 PMCID: PMC10311298 DOI: 10.1093/bioinformatics/btad219] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Minimal perfect hashing is the problem of mapping a static set of n distinct keys into the address space {1,…,n} bijectively. It is well-known that n log 2(e) bits are necessary to specify a minimal perfect hash function (MPHF) f, when no additional knowledge of the input keys is to be used. However, it is often the case in practice that the input keys have intrinsic relationships that we can exploit to lower the bit complexity of f. For example, consider a string and the set of all its distinct k-mers as input keys: since two consecutive k-mers share an overlap of k-1 symbols, it seems possible to beat the classic log 2(e) bits/key barrier in this case. Moreover, we would like f to map consecutive k-mers to consecutive addresses, as to also preserve as much as possible their relationship in the codomain. This is a useful feature in practice as it guarantees a certain degree of locality of reference for f, resulting in a better evaluation time when querying consecutive k-mers. RESULTS Motivated by these premises, we initiate the study of a new type of locality-preserving MPHF designed for k-mers extracted consecutively from a collection of strings. We design a construction whose space usage decreases for growing k and discuss experiments with a practical implementation of the method: in practice, the functions built with our method can be several times smaller and even faster to query than the most efficient MPHFs in the literature.
Collapse
Affiliation(s)
| | | | - Antoine Limasset
- University of Lille, CRIStAL CNRS, UMR 9189 , Lille F-59000, France
| |
Collapse
|
17
|
Pibiri GE. On weighted k-mer dictionaries. Algorithms Mol Biol 2023; 18:3. [PMID: 37328897 DOI: 10.1186/s13015-023-00226-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Accepted: 05/13/2023] [Indexed: 06/18/2023] Open
Abstract
We consider the problem of representing a set of [Formula: see text]-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text]-mer is efficient. The representation is called a weighted dictionary of [Formula: see text]-mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text]-mers as a pre-processing step. In fact, [Formula: see text]-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also store compactly the weights of the [Formula: see text]-mers. From a technical perspective, we exploit the order of the [Formula: see text]-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text]-mer dictionary that is exact, weighted, associative, fast, and small.
Collapse
Affiliation(s)
- Giulio Ermanno Pibiri
- Department of Environmental Sciences, Informatics and Statistics (DAIS), Ca' Foscari University of Venice, Venice, Italy.
- ISTI-CNR, Pisa, Italy.
| |
Collapse
|
18
|
He D, Patro R. simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.28.534653. [PMID: 37034702 PMCID: PMC10081176 DOI: 10.1101/2023.03.28.534653] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]
Abstract
Summary The alevin-fry ecosystem provides a robust and growing suite of programs for single-cell data processing. However, as new single-cell technologies are introduced, as the community continues to adjust best practices for data processing, and as the alevin-fry ecosystem itself expands and grows, it is becoming increasingly important to manage the complexity of alevin-fry ’s single-cell preprocessing workflows while retaining the performance and flexibility that make these tools enticing. We introduce simpleaf , a program that simplifies the processing of single-cell data using tools from the alevin-fry ecosystem, and adds new functionality and capabilities, while retaining the flexibility and performance of the underlying tools. Availability and implementation Simpleaf is written in Rust and released under a BSD 3-Clause license. It is freely available from its GitHub repository https://github.com/COMBINE-lab/simpleaf , and via bioconda. Documentation for simpleaf is available at https://simpleaf.readthedocs.io/en/latest/ and tutorials for simpleaf are being developed that can be accessed at https://combine-lab.github.io/alevin-fry-tutorials .
Collapse
Affiliation(s)
- Dongze He
- Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - Rob Patro
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| |
Collapse
|
19
|
He D, Soneson C, Patro R. Understanding and evaluating ambiguity in single-cell and single-nucleus RNA-sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.04.522742. [PMID: 36711921 PMCID: PMC9881993 DOI: 10.1101/2023.01.04.522742] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]
Abstract
Recently, a new modification has been proposed by Hjörleifsson and Sullivan et al. to the model used to classify the splicing status of reads (as spliced (mature), unspliced (nascent), or ambiguous) in single-cell and single-nucleus RNA-seq data. Here, we evaluate both the theoretical basis and practical implementation of the proposed method. The proposed method is highly-conservative, and therefore, unlikely to mischaracterize reads as spliced (mature) or unspliced (nascent) when they are not. However, we find that it leaves a large fraction of reads classified as ambiguous, and, in practice, allocates these ambiguous reads in an all-or-nothing manner, and differently between single-cell and single-nucleus RNA-seq data. Further, as implemented in practice, the ambiguous classification is implicit and based on the index against which the reads are mapped, which leads to several drawbacks compared to methods that consider both spliced (mature) and unspliced (nascent) mapping targets simultaneously - for example, the ability to use confidently assigned reads to rescue ambiguous reads based on shared UMIs and gene targets. Nonetheless, we show that these conservative assignment rules can be obtained directly in existing approaches simply by altering the set of targets that are indexed. To this end, we introduce the spliceu reference and show that its use with alevin-fry recapitulates the more conservative proposed classification. We also observe that, on experimental data, and under the proposed allocation rules for ambiguous UMIs, the difference between the proposed classification scheme and existing conventions appears much smaller than previously reported. We demonstrate the use of the new piscem index for mapping simultaneously against spliced (mature) and unspliced (nascent) targets, allowing classification against the full nascent and mature transcriptome in human or mouse in <3GB of memory. Finally, we discuss the potential of incorporating probabilistic evidence into the inference of splicing status, and suggest that it may provide benefits beyond what can be obtained from discrete classification of UMIs as splicing-ambiguous.
Collapse
Affiliation(s)
- Dongze He
- Department of Cell Biology and Molecular Genetics and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| | - Charlotte Soneson
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland
- SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Rob Patro
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD, USA
| |
Collapse
|
20
|
Gibney D, Thankachan SV, Aluru S. On the Hardness of Sequence Alignment on De Bruijn Graphs. J Comput Biol 2022; 29:1377-1396. [DOI: 10.1089/cmb.2022.0411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open
Affiliation(s)
- Daniel Gibney
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| | - Sharma V. Thankachan
- Department of Computer Science, North Carolina State University, Raleigh, North Carolina, USA
| | - Srinivas Aluru
- School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, Georgia, USA
| |
Collapse
|
21
|
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. Genome Res 2022; 32:1754-1764. [PMID: 35609994 PMCID: PMC9528980 DOI: 10.1101/gr.276607.122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 05/05/2022] [Indexed: 11/25/2022]
Abstract
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and is faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
Collapse
Affiliation(s)
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Harun Mustafa
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Biology at ETH Zurich, 8093 Zurich, Switzerland
- ETH AI Center, ETH Zurich, 8092 Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
Collapse
|
22
|
Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022] Open
Abstract
MOTIVATION The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Diego Santoro
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Leonardo Pellegrina
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Fabio Vandin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| |
Collapse
|
23
|
Qiu Y, Kingsford C. The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. Bioinformatics 2022; 38:i404-i412. [PMID: 35758819 PMCID: PMC9235494 DOI: 10.1093/bioinformatics/btac264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open
Abstract
Motivation Intra-sample heterogeneity describes the phenomenon where a genomic sample contains a diverse set of genomic sequences. In practice, the true string sets in a sample are often unknown due to limitations in sequencing technology. In order to compare heterogeneous samples, genome graphs can be used to represent such sets of strings. However, a genome graph is generally able to represent a string set universe that contains multiple sets of strings in addition to the true string set. This difference between genome graphs and string sets is not well characterized. As a result, a distance metric between genome graphs may not match the distance between true string sets. Results We extend a genome graph distance metric, Graph Traversal Edit Distance (GTED) proposed by Ebrahimpour Boroojeny et al., to FGTED to model the distance between heterogeneous string sets and show that GTED and FGTED always underestimate the Earth Mover’s Edit Distance (EMED) between string sets. We introduce the notion of string set universe diameter of a genome graph. Using the diameter, we are able to upper-bound the deviation of FGTED from EMED and to improve FGTED so that it reduces the average error in empirically estimating the similarity between true string sets. On simulated T-cell receptor sequences and actual Hepatitis B virus genomes, we show that the diameter-corrected FGTED reduces the average deviation of the estimated distance from the true string set distances by more than 250%. Availability and implementation Data and source code for reproducing the experiments are available at: https://github.com/Kingsford-Group/gtedemedtest/. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yutong Qiu
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15232, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15232, USA
| |
Collapse
|
24
|
Abstract
MOTIVATION A dictionary of k-mers is a data structure that stores a set of n distinct k-mers and supports membership queries. This data structure is at the hearth of many important tasks in computational biology. High-throughput sequencing of DNA can produce very large k-mer sets, in the size of billions of strings-in such cases, the memory consumption and query efficiency of the data structure is a concrete challenge. RESULTS To tackle this problem, we describe a compressed and associative dictionary for k-mers, that is: a data structure where strings are represented in compact form and each of them is associated to a unique integer identifier in the range [0,n). We show that some statistical properties of k-mer minimizers can be exploited by minimal perfect hashing to substantially improve the space/time trade-off of the dictionary compared to the best-known solutions. AVAILABILITY AND IMPLEMENTATION https://github.com/jermp/sshash. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
25
|
Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 2022; 19:316-322. [PMID: 35277707 PMCID: PMC8933848 DOI: 10.1038/s41592-022-01408-3] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2021] [Accepted: 01/27/2022] [Indexed: 01/19/2023]
Abstract
The rapid growth of high-throughput single-cell and single-nucleus RNA-sequencing (sc/snRNA-seq) technologies has produced a wealth of data over the past few years. The size, volume, and distinctive characteristics of these data necessitate the development of new computational methods to accurately and efficiently quantify sc/snRNA-seq data into count matrices that constitute the input to downstream analyses. We introduce the alevin-fry framework for quantifying sc/snRNA-seq data. In addition to being faster and more memory frugal than other accurate quantification approaches, alevin-fry ameliorates the memory scalability and false-positive expression issues that are exhibited by other lightweight tools. We demonstrate how alevin-fry can be effectively used to quantify sc/snRNA-seq data, and also how the spliced and unspliced molecule quantification required as input for RNA velocity analyses can be seamlessly extracted from the same preprocessed data used to generate regular gene expression count matrices.
Collapse
|
26
|
Skoufos G, Almodaresi F, Zakeri M, Paulson JN, Patro R, Hatzigeorgiou AG, Vlachos IS. AGAMEMNON: an Accurate metaGenomics And MEtatranscriptoMics quaNtificatiON analysis suite. Genome Biol 2022; 23:39. [PMID: 35101114 PMCID: PMC8802518 DOI: 10.1186/s13059-022-02610-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2020] [Accepted: 01/03/2022] [Indexed: 12/03/2022] Open
Abstract
We introduce AGAMEMNON ( https://github.com/ivlachos/agamemnon ) for the acquisition of microbial abundances from shotgun metagenomics and metatranscriptomic samples, single-microbe sequencing experiments, or sequenced host samples. AGAMEMNON delivers accurate abundances at genus, species, and strain resolution. It incorporates a time and space-efficient indexing scheme for fast pattern matching, enabling indexing and analysis of vast datasets with widely available computational resources. Host-specific modules provide exceptional accuracy for microbial abundance quantification from tissue RNA/DNA sequencing, enabling the expansion of experiments lacking metagenomic/metatranscriptomic analyses. AGAMEMNON provides an R-Shiny application, permitting performance of investigations and visualizations from a graphics interface.
Collapse
Affiliation(s)
- Giorgos Skoufos
- Department of Electrical & Computer Engineering, University of Thessaly, 38221, Volos, Greece.
- Hellenic Pasteur Institute, 11521, Athens, Greece.
- DIANA-Lab, Department of Computer Science and Biomedical Informatics, Univ. of Thessaly, 351 31, Lamia, Greece.
| | - Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Mohsen Zakeri
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Joseph N Paulson
- Department of Data Sciences, Genentech Inc., South San Francisco, CA, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD, USA
| | - Artemis G Hatzigeorgiou
- Department of Electrical & Computer Engineering, University of Thessaly, 38221, Volos, Greece.
- Hellenic Pasteur Institute, 11521, Athens, Greece.
- DIANA-Lab, Department of Computer Science and Biomedical Informatics, Univ. of Thessaly, 351 31, Lamia, Greece.
| | - Ioannis S Vlachos
- Cancer Research Institute | HMS Initiative for RNA Medicine | Department of Pathology, Beth Israel Deaconess Medical Center, Harvard Medical School, Boston, MA, 02115, USA.
- Spatial Technologies Unit, Beth Israel Deaconess Medical Center, MA, Boston, USA.
- Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA.
| |
Collapse
|
27
|
Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2021; 38:604-611. [PMID: 34726732 PMCID: PMC8756200 DOI: 10.1093/bioinformatics/btab749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2021] [Revised: 09/27/2021] [Accepted: 10/28/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Sebastian Niehus
- Regensburg Center for Interventional Immunology (RCI), 93053 Regensburg, Germany
| | | | - Bjarni V Halldórsson
- deCODE Genetics, Reykjavík 102, Iceland,Department of Engineering, School of Technology, Reykjavík University, Reykjavík 102, Iceland
| | - Birte Kehr
- To whom correspondence should be addressed. or
| |
Collapse
|
28
|
Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for k-mers. Bioinformatics 2021; 37:2858-2865. [PMID: 33821954 DOI: 10.1093/bioinformatics/btab217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 02/18/2021] [Accepted: 04/01/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION A plethora of methods and applications share the fundamental need to associate information to words for high-throughput sequence analysis. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to leverage this challenge. Yet, existing data structures are either unable to associate information to k-mers or are not lightweight enough. RESULTS We present BLight, a static and exact data structure able to associate unique identifiers to k-mers and determine their membership in a set without false positive that scales to huge k-mer sets with a low memory cost. This index combines an extremely compact representation along with very fast queries. Besides, its construction is efficient and needs no additional memory. Our implementation achieves to index the k-mers from the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min and the k-mers from the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides a very high throughput: 1.4 million queries per second on a single CPU or 16.1 million using 12 cores. Finally, we also present how BLight can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. AVAILABILITY AND IMPLEMENTATION We wrote the BLight index as an open source C++ library under the AGPL3 license available at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes along with code usage samples.
Collapse
Affiliation(s)
- Camille Marchet
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| | - Mael Kerbiriou
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| | - Antoine Limasset
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| |
Collapse
|
29
|
Khan J, Patro R. Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections. Bioinformatics 2021; 37:i177-i186. [PMID: 34252958 PMCID: PMC8275350 DOI: 10.1093/bioinformatics/btab309] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
Motivation The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- To whom correspondence should be addressed.
| |
Collapse
|
30
|
Alanko J, Alipanahi B, Settle J, Boucher C, Gagie T. Buffering updates enables efficient dynamic de Bruijn graphs. Comput Struct Biotechnol J 2021; 19:4067-4078. [PMID: 34377371 PMCID: PMC8326735 DOI: 10.1016/j.csbj.2021.06.047] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Revised: 06/29/2021] [Accepted: 06/29/2021] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. RESULTS With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al.,2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a), relies on dynamic bit vectors which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.
Collapse
Affiliation(s)
- Jarno Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Bahar Alipanahi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Jonathen Settle
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Travis Gagie
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| |
Collapse
|
31
|
Almodaresi F, Zakeri M, Patro R. Puffaligner : A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index. Bioinformatics 2021; 37:4048-4055. [PMID: 34117875 PMCID: PMC9502150 DOI: 10.1093/bioinformatics/btab408] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2020] [Revised: 04/30/2021] [Accepted: 06/11/2021] [Indexed: 12/22/2022] Open
Abstract
Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this article, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability and implementation All the data used for preparing the results of this paper can be found with 10.5281/zenodo.4902332. PuffAligner is a free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Mohsen Zakeri
- Computer Science Department, University of Maryland, College Park, USA
| | - Rob Patro
- Computer Science Department, University of Maryland, College Park, USA
| |
Collapse
|
32
|
Denti L, Pirola Y, Previtali M, Ceccato T, Della Vedova G, Rizzi R, Bonizzoni P. Shark: fishing relevant reads in an RNA-Seq sample. Bioinformatics 2021; 37:464-472. [PMID: 32926128 PMCID: PMC8088329 DOI: 10.1093/bioinformatics/btaa779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2020] [Revised: 08/17/2020] [Accepted: 09/02/2020] [Indexed: 11/19/2022] Open
Abstract
Motivation Recent advances in high-throughput RNA-Seq technologies allow to produce massive datasets. When a study focuses only on a handful of genes, most reads are not relevant and degrade the performance of the tools used to analyze the data. Removing irrelevant reads from the input dataset leads to improved efficiency without compromising the results of the study. Results We introduce a novel computational problem, called gene assignment and we propose an efficient alignment-free approach to solve it. Given an RNA-Seq sample and a panel of genes, a gene assignment consists in extracting from the sample, the reads that most probably were sequenced from those genes. The problem becomes more complicated when the sample exhibits evidence of novel alternative splicing events. We implemented our approach in a tool called Shark and assessed its effectiveness in speeding up differential splicing analysis pipelines. This evaluation shows that Shark is able to significantly improve the performance of RNA-Seq analysis tools without having any impact on the final results. Availability and implementation The tool is distributed as a stand-alone module and the software is freely available at https://github.com/AlgoLab/shark. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Luca Denti
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Yuri Pirola
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Marco Previtali
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Tamara Ceccato
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Gianluca Della Vedova
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Raffaella Rizzi
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano 20126, Italy
| |
Collapse
|
33
|
Guo J, Pang E, Song H, Lin K. A tri-tuple coordinate system derived for fast and accurate analysis of the colored de Bruijn graph-based pangenomes. BMC Bioinformatics 2021; 22:282. [PMID: 34044757 PMCID: PMC8161984 DOI: 10.1186/s12859-021-04149-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Accepted: 04/25/2021] [Indexed: 11/25/2022] Open
Abstract
Background With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved assemblies of genomic sequences have been derived, from which there will be great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genome information intuitively, such as the linear reference genome. Thus, efficiently and accurately analyzing the genome graph spatial structure and coordinating the information remains a substantial challenge. Results We developed a new method, a colored superbubble (cSupB), that can overcome the complexity of graphs and organize a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. Additionally, cSupB provides a novel method that utilizes complete topological information and efficiently detects small indels (< 50 bp) for highly similar samples, which can be validated by simulated datasets. Moreover, we demonstrated that cSupB can adapt to the complex cycle structure. Conclusions Although the solution is made suitable for increasingly complex genome graphs by relaxing the constraint, the directed acyclic graph, the motif cSupB and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C + + program for implementing our method that is available at https://github.com/eggleader/cSupB. Supplementary information The online version contains supplementary material available at 10.1186/s12859-021-04149-w.
Collapse
Affiliation(s)
- Jindan Guo
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Erli Pang
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Hongtao Song
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Kui Lin
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China.
| |
Collapse
|
34
|
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021; 22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
Collapse
Affiliation(s)
- Karel Břinda
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
- Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
| | - Michael Baym
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
| | - Gregory Kucherov
- CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
Collapse
|
35
|
Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
| | - Paul Medevedev
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
- Department of Biochemistry and Molecular Biology, Penn State, University Park, State College, PA, USA
- Center for Computational Biology and Bioinformatics, Penn State, University Park, State College, PA, USA
| |
Collapse
|
36
|
Patro R, Salmela L. Algorithms meet sequencing technologies - 10th edition of the RECOMB-Seq workshop. iScience 2021; 24:101956. [PMID: 33437938 PMCID: PMC7788091 DOI: 10.1016/j.isci.2020.101956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
DNA and RNA sequencing is a core technology in biological and medical research. The high throughput of these technologies and the consistent development of new experimental assays and biotechnologies demand the continuous development of methods to analyze the resulting data. The RECOMB Satellite Workshop on Massively Parallel Sequencing brings together leading researchers in computational genomics to discuss emerging frontiers in algorithm development for massively parallel sequencing data. The 10th meeting in this series, RECOMB-Seq 2020, was scheduled to be held in Padua, Italy, but due to the ongoing COVID-19 pandemic, the meeting was carried out virtually instead. The online workshop featured keynote talks by Paola Bonizzoni and Zamin Iqbal, two highlight talks, ten regular talks, and three short talks. Seven of the works presented in the workshop are featured in this edition of iScience, and many of the talks are available online in the RECOMB-Seq 2020 YouTube channel.
Collapse
Affiliation(s)
- Rob Patro
- Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, MD, USA
| | - Leena Salmela
- Department of Computer Science and Helsinki Institute for Information Technology HIIT, University of Helsinki, Helsinki, Finland
| |
Collapse
|
37
|
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021; 31:1-12. [PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2019] [Accepted: 09/14/2020] [Indexed: 12/19/2022]
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Collapse
Affiliation(s)
- Camille Marchet
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland
| | - Paul Medvedev
- Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Mikaël Salson
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
| | - Rayan Chikhi
- Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France
| |
Collapse
|
38
|
Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020; 21:265. [PMID: 33066802 PMCID: PMC7568353 DOI: 10.1186/s13059-020-02168-z] [Citation(s) in RCA: 219] [Impact Index Per Article: 43.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Accepted: 09/23/2020] [Indexed: 12/21/2022] Open
Abstract
The recent advances in sequencing technologies enable the assembly of individual genomes to the quality of the reference genome. How to integrate multiple genomes from the same species and make the integrated representation accessible to biologists remains an open challenge. Here, we propose a graph-based data model and associated formats to represent multiple genomes while preserving the coordinate of the linear reference genome. We implement our ideas in the minigraph toolkit and demonstrate that we can efficiently construct a pangenome graph and compactly encode tens of thousands of structural variants missing from the current reference genome.
Collapse
Affiliation(s)
- Heng Li
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA.
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA.
| | - Xiaowen Feng
- Department of Data Sciences, Dana-Farber Cancer Institute, Boston, 02215, MA, USA
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA
| | - Chong Chu
- Department of Biomedical Informatics, Harvard Medical School, Boston, 02215, MA, USA
| |
Collapse
|
39
|
Srivastava A, Malik L, Sarkar H, Zakeri M, Almodaresi F, Soneson C, Love MI, Kingsford C, Patro R. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol 2020; 21:239. [PMID: 32894187 PMCID: PMC7487471 DOI: 10.1186/s13059-020-02151-8] [Citation(s) in RCA: 91] [Impact Index Per Article: 18.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2019] [Accepted: 08/19/2020] [Indexed: 01/23/2023] Open
Abstract
Background The accuracy of transcript quantification using RNA-seq data depends on many factors, such as the choice of alignment or mapping method and the quantification model being adopted. While the choice of quantification model has been shown to be important, considerably less attention has been given to comparing the effect of various read alignment approaches on quantification accuracy. Results We investigate the influence of mapping and alignment on the accuracy of transcript quantification in both simulated and experimental data, as well as the effect on subsequent differential expression analysis. We observe that, even when the quantification model itself is held fixed, the effect of choosing a different alignment methodology, or aligning reads using different parameters, on quantification estimates can sometimes be large and can affect downstream differential expression analyses as well. These effects can go unnoticed when assessment is focused too heavily on simulated data, where the alignment task is often simpler than in experimentally acquired samples. We also introduce a new alignment methodology, called selective alignment, to overcome the shortcomings of lightweight approaches without incurring the computational cost of traditional alignment. Conclusion We observe that, on experimental datasets, the performance of lightweight mapping and alignment-based approaches varies significantly, and highlight some of the underlying factors. We show this variation both in terms of quantification and downstream differential expression analysis. In all comparisons, we also show the improved performance of our proposed selective alignment method and suggest best practices for performing RNA-seq quantification.
Collapse
Affiliation(s)
- Avi Srivastava
- Department of Computer Science, Stony Brook University, Stony Brook, USA
| | - Laraib Malik
- Department of Computer Science, Stony Brook University, Stony Brook, USA
| | - Hirak Sarkar
- Department of Computer Science, University of Maryland, College Park, USA
| | - Mohsen Zakeri
- Department of Computer Science, University of Maryland, College Park, USA
| | - Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, USA
| | - Charlotte Soneson
- Friedrich Miescher Institute for Biomedical Research, Basel, Switzerland.,SIB Swiss Institute of Bioinformatics, Basel, Switzerland
| | - Michael I Love
- Department of Biostatistics, University of North Carolina at Chapel Hill, Chapel Hill, USA.,Department of Genetics, University of North Carolina at Chapel Hill, Chapel Hill, USA
| | - Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, USA.
| |
Collapse
|
40
|
Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters. Proc Natl Acad Sci U S A 2020; 117:16961-16968. [PMID: 32641514 PMCID: PMC7382288 DOI: 10.1073/pnas.1903436117] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
Collapse
|
41
|
Minkin I, Medvedev P. Scalable Pairwise Whole-Genome Homology Mapping of Long Genomes with BubbZ. iScience 2020; 23:101224. [PMID: 32563153 PMCID: PMC7303978 DOI: 10.1016/j.isci.2020.101224] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2020] [Revised: 05/25/2020] [Accepted: 05/28/2020] [Indexed: 11/17/2022] Open
Abstract
Pairwise whole-genome homology mapping is the problem of finding all pairs of homologous intervals between a pair of genomes. As the number of available whole genomes has been rising dramatically in the last few years, there has been a need for more scalable homology mappers. In this paper, we develop an algorithm (BubbZ) for computing whole-genome pairwise homology mappings, especially in the context of all-to-all comparison for multiple genomes. BubbZ is based on an algorithm for computing chains in compacted de Bruijn graphs. We evaluate BubbZ on simulated datasets, a dataset composed of 16 long mouse genomes, and a large dataset of 1,600 Salmonella genomes. We show up to approximately an order of magnitude speed improvement, compared with MashMap2 and Minimap2, while retaining similar accuracy.
Collapse
Affiliation(s)
- Ilia Minkin
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA; Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA 16802, USA.
| |
Collapse
|
42
|
Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. J Comput Biol 2020; 27:485-499. [PMID: 32176522 PMCID: PMC7185321 DOI: 10.1089/cmb.2019.0322] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).
Collapse
Affiliation(s)
- Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, Maryland
| | - Prashant Pandey
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
| | - Michael Ferdman
- Department of Computer Science, Stony Brook University, Stony Brook, New York
| | - Rob Johnson
- Department of Computer Science, Stony Brook University, Stony Brook, New York
- VMware Research, Palo Alto, California
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland
| |
Collapse
|
43
|
Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. QUANTITATIVE BIOLOGY 2019. [DOI: 10.1007/s40484-019-0181-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|