Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Almodaresi F, Sarkar H, Srivastava A, Patro R. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 2019;34:i169-i177. [PMID: 29949982 PMCID: PMC6022659 DOI: 10.1093/bioinformatics/bty292] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open

For:	Almodaresi F, Sarkar H, Srivastava A, Patro R. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 2019;34:i169-i177. [PMID: 29949982 PMCID: PMC6022659 DOI: 10.1093/bioinformatics/bty292] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open

Number

Cited by Other Article(s)

Guo LL. Effloc: An Efficient Locating Algorithm for Mass-Occurrence Biological Patterns with FM-Index. J Comput Biol 2025. [PMID: 40314133 DOI: 10.1089/cmb.2024.0925] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 05/03/2025] Open

Singh NP, Khan J, Patro R. Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.11.27.625771. [PMID: 39677745 PMCID: PMC11642815 DOI: 10.1101/2024.11.27.625771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Indexed: 12/17/2024]

Abstract

Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into "virtual colors…. These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac . We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC . Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual colorenhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry ) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger . Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, Chip-seq and Hi-C.

Collapse

Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024;23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open

Affiliation(s)

Ioannis Mouratidis Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
Fotis A. Baltoumas Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
Nikol Chantzi Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Michail Patsakis Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Candace S.Y. Chan Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
Austin Montgomery Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Maxwell A. Konnaris Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA Department of Statistics, The Pennsylvania State University, University Park, PA, USA
Eleni Aplakidou Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
George C. Georgakopoulos National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
Anshuman Das Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Dionysios V. Chartoumpekis Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
Jasna Kovac Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
Georgios A. Pavlopoulos Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
Ilias Georgakopoulos-Soares Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA

Collapse

Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs^{. J Comput Biol 2024;31:1022-1044. [PMID: 39381838 PMCID: PMC11631793 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/10/2024] Open}

Taylor DJ, Eizenga JM, Li Q, Das A, Jenike KM, Kenny EE, Miga KH, Monlong J, McCoy RC, Paten B, Schatz MC. Beyond the Human Genome Project: The Age of Complete Human Genome Sequences and Pangenome References. Annu Rev Genomics Hum Genet 2024;25:77-104. [PMID: 38663087 PMCID: PMC11451085 DOI: 10.1146/annurev-genom-021623-081639] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 08/29/2024]

Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs ^⋆. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 07/20/2024]

Mustafa H, Karasikov M, Mansouri Ghiasi N, Rätsch G, Kahles A. Label-guided seed-chain-extend alignment on annotated De Bruijn graphs. Bioinformatics 2024;40:i337-i346. [PMID: 38940164 PMCID: PMC11211850 DOI: 10.1093/bioinformatics/btae226] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/29/2024] Open

Abstract

MOTIVATION

Exponential growth in sequencing databases has motivated scalable De Bruijn graph-based (DBG) indexing for searching these data, using annotations to label nodes with sample IDs. Low-depth sequencing samples correspond to fragmented subgraphs, complicating finding the long contiguous walks required for alignment queries. Aligners that target single-labelled subgraphs reduce alignment lengths due to fragmentation, leading to low recall for long reads. While some (e.g. label-free) aligners partially overcome fragmentation by combining information from multiple samples, biologically irrelevant combinations in such approaches can inflate the search space or reduce accuracy.

RESULTS

We introduce a new scoring model, 'multi-label alignment' (MLA), for annotated DBGs. MLA leverages two new operations: To promote biologically relevant sample combinations, 'Label Change' incorporates more informative global sample similarity into local scores. To improve connectivity, 'Node Length Change' dynamically adjusts the DBG node length during traversal. Our fast, approximate, yet accurate MLA implementation has two key steps: a single-label seed-chain-extend aligner (SCA) and a multi-label chainer (MLC). SCA uses a traditional scoring model adapting recent chaining improvements to assembly graphs and provides a curated pool of alignments. MLC extracts seed anchors from SCAs alignments, produces multi-label chains using MLA scoring, then finally forms multi-label alignments. We show via substantial improvements in taxonomic classification accuracy that MLA produces biologically relevant alignments, decreasing average weighted UniFrac errors by 63.1%-66.8% and covering 45.5%-47.4% (median) more long-read query characters than state-of-the-art aligners. MLAs runtimes are competitive with label-combining alignment and substantially faster than single-label alignment.

AVAILABILITY AND IMPLEMENTATION

The data, scripts, and instructions for generating our results are available at https://github.com/ratschlab/mla.

Collapse

Shiryev SA, Agarwala R. Indexing and searching petabase-scale nucleotide resources. Nat Methods 2024;21:994-1002. [PMID: 38755321 PMCID: PMC11166510 DOI: 10.1038/s41592-024-02280-z] [Citation(s) in RCA: 8] [Impact Index Per Article: 8.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/18/2023] [Accepted: 04/08/2024] [Indexed: 05/18/2024]

Yu C, Zhao Y, Zhao C, Jin J, Mao K, Wang G. MiniDBG: A Novel and Minimal De Bruijn Graph for Read Mapping. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024;21:129-142. [PMID: 38060353 DOI: 10.1109/tcbb.2023.3340251] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/07/2024]

Pibiri GE, Fan J, Patro R. Meta-colored compacted de Bruijn graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.07.21.550101. [PMID: 37546988 PMCID: PMC10401949 DOI: 10.1101/2023.07.21.550101] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 08/08/2023]

Depuydt L, Renders L, Abeel T, Fostier J. Pan-genome de Bruijn graph using the bidirectional FM-index. BMC Bioinformatics 2023;24:400. [PMID: 37884897 PMCID: PMC10605969 DOI: 10.1186/s12859-023-05531-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023] Open

Abstract

BACKGROUND

Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance.

RESULTS

We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment.

CONCLUSIONS

We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.

Collapse

He D, Patro R. simpleaf: a simple, flexible, and scalable framework for single-cell data processing using alevin-fry. Bioinformatics 2023;39:btad614. [PMID: 37802884 PMCID: PMC10580267 DOI: 10.1093/bioinformatics/btad614] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/03/2023] [Revised: 09/02/2023] [Accepted: 10/05/2023] [Indexed: 10/08/2023] Open

Aylward AJ, Petrus S, Mamerto A, Hartwick NT, Michael TP. PanKmer: k-mer-based and reference-free pangenome analysis. Bioinformatics 2023;39:btad621. [PMID: 37846049 PMCID: PMC10603592 DOI: 10.1093/bioinformatics/btad621] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/31/2023] [Revised: 08/29/2023] [Accepted: 10/13/2023] [Indexed: 10/18/2023] Open

Medvedev P. Theoretical Analysis of Sequencing Bioinformatics Algorithms and Beyond. COMMUNICATIONS OF THE ACM 2023;66:118-125. [PMID: 38736702 PMCID: PMC11087067 DOI: 10.1145/3571723] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 05/14/2024]

Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023;39:i252-i259. [PMID: 37387170 DOI: 10.1093/bioinformatics/btad225] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open

Pibiri GE, Shibuya Y, Limasset A. Locality-preserving minimal perfect hashing of k-mers. Bioinformatics 2023;39:i534-i543. [PMID: 37387137 PMCID: PMC10311298 DOI: 10.1093/bioinformatics/btad219] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open

Pibiri GE. On weighted k-mer dictionaries. Algorithms Mol Biol 2023;18:3. [PMID: 37328897 DOI: 10.1186/s13015-023-00226-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Accepted: 05/13/2023] [Indexed: 06/18/2023] Open

He D, Patro R. simpleaf: A simple, flexible, and scalable framework for single-cell transcriptomics data processing using alevin-fry. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.03.28.534653. [PMID: 37034702 PMCID: PMC10081176 DOI: 10.1101/2023.03.28.534653] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/19/2023]

He D, Soneson C, Patro R. Understanding and evaluating ambiguity in single-cell and single-nucleus RNA-sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.04.522742. [PMID: 36711921 PMCID: PMC9881993 DOI: 10.1101/2023.01.04.522742] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/06/2023]

Abstract

Recently, a new modification has been proposed by Hjörleifsson and Sullivan et al. to the model used to classify the splicing status of reads (as spliced (mature), unspliced (nascent), or ambiguous) in single-cell and single-nucleus RNA-seq data. Here, we evaluate both the theoretical basis and practical implementation of the proposed method. The proposed method is highly-conservative, and therefore, unlikely to mischaracterize reads as spliced (mature) or unspliced (nascent) when they are not. However, we find that it leaves a large fraction of reads classified as ambiguous, and, in practice, allocates these ambiguous reads in an all-or-nothing manner, and differently between single-cell and single-nucleus RNA-seq data. Further, as implemented in practice, the ambiguous classification is implicit and based on the index against which the reads are mapped, which leads to several drawbacks compared to methods that consider both spliced (mature) and unspliced (nascent) mapping targets simultaneously - for example, the ability to use confidently assigned reads to rescue ambiguous reads based on shared UMIs and gene targets. Nonetheless, we show that these conservative assignment rules can be obtained directly in existing approaches simply by altering the set of targets that are indexed. To this end, we introduce the spliceu reference and show that its use with alevin-fry recapitulates the more conservative proposed classification. We also observe that, on experimental data, and under the proposed allocation rules for ambiguous UMIs, the difference between the proposed classification scheme and existing conventions appears much smaller than previously reported. We demonstrate the use of the new piscem index for mapping simultaneously against spliced (mature) and unspliced (nascent) targets, allowing classification against the full nascent and mature transcriptome in human or mouse in <3GB of memory. Finally, we discuss the potential of incorporating probabilistic evidence into the inference of splicing status, and suggest that it may provide benefits beyond what can be obtained from discrete classification of UMIs as splicing-ambiguous.

Collapse

Gibney D, Thankachan SV, Aluru S. On the Hardness of Sequence Alignment on De Bruijn Graphs. J Comput Biol 2022;29:1377-1396. [DOI: 10.1089/cmb.2022.0411] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/05/2022] Open

Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. Genome Res 2022;32:1754-1764. [PMID: 35609994 PMCID: PMC9528980 DOI: 10.1101/gr.276607.122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 05/05/2022] [Indexed: 11/25/2022]

Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022;38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022] Open

Qiu Y, Kingsford C. The effect of genome graph expressiveness on the discrepancy between genome graph distance and string set distance. Bioinformatics 2022;38:i404-i412. [PMID: 35758819 PMCID: PMC9235494 DOI: 10.1093/bioinformatics/btac264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022] Open

Pibiri GE. Sparse and skew hashing of K-mers. Bioinformatics 2022;38:i185-i194. [PMID: 35758794 PMCID: PMC9235479 DOI: 10.1093/bioinformatics/btac245] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Alevin-fry unlocks rapid, accurate and memory-frugal quantification of single-cell RNA-seq data. Nat Methods 2022;19:316-322. [PMID: 35277707 PMCID: PMC8933848 DOI: 10.1038/s41592-022-01408-3] [Citation(s) in RCA: 52] [Impact Index Per Article: 17.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/09/2021] [Accepted: 01/27/2022] [Indexed: 01/19/2023]

Skoufos G, Almodaresi F, Zakeri M, Paulson JN, Patro R, Hatzigeorgiou AG, Vlachos IS. AGAMEMNON: an Accurate metaGenomics And MEtatranscriptoMics quaNtificatiON analysis suite. Genome Biol 2022;23:39. [PMID: 35101114 PMCID: PMC8802518 DOI: 10.1186/s13059-022-02610-4] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2020] [Accepted: 01/03/2022] [Indexed: 12/03/2022] Open

Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2021;38:604-611. [PMID: 34726732 PMCID: PMC8756200 DOI: 10.1093/bioinformatics/btab749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2021] [Revised: 09/27/2021] [Accepted: 10/28/2021] [Indexed: 02/03/2023] Open

Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for k-mers. Bioinformatics 2021;37:2858-2865. [PMID: 33821954 DOI: 10.1093/bioinformatics/btab217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 02/18/2021] [Accepted: 04/01/2021] [Indexed: 02/02/2023] Open

Khan J, Patro R. Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections. Bioinformatics 2021;37:i177-i186. [PMID: 34252958 PMCID: PMC8275350 DOI: 10.1093/bioinformatics/btab309] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open

Abstract

Motivation

The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem.

Results

We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively.

Availability and implementation

Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish.

Supplementary information

Supplementary data are available at Bioinformatics online.

Collapse

Alanko J, Alipanahi B, Settle J, Boucher C, Gagie T. Buffering updates enables efficient dynamic de Bruijn graphs. Comput Struct Biotechnol J 2021;19:4067-4078. [PMID: 34377371 PMCID: PMC8326735 DOI: 10.1016/j.csbj.2021.06.047] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Revised: 06/29/2021] [Accepted: 06/29/2021] [Indexed: 12/24/2022] Open

Almodaresi F, Zakeri M, Patro R. Puffaligner : A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index. Bioinformatics 2021;37:4048-4055. [PMID: 34117875 PMCID: PMC9502150 DOI: 10.1093/bioinformatics/btab408] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2020] [Revised: 04/30/2021] [Accepted: 06/11/2021] [Indexed: 12/22/2022] Open

Denti L, Pirola Y, Previtali M, Ceccato T, Della Vedova G, Rizzi R, Bonizzoni P. Shark: fishing relevant reads in an RNA-Seq sample. Bioinformatics 2021;37:464-472. [PMID: 32926128 PMCID: PMC8088329 DOI: 10.1093/bioinformatics/btaa779] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2020] [Revised: 08/17/2020] [Accepted: 09/02/2020] [Indexed: 11/19/2022] Open

Guo J, Pang E, Song H, Lin K. A tri-tuple coordinate system derived for fast and accurate analysis of the colored de Bruijn graph-based pangenomes. BMC Bioinformatics 2021;22:282. [PMID: 34044757 PMCID: PMC8161984 DOI: 10.1186/s12859-021-04149-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Accepted: 04/25/2021] [Indexed: 11/25/2022] Open

Abstract

Background

With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved assemblies of genomic sequences have been derived, from which there will be great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genome information intuitively, such as the linear reference genome. Thus, efficiently and accurately analyzing the genome graph spatial structure and coordinating the information remains a substantial challenge.

Results

We developed a new method, a colored superbubble (cSupB), that can overcome the complexity of graphs and organize a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. Additionally, cSupB provides a novel method that utilizes complete topological information and efficiently detects small indels (< 50 bp) for highly similar samples, which can be validated by simulated datasets. Moreover, we demonstrated that cSupB can adapt to the complex cycle structure.

Conclusions

Although the solution is made suitable for increasingly complex genome graphs by relaxing the constraint, the directed acyclic graph, the motif cSupB and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C + + program for implementing our method that is available at https://github.com/eggleader/cSupB.

Supplementary information

The online version contains supplementary material available at 10.1186/s12859-021-04149-w.

Collapse

Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021;22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open

Rahman A, Medevedev P. Representation of k-Mer Sets Using Spectrum-Preserving String Sets. J Comput Biol 2021;28:381-394. [PMID: 33290137 PMCID: PMC8066325 DOI: 10.1089/cmb.2020.0431] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/12/2022] Open

Patro R, Salmela L. Algorithms meet sequencing technologies - 10th edition of the RECOMB-Seq workshop. iScience 2021;24:101956. [PMID: 33437938 PMCID: PMC7788091 DOI: 10.1016/j.isci.2020.101956] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open

Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021;31:1-12. [PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2019] [Accepted: 09/14/2020] [Indexed: 12/19/2022]

Li H, Feng X, Chu C. The design and construction of reference pangenome graphs with minigraph. Genome Biol 2020;21:265. [PMID: 33066802 PMCID: PMC7568353 DOI: 10.1186/s13059-020-02168-z] [Citation(s) in RCA: 219] [Impact Index Per Article: 43.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/12/2020] [Accepted: 09/23/2020] [Indexed: 12/21/2022] Open

Srivastava A, Malik L, Sarkar H, Zakeri M, Almodaresi F, Soneson C, Love MI, Kingsford C, Patro R. Alignment and mapping methodology influence transcript abundance estimation. Genome Biol 2020;21:239. [PMID: 32894187 PMCID: PMC7487471 DOI: 10.1186/s13059-020-02151-8] [Citation(s) in RCA: 91] [Impact Index Per Article: 18.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2019] [Accepted: 08/19/2020] [Indexed: 01/23/2023] Open

Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters. Proc Natl Acad Sci U S A 2020;117:16961-16968. [PMID: 32641514 PMCID: PMC7382288 DOI: 10.1073/pnas.1903436117] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open

Minkin I, Medvedev P. Scalable Pairwise Whole-Genome Homology Mapping of Long Genomes with BubbZ. iScience 2020;23:101224. [PMID: 32563153 PMCID: PMC7303978 DOI: 10.1016/j.isci.2020.101224] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2020] [Revised: 05/25/2020] [Accepted: 05/28/2020] [Indexed: 11/17/2022] Open

Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. J Comput Biol 2020;27:485-499. [PMID: 32176522 PMCID: PMC7185321 DOI: 10.1089/cmb.2019.0322] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. QUANTITATIVE BIOLOGY 2019. [DOI: 10.1007/s40484-019-0181-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]