Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Pibiri GE. Sparse and skew hashing of K-mers. Bioinformatics 2022;38:i185-i194. [PMID: 35758794 PMCID: PMC9235479 DOI: 10.1093/bioinformatics/btac245] [Citation(s) in RCA: 26] [Impact Index Per Article: 8.7] [Indexed: 11/12/2022] Open



Cited by Other Article(s)

Groot Koerkamp R, Liu D, Pibiri GE. The open-closed mod-minimizer algorithm. Algorithms Mol Biol 2025;20:4. [PMID: 40098006 PMCID: PMC11912762 DOI: 10.1186/s13015-025-00270-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2024] [Accepted: 01/28/2025] [Indexed: 03/19/2025] Open

Abstract

Sampling algorithms that deterministically select a subset of k-mers are an important building block in bioinformatics applications. For example, they are used to index large textual collections, like DNA, and to compare sequences quickly. In such applications, a sampling algorithm is required to select one k-mer out of every window of w consecutive k-mers. The folklore and most used scheme is the random minimizer that selects the smallest k-mer in the window according to some random order. This scheme is remarkably simple and versatile, and has a density (expected fraction of selected k-mers) of 2/(w + 1). In practice, lower density leads to faster methods and smaller indexes, and it turns out that the random minimizer is not the best one can do. Indeed, some schemes are known to approach optimal density 1/w when k → ∞, like the recently introduced mod-minimizer (Groot Koerkamp and Pibiri, WABI 2024). In this work, we study methods that achieve low density when k ≤ w. In this small-k regime, a practical method with provably better density than the random minimizer is the miniception (Zheng et al., Bioinformatics 2021). This method can be elegantly described as sampling the smallest closed syncmer (Edgar, PeerJ 2021) in the window according to some random order. We show that extending the miniception to prefer sampling open syncmers yields much better density. This new method, the open-closed minimizer, offers improved density for small k ≤ w while being as fast to compute as the random minimizer. Compared to methods based on decycling sets, which achieve very low density in the small-k regime, our method has comparable density while being computationally simpler and intuitive. Furthermore, we extend the mod-minimizer to improve density of any scheme that works well for small k to also work well when k > w is large. We hence obtain the open-closed mod-minimizer, a practical method that improves over the mod-minimizer for all k.
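
As an illustration of the baseline described in this abstract, the following is a minimal Python sketch of a random minimizer: in every window of w consecutive k-mers it samples the k-mer that is smallest under a pseudo-random order. It is not the authors' implementation. The function names, the use of a seeded random rank per distinct k-mer, and the leftmost tie-breaking rule are assumptions made for this example only.

```python
import random


def random_minimizer_positions(seq, k, w, seed=0):
    """Return sorted start positions of the k-mers sampled by a random minimizer.

    Illustrative sketch only: the 'random order' is simulated by giving each
    distinct k-mer a pseudo-random rank; ties are broken by leftmost position.
    """
    rng = random.Random(seed)
    rank_of = {}  # distinct k-mer -> pseudo-random rank

    n_kmers = len(seq) - k + 1
    ranks = []
    for i in range(n_kmers):
        kmer = seq[i:i + k]
        if kmer not in rank_of:
            rank_of[kmer] = rng.random()
        ranks.append(rank_of[kmer])

    sampled = set()
    # Slide a window over every run of w consecutive k-mers.
    for start in range(n_kmers - w + 1):
        window = range(start, start + w)
        sampled.add(min(window, key=lambda i: (ranks[i], i)))
    return sorted(sampled)


def density(seq, k, w, seed=0):
    """Fraction of k-mers in seq that are sampled."""
    n_kmers = len(seq) - k + 1
    return len(random_minimizer_positions(seq, k, w, seed)) / n_kmers
```

On a long random DNA string with k well above w, the measured density should land near the 2/(w + 1) figure quoted in the abstract, and consecutive sampled positions are never more than w apart, which is the window guarantee. This quadratic-time loop is meant for readability; practical implementations use a sliding-window minimum instead.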


Singh NP, Khan J, Patro R. Alevin-fry-atac enables rapid and memory frugal mapping of single-cell ATAC-seq data using virtual colors for accurate genomic pseudoalignment. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2025:2024.11.27.625771. [PMID: 39677745 PMCID: PMC11642815 DOI: 10.1101/2024.11.27.625771] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]

Abstract

Ultrafast mapping of short reads via lightweight mapping techniques such as pseudoalignment has significantly accelerated transcriptomic and metagenomic analyses, often with minimal accuracy loss compared to alignment-based methods. However, applying pseudoalignment to large genomic references, like chromosomes, is challenging due to their size and repetitive sequences. We introduce a new and modified pseudoalignment scheme that partitions each reference into "virtual colors". These are essentially overlapping bins of fixed maximal extent on the reference sequences that are treated as distinct "colors" from the perspective of the pseudoalignment algorithm. We apply this modified pseudoalignment procedure to process and map single-cell ATAC-seq data in our new tool alevin-fry-atac. We compare alevin-fry-atac to both Chromap and Cell Ranger ATAC. Alevin-fry-atac is highly scalable and, when using 32 threads, is approximately 2.8 times faster than Chromap (the second fastest approach) while using approximately one third of the memory and mapping slightly more reads. The resulting peaks and clusters generated from alevin-fry-atac show high concordance with those obtained from both Chromap and the Cell Ranger ATAC pipeline, demonstrating that virtual color-enhanced pseudoalignment directly to the genome provides a fast, memory-frugal, and accurate alternative to existing approaches for single-cell ATAC-seq processing. The development of alevin-fry-atac brings single-cell ATAC-seq processing into a unified ecosystem with single-cell RNA-seq processing (via alevin-fry) to work toward providing a truly open alternative to many of the varied capabilities of CellRanger. Furthermore, our modified pseudoalignment approach should be easily applicable and extendable to other genome-centric mapping-based tasks and modalities such as standard DNA-seq, DNase-seq, ChIP-seq and Hi-C.


Vicedomini R, Andreace F, Dufresne Y, Chikhi R, Duitama González C. MUSET: set of utilities for constructing abundance unitig matrices from sequencing data. Bioinformatics 2025;41:btaf054. [PMID: 39898792 PMCID: PMC11897428 DOI: 10.1093/bioinformatics/btaf054] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/12/2024] [Revised: 12/20/2024] [Accepted: 01/30/2025] [Indexed: 02/04/2025] Open

Rouzé T, Martayan I, Marchet C, Limasset A. Fractional hitting sets for efficient multiset sketching. Algorithms Mol Biol 2025;20:1. [PMID: 39923117 PMCID: PMC11807336 DOI: 10.1186/s13015-024-00268-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/31/2023] [Accepted: 12/01/2024] [Indexed: 02/10/2025] Open

Kille B, Groot Koerkamp R, McAdams D, Liu A, Treangen TJ. A near-tight lower bound on the density of forward sampling schemes. Bioinformatics 2024;41:btae736. [PMID: 39666942 PMCID: PMC11676336 DOI: 10.1093/bioinformatics/btae736] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/09/2024] [Revised: 11/16/2024] [Accepted: 12/10/2024] [Indexed: 12/14/2024] Open

Levallois V, Andreace F, Le Gal B, Dufresne Y, Peterlongo P. The backpack quotient filter: A dynamic and space-efficient data structure for querying k-mers with abundance. iScience 2024;27:111435. [PMID: 39720533 PMCID: PMC11667073 DOI: 10.1016/j.isci.2024.111435] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/23/2024] [Revised: 08/28/2024] [Accepted: 11/18/2024] [Indexed: 12/26/2024] Open

Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024;23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open

Mouratidis I, Baltoumas FA, Chantzi N, Patsakis M, Chan CS, Montgomery A, Konnaris MA, Aplakidou E, Georgakopoulos GC, Das A, Chartoumpekis DV, Kovac J, Pavlopoulos GA, Georgakopoulos-Soares I. kmerDB: A database encompassing the set of genomic and proteomic sequence information for each species. Comput Struct Biotechnol J 2024;23:1919-1928. [PMID: 38711760 PMCID: PMC11070822 DOI: 10.1016/j.csbj.2024.04.050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/12/2023] [Revised: 04/17/2024] [Accepted: 04/18/2024] [Indexed: 05/08/2024] Open

Affiliation(s)

Ioannis Mouratidis: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA
Fotis A. Baltoumas: Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece
Nikol Chantzi: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Michail Patsakis: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Candace S.Y. Chan: Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
Austin Montgomery: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Maxwell A. Konnaris: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA; Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, PA, USA; Department of Statistics, The Pennsylvania State University, University Park, PA, USA
Eleni Aplakidou: Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece; Department of Basic Sciences, School of Medicine, University of Crete, Heraklion, Greece
George C. Georgakopoulos: National Technical University of Athens, School of Electrical and Computer Engineering, Athens, Greece
Anshuman Das: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
Dionysios V. Chartoumpekis: Service of Endocrinology, Diabetology and Metabolism, Lausanne University Hospital, Lausanne, Switzerland
Jasna Kovac: Department of Food Science, The Pennsylvania State University, University Park, PA 16802, USA
Georgios A. Pavlopoulos: Institute for Fundamental Biomedical Research, BSRC "Alexander Fleming", Vari, 16672, Greece; Center for New Biotechnologies and Precision Medicine, School of Medicine, National and Kapodistrian University of Athens, Athens, 11527, Greece
Ilias Georgakopoulos-Soares: Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA


Kille B, Koerkamp RG, McAdams D, Liu A, Treangen TJ. A near-tight lower bound on the density of forward sampling schemes. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.09.06.611668. [PMID: 39605515 PMCID: PMC11601301 DOI: 10.1101/2024.09.06.611668] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/29/2024]

Ndiaye M, Prieto-Baños S, Fitzgerald LM, Yazdizadeh Kharrazi A, Oreshkov S, Dessimoz C, Sedlazeck FJ, Glover N, Majidian S. When less is more: sketching with minimizers in genomics. Genome Biol 2024;25:270. [PMID: 39402664 PMCID: PMC11472564 DOI: 10.1186/s13059-024-03414-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/15/2023] [Accepted: 10/01/2024] [Indexed: 10/19/2024] Open

Campanelli A, Pibiri GE, Fan J, Patro R. Where the Patterns Are: Repetition-Aware Compression for Colored de Bruijn Graphs. J Comput Biol 2024;31:1022-1044. [PMID: 39381838 PMCID: PMC11631793 DOI: 10.1089/cmb.2024.0714] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/10/2024] Open

Abrar H, Medvedev P. PLA-index: A k-mer Index Exploiting Rank Curve Linearity. LIPICS : LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS 2024;312:13. [PMID: 40297743 PMCID: PMC12037174 DOI: 10.4230/lipics.wabi.2024.13] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/30/2025]

Campanelli A, Pibiri GE, Fan J, Patro R. Where the patterns are: repetition-aware compression for colored de Bruijn graphs. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.07.09.602727. [PMID: 39026859 PMCID: PMC11257547 DOI: 10.1101/2024.07.09.602727] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/20/2024]

Martayan I, Cazaux B, Limasset A, Marchet C. Conway-Bromage-Lyndon (CBL): an exact, dynamic representation of k-mer sets. Bioinformatics 2024;40:i48-i57. [PMID: 38940123 PMCID: PMC11211824 DOI: 10.1093/bioinformatics/btae217] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/29/2024] Open

Rahman A, Dufresne Y, Medvedev P. Compression algorithm for colored de Bruijn graphs. Algorithms Mol Biol 2024;19:20. [PMID: 38797858 PMCID: PMC11129398 DOI: 10.1186/s13015-024-00254-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/14/2023] [Accepted: 01/24/2024] [Indexed: 05/29/2024] Open

Díaz-Domínguez D, Leinonen M, Salmela L. Space-efficient computation of k-mer dictionaries for large values of k. Algorithms Mol Biol 2024;19:14. [PMID: 38581000 PMCID: PMC10996146 DOI: 10.1186/s13015-024-00259-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/19/2023] [Accepted: 03/02/2024] [Indexed: 04/07/2024] Open

Fan J, Khan J, Singh NP, Pibiri GE, Patro R. Fulgor: a fast and compact k-mer index for large-scale matching and color queries. Algorithms Mol Biol 2024;19:3. [PMID: 38254124 PMCID: PMC10810250 DOI: 10.1186/s13015-024-00251-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 10.0] [Received: 10/31/2023] [Accepted: 01/03/2024] [Indexed: 01/24/2024] Open

Abstract

The problem of sequence identification or matching (determining the subset of reference sequences from a given collection that are likely to contain a short, queried nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pangenome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. This poses the threefold challenge of representing the reference collection with a data structure that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe an efficient colored de Bruijn graph index, arising as the combination of a k-mer dictionary with a compressed inverted index. The proposed index takes full advantage of the fact that unitigs in the colored compacted de Bruijn graph are monochromatic (i.e., all k-mers in a unitig have the same set of references of origin, or color). Specifically, the unitigs are kept in the dictionary in color order, thereby allowing for the encoding of the map from k-mers to their colors in as little as 1 + o(1) bits per unitig. Hence, one color per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for integer lists, the index achieves very small space. We implement these methods in a tool called Fulgor, and conduct an extensive experimental analysis to demonstrate the improvement of our tool over previous solutions. For example, compared to Themisto, the strongest competitor in terms of index space vs. query time trade-off, Fulgor requires significantly less space (up to 43% less space for a collection of 150,000 Salmonella enterica genomes), is at least twice as fast for color queries, and is 2-6× faster to construct.


Rahman A, Dufresne Y, Medvedev P. Compression Algorithm for Colored de Bruijn Graphs. LIPICS : LEIBNIZ INTERNATIONAL PROCEEDINGS IN INFORMATICS 2023;273:17. [PMID: 38712341 PMCID: PMC11071130 DOI: 10.4230/lipics.wabi.2023.17] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/08/2024]

Chicco D, Ferraro Petrillo U, Cattaneo G. Ten quick tips for bioinformatics analyses using an Apache Spark distributed computing environment. PLoS Comput Biol 2023;19:e1011272. [PMID: 37471333 PMCID: PMC10358940 DOI: 10.1371/journal.pcbi.1011272] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/22/2023] Open

Pellow D, Pu L, Ekim B, Kotlar L, Berger B, Shamir R, Orenstein Y. Efficient minimizer orders for large values of k using minimum decycling sets. Genome Res 2023;33:1154-1161. [PMID: 37558282 PMCID: PMC10538483 DOI: 10.1101/gr.277644.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/05/2023] [Accepted: 04/20/2023] [Indexed: 08/11/2023]

Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023;39:i252-i259. [PMID: 37387170 DOI: 10.1093/bioinformatics/btad225] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 07/01/2023] Open

Pibiri GE, Shibuya Y, Limasset A. Locality-preserving minimal perfect hashing of k-mers. Bioinformatics 2023;39:i534-i543. [PMID: 37387137 PMCID: PMC10311298 DOI: 10.1093/bioinformatics/btad219] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Indexed: 07/01/2023] Open

Pibiri GE. On weighted k-mer dictionaries. Algorithms Mol Biol 2023;18:3. [PMID: 37328897 DOI: 10.1186/s13015-023-00226-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/30/2023] [Accepted: 05/13/2023] [Indexed: 06/18/2023] Open

Schmidt S, Khan S, Alanko JN, Pibiri GE, Tomescu AI. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 2023;24:136. [PMID: 37296461 PMCID: PMC10251615 DOI: 10.1186/s13059-023-02968-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/16/2021] [Accepted: 05/10/2023] [Indexed: 06/12/2023] Open

Fan J, Singh NP, Khan J, Pibiri GE, Patro R. Fulgor: A fast and compact k-mer index for large-scale matching and color queries. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.05.09.539895. [PMID: 37214944 PMCID: PMC10197524 DOI: 10.1101/2023.05.09.539895] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/24/2023]

Abstract

The problem of sequence identification or matching (determining the subset of references from a given collection that are likely to contain a query nucleotide sequence) is relevant for many important tasks in Computational Biology, such as metagenomics and pan-genome analysis. Due to the complex nature of such analyses and the large scale of the reference collections, a resource-efficient solution to this problem is of utmost importance. The reference collection should therefore be pre-processed into an index for fast queries. This poses the threefold challenge of designing an index that is efficient to query, has light memory usage, and scales well to large collections. To solve this problem, we describe how recent advancements in associative, order-preserving, k-mer dictionaries can be combined with a compressed inverted index to implement a fast and compact colored de Bruijn graph data structure. This index takes full advantage of the fact that unitigs in the colored de Bruijn graph are monochromatic (all k-mers in a unitig have the same set of references of origin, or "color"), leveraging the order-preserving property of its dictionary. In fact, k-mers are kept in unitig order by the dictionary, thereby allowing for the encoding of the map from k-mers to their inverted lists in as little as 1 + o(1) bits per unitig. Hence, one inverted list per unitig is stored in the index with almost no space/time overhead. By combining this property with simple but effective compression methods for inverted lists, the index achieves very small space. We implement these methods in a tool called Fulgor. Compared to Themisto, the prior state of the art, Fulgor indexes a heterogeneous collection of 30,691 bacterial genomes in 3.8× less space, a collection of 150,000 Salmonella enterica genomes in approximately 2× less space, is at least twice as fast for color queries, and is 2-6× faster to construct.


Schmidt S, Alanko JN. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time. RESEARCH SQUARE 2023:rs.3.rs-2581995. [PMID: 36824947 PMCID: PMC9949180 DOI: 10.21203/rs.3.rs-2581995/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/17/2023]

He D, Soneson C, Patro R. Understanding and evaluating ambiguity in single-cell and single-nucleus RNA-sequencing. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2023:2023.01.04.522742. [PMID: 36711921 PMCID: PMC9881993 DOI: 10.1101/2023.01.04.522742] [Citation(s) in RCA: 8] [Impact Index Per Article: 4.0] [Indexed: 01/06/2023]

Abstract

Recently, a new modification has been proposed by Hjörleifsson and Sullivan et al. to the model used to classify the splicing status of reads (as spliced (mature), unspliced (nascent), or ambiguous) in single-cell and single-nucleus RNA-seq data. Here, we evaluate both the theoretical basis and practical implementation of the proposed method. The proposed method is highly conservative, and therefore, unlikely to mischaracterize reads as spliced (mature) or unspliced (nascent) when they are not. However, we find that it leaves a large fraction of reads classified as ambiguous, and, in practice, allocates these ambiguous reads in an all-or-nothing manner, and differently between single-cell and single-nucleus RNA-seq data. Further, as implemented in practice, the ambiguous classification is implicit and based on the index against which the reads are mapped, which leads to several drawbacks compared to methods that consider both spliced (mature) and unspliced (nascent) mapping targets simultaneously, for example, the loss of the ability to use confidently assigned reads to rescue ambiguous reads based on shared UMIs and gene targets. Nonetheless, we show that these conservative assignment rules can be obtained directly in existing approaches simply by altering the set of targets that are indexed. To this end, we introduce the spliceu reference and show that its use with alevin-fry recapitulates the more conservative proposed classification. We also observe that, on experimental data, and under the proposed allocation rules for ambiguous UMIs, the difference between the proposed classification scheme and existing conventions appears much smaller than previously reported. We demonstrate the use of the new piscem index for mapping simultaneously against spliced (mature) and unspliced (nascent) targets, allowing classification against the full nascent and mature transcriptome in human or mouse in <3GB of memory. Finally, we discuss the potential of incorporating probabilistic evidence into the inference of splicing status, and suggest that it may provide benefits beyond what can be obtained from discrete classification of UMIs as splicing-ambiguous.
