Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023;5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open

For:	Firtina C, Park J, Alser M, Kim JS, Cali D, Shahroodi T, Ghiasi N, Singh G, Kanellopoulos K, Alkan C, Mutlu O. BLEND: a fast, memory-efficient and accurate mechanism to find fuzzy seed matches in genome analysis. NAR Genom Bioinform 2023;5:lqad004. [PMID: 36685727 PMCID: PMC9853099 DOI: 10.1093/nargab/lqad004] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2022] [Revised: 12/16/2022] [Accepted: 01/10/2023] [Indexed: 01/22/2023] Open

Number

Cited by Other Article(s)

Kong T, Wang Y, Liu B. xRead: a coverage-guided approach for scalable construction of read overlapping graph. Gigascience 2025;14:giaf007. [PMID: 39960665 PMCID: PMC11831799 DOI: 10.1093/gigascience/giaf007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 11/29/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025] Open

Cavlak MB, Singh G, Alser M, Firtina C, Lindegger J, Sadrosadati M, Mansouri Ghiasi N, Alkan C, Mutlu O. TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering. Front Genet 2024;15:1429306. [PMID: 39529848 PMCID: PMC11551021 DOI: 10.3389/fgene.2024.1429306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/07/2024] [Accepted: 09/30/2024] [Indexed: 11/16/2024] Open

Baudeau T, Sahlin K. Improved sub-genomic RNA prediction with the ARTIC protocol. Nucleic Acids Res 2024;52:e82. [PMID: 39149898 PMCID: PMC11417393 DOI: 10.1093/nar/gkae687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/21/2023] [Revised: 07/18/2024] [Accepted: 07/25/2024] [Indexed: 08/17/2024] Open

Sharma GK, Sharma R, Joshi K, Qureshi S, Mathur S, Sinha S, Chatterjee S, Nunia V. Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection. Brief Bioinform 2024;25:bbae545. [PMID: 39441245 PMCID: PMC11497845 DOI: 10.1093/bib/bbae545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/06/2024] [Revised: 09/21/2024] [Accepted: 10/11/2024] [Indexed: 10/25/2024] Open

Abstract

Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, the approach focused on identifying dissimilar regions among closely-related organisms poses challenges as it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping through a phylogenetic tree ensures a low number of cross-contamination and false positives. We have downloaded complete taxonomy data from Taxadb database and sequence data from National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned over the graph nodes, k-mers were created for target and non-target nodes and search was performed over the graph using the depth first search algorithm. In a memory efficient alternative NoSQL approach, we created a collection of Refseq sequences in MongoDB database using tax-id and path of FASTA files. We queried the MongoDB collection for the target and non-target sequences. In both the approaches, we used an alignment free sliding window k-mer-based procedure that quickly compares k-mers of target and non-target sequences and returns unique sequences that are not present in the non-target. We have validated our algorithm with target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.

Collapse

Wang S, Jiang Y, Che L, Wang RH, Li SC. Enhancing insights into diseases through horizontal gene transfer event detection from gut microbiome. Nucleic Acids Res 2024;52:e61. [PMID: 38884260 DOI: 10.1093/nar/gkae515] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/28/2023] [Revised: 04/23/2024] [Accepted: 06/04/2024] [Indexed: 06/18/2024] Open

Firtina C, Soysal M, Lindegger J, Mutlu O. RawHash2: mapping raw nanopore signals using hash-based seeding and adaptive quantization. Bioinformatics 2024;40:btae478. [PMID: 39078113 PMCID: PMC11333567 DOI: 10.1093/bioinformatics/btae478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/03/2023] [Revised: 07/04/2024] [Accepted: 07/29/2024] [Indexed: 07/31/2024] Open

Xu W, Hsu PK, Moshiri N, Yu S, Rosing T. HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors. Bioinformatics 2024;40:btae452. [PMID: 39012512 PMCID: PMC11281827 DOI: 10.1093/bioinformatics/btae452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2024] [Revised: 07/09/2024] [Accepted: 07/12/2024] [Indexed: 07/17/2024] Open

Abstract

MOTIVATION

Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen that improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages the emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables the efficient sketching and ANI estimation for massive genome collections.

RESULTS

We evaluate HyperGen 's sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy.

AVAILABILITY

A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.

Collapse

Karami M, Soltani Mohammadi A, Martin M, Ekim B, Shen W, Guo L, Xu M, Pibiri GE, Patro R, Sahlin K. Designing efficient randstrobes for sequence similarity analyses. Bioinformatics 2024;40:btae187. [PMID: 38579261 PMCID: PMC11034988 DOI: 10.1093/bioinformatics/btae187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2023] [Revised: 02/23/2024] [Accepted: 04/04/2024] [Indexed: 04/07/2024] Open

Abstract

MOTIVATION

Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences leading to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), has been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy.

RESULTS

In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes.

AVAILABILITY AND IMPLEMENTATION

All methods and evaluation benchmarks are available in a public Github repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.

Collapse

Greenberg G, Ravi AN, Shomorony I. LexicHash: sequence similarity estimation via lexicographic comparison of hashes. Bioinformatics 2023;39:btad652. [PMID: 37878809 PMCID: PMC10628434 DOI: 10.1093/bioinformatics/btad652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/18/2023] [Revised: 10/11/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open

Kille B, Garrison E, Treangen TJ, Phillippy AM. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. Bioinformatics 2023;39:btad512. [PMID: 37603771 PMCID: PMC10505501 DOI: 10.1093/bioinformatics/btad512] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2023] [Revised: 07/19/2023] [Accepted: 08/18/2023] [Indexed: 08/23/2023] Open

Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. BIOINFORMATICS ADVANCES 2023;3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]

Ekim B, Sahlin K, Medvedev P, Berger B, Chikhi R. Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Res 2023;33:1188-1197. [PMID: 37399256 PMCID: PMC10538364 DOI: 10.1101/gr.277679.123] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/08/2023] [Accepted: 06/26/2023] [Indexed: 07/05/2023]

Maier BD, Sahlin K. Entropy predicts sensitivity of pseudorandom seeds. Genome Res 2023;33:1162-1174. [PMID: 37217253 PMCID: PMC10538493 DOI: 10.1101/gr.277645.123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2023] [Accepted: 05/04/2023] [Indexed: 05/24/2023]

Firtina C, Mansouri Ghiasi N, Lindegger J, Singh G, Cavlak MB, Mao H, Mutlu O. RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes. Bioinformatics 2023;39:i297-i307. [PMID: 37387139 PMCID: PMC10311405 DOI: 10.1093/bioinformatics/btad272] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open

Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023;24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open