1
Kong T, Wang Y, Liu B. xRead: a coverage-guided approach for scalable construction of read overlapping graph. Gigascience 2025; 14:giaf007. [PMID: 39960665 PMCID: PMC11831799 DOI: 10.1093/gigascience/giaf007] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/29/2024] [Revised: 11/29/2024] [Accepted: 01/10/2025] [Indexed: 02/20/2025] Open
Abstract
BACKGROUND The development of long-read sequencing holds promise for high-quality and comprehensive de novo assembly of various species around the world. However, it is still challenging for assemblers to efficiently handle thousands of genomes, tens of gigabase-level assembly sizes, and terabase-level datasets, which is a bottleneck for large-scale de novo sequencing studies. A major cause is read overlapping graph construction, for which state-of-the-art tools usually require terabyte-level RAM and tens of days for large genomes. Such low performance and scalability are not suited to the numerous samples being sequenced. FINDINGS Herein, we propose xRead, a novel iterative overlapping graph construction approach that achieves high performance, scalability, and yield simultaneously. Under the guidance of its coverage-based model, xRead converts read overlapping into heuristic read-mapping and incremental graph construction tasks with highly controllable RAM usage and faster speed. It enables the processing of very large datasets (such as the 1.28 Tb Ambystoma mexicanum dataset) with less than 64 GB RAM and markedly lower time costs. Moreover, benchmarks suggest that it can produce highly accurate and well-connected overlapping graphs, which also support various kinds of downstream assembly strategies. CONCLUSIONS xRead is able to break through the major bottleneck in graph construction and lays a new foundation for de novo assembly. This tool is suited to handling a large number of datasets from large genomes and may play important roles in many de novo sequencing studies.
Affiliation(s)
- Tangchao Kong
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Yadong Wang
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
- Bo Liu
  - Center for Bioinformatics, Faculty of Computing, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
  - Key Laboratory of Biological Bigdata, Ministry of Education, Harbin Institute of Technology, Harbin, Heilongjiang 150001, China
2
Cavlak MB, Singh G, Alser M, Firtina C, Lindegger J, Sadrosadati M, Mansouri Ghiasi N, Alkan C, Mutlu O. TargetCall: eliminating the wasted computation in basecalling via pre-basecalling filtering. Front Genet 2024; 15:1429306. [PMID: 39529848 PMCID: PMC11551021 DOI: 10.3389/fgene.2024.1429306] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/07/2024] [Accepted: 09/30/2024] [Indexed: 11/16/2024] Open
Abstract
Basecalling is an essential step in nanopore sequencing analysis where the raw signals of nanopore sequencers are converted into nucleotide sequences, that is, reads. State-of-the-art basecallers use complex deep learning models to achieve high basecalling accuracy. This makes basecalling computationally inefficient and memory-hungry, bottlenecking the entire genome analysis pipeline. However, for many applications, most reads do not match the reference genome of interest (i.e., target reference) and thus are discarded in later steps in the genomics pipeline, wasting the basecalling computation. To overcome this issue, we propose TargetCall, the first pre-basecalling filter to eliminate the wasted computation in basecalling. TargetCall's key idea is to discard reads that will not match the target reference (i.e., off-target reads) prior to basecalling. TargetCall consists of two main components: (1) LightCall, a lightweight neural network basecaller that produces noisy reads, and (2) Similarity Check, which labels each of these noisy reads as on-target or off-target by matching them to the target reference. Our thorough experimental evaluations show that TargetCall (1) improves the end-to-end basecalling runtime performance of the state-of-the-art basecaller by 3.31× while maintaining high (98.88%) recall in keeping on-target reads, (2) maintains high accuracy in downstream analysis, and (3) achieves better runtime performance, throughput, recall, precision, and generality than prior works. TargetCall is available at https://github.com/CMU-SAFARI/TargetCall.
Affiliation(s)
- Meryem Banu Cavlak
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Gagandeep Singh
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Mohammed Alser
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Can Firtina
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Joël Lindegger
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Mohammad Sadrosadati
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Nika Mansouri Ghiasi
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
- Can Alkan
  - Department of Computer Engineering, Bilkent University, Ankara, Türkiye
- Onur Mutlu
  - SAFARI Research Group, Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich, Switzerland
3
Baudeau T, Sahlin K. Improved sub-genomic RNA prediction with the ARTIC protocol. Nucleic Acids Res 2024; 52:e82. [PMID: 39149898 PMCID: PMC11417393 DOI: 10.1093/nar/gkae687] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/21/2023] [Revised: 07/18/2024] [Accepted: 07/25/2024] [Indexed: 08/17/2024] Open
Abstract
Viral subgenomic RNA (sgRNA) plays a major role in SARS-CoV-2's replication, pathogenicity, and evolution. Recent sequencing protocols, such as the ARTIC protocol, have been established. However, due to the viral-specific biological processes, analyzing sgRNA through viral-specific read sequencing data is a computational challenge. Current methods rely on computational tools designed for eukaryotic genomes, resulting in a gap in the tools designed specifically for sgRNA detection. To address this, we make two contributions. Firstly, we present sgENERATE, an evaluation pipeline to study the accuracy and efficacy of sgRNA detection tools using the popular ARTIC sequencing protocol. Using sgENERATE, we evaluate periscope, a recently introduced tool that detects sgRNA from ARTIC sequencing data. We find that periscope has biased predictions and high computational costs. Secondly, using the information produced from sgENERATE, we redesign the algorithm in periscope to use multiple references from canonical sgRNAs to mitigate alignment issues and improve sgRNA and non-canonical sgRNA detection. We evaluate periscope and our algorithm, periscope_multi, on simulated and biological sequencing datasets and demonstrate periscope_multi's enhanced sgRNA detection accuracy. Our contribution advances tools for studying viral sgRNA, paving the way for more accurate and efficient analyses in the context of viral RNA discovery.
Affiliation(s)
- Thomas Baudeau
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91 Stockholm, Sweden
4
Sharma GK, Sharma R, Joshi K, Qureshi S, Mathur S, Sinha S, Chatterjee S, Nunia V. Advancing microbial diagnostics: a universal phylogeny guided computational algorithm to find unique sequences for precise microorganism detection. Brief Bioinform 2024; 25:bbae545. [PMID: 39441245 PMCID: PMC11497845 DOI: 10.1093/bib/bbae545] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/06/2024] [Revised: 09/21/2024] [Accepted: 10/11/2024] [Indexed: 10/25/2024] Open
Abstract
Sequences derived from organisms sharing common evolutionary origins exhibit similarity, while unique sequences, absent in related organisms, act as good diagnostic marker candidates. However, identifying dissimilar regions among closely related organisms is challenging because it requires complex multiple sequence alignments, making computation and parsing difficult. To address this, we have developed a biologically inspired universal NAUniSeq algorithm to find the unique sequences for microorganism diagnosis by traveling through the phylogeny of life. Mapping through a phylogenetic tree keeps cross-contamination and false positives low. We downloaded complete taxonomy data from the Taxadb database and sequence data from the National Center for Biotechnology Information Reference Sequence Database (NCBI-Refseq) and, with the help of NetworkX, created a phylogenetic tree. Sequences were assigned to the graph nodes, k-mers were created for target and non-target nodes, and a search was performed over the graph using the depth-first search algorithm. In a memory-efficient alternative NoSQL approach, we created a collection of Refseq sequences in a MongoDB database using the tax-id and path of FASTA files. We queried the MongoDB collection for the target and non-target sequences. In both approaches, we used an alignment-free sliding window k-mer-based procedure that quickly compares k-mers of target and non-target sequences and returns unique sequences that are not present in the non-target. We have validated our algorithm with target nodes Mycobacterium tuberculosis, Neisseria gonorrhoeae, and Monkeypox and generated unique sequences. This universal algorithm is a powerful tool for generating diagnostic sequences, enabling the accurate identification of microbial strains with high phylogenetic precision.
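The alignment-free sliding window comparison described in this abstract can be illustrated with a minimal sketch. It is not the NAUniSeq implementation: the function names `kmers` and `unique_kmers` are hypothetical, the phylogenetic tree traversal and the MongoDB lookup are left out, and sequences are assumed to be plain in-memory strings.

```python
def kmers(seq, k):
    """Return the set of all length-k substrings of seq (sliding window, step 1)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def unique_kmers(targets, non_targets, k):
    """Return k-mers found in at least one target sequence and in no non-target sequence.

    Hypothetical helper for illustration only; target and non-target
    selection (done via the phylogeny in the abstract) is assumed to have
    happened already.
    """
    target_set = set()
    for seq in targets:
        target_set |= kmers(seq, k)

    non_target_set = set()
    for seq in non_targets:
        non_target_set |= kmers(seq, k)

    # Set difference keeps only k-mers absent from every non-target sequence.
    return target_set - non_target_set


if __name__ == "__main__":
    print(sorted(unique_kmers(["ACGTTT"], ["ACGTAA"], 3)))
```

Using sets makes each membership check constant time on average, which is one plausible reason a k-mer comparison of this kind avoids the cost of multiple sequence alignment.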
Affiliation(s)
- Gulshan Kumar Sharma
  - Malaviya National Institute of Technology, Jawahar Lal Nehru Marg, Jhalana Gram, Malviya Nagar, Jaipur, Rajasthan 302017, India
- Rakesh Sharma
  - Centre for Converging Technologies, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
- Kavita Joshi
  - Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
- Sameer Qureshi
  - Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
- Shubhita Mathur
  - Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
- Sharad Sinha
  - Department of Mathematics, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
- Samit Chatterjee
  - Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
- Vandana Nunia
  - Department of Zoology, University of Rajasthan, Jawahar Lal Nehru Marg, Talvandi, Jaipur, Rajasthan 302004, India
5
Wang S, Jiang Y, Che L, Wang RH, Li SC. Enhancing insights into diseases through horizontal gene transfer event detection from gut microbiome. Nucleic Acids Res 2024; 52:e61. [PMID: 38884260 DOI: 10.1093/nar/gkae515] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 09/28/2023] [Revised: 04/23/2024] [Accepted: 06/04/2024] [Indexed: 06/18/2024] Open
Abstract
Horizontal gene transfer (HGT) phenomena pervade the gut microbiome and significantly impact human health. Yet, no current method can accurately identify complete HGT events, including the transferred sequence and the associated deletion and insertion breakpoints, from shotgun metagenomic data. Here, we develop LocalHGT, which facilitates the reliable and swift detection of complete HGT events from shotgun metagenomic data, delivering an accuracy of 99.4% (verified by Nanopore data) across 200 gut microbiome samples, and achieving an average F1 score of 0.99 on 100 simulated datasets. LocalHGT enables a systematic characterization of HGT events within the human gut microbiome across 2098 samples, revealing that multiple recipient genome sites can become targets of a transferred sequence, microhomology is enriched in HGT breakpoint junctions (P-value = 3.3e-58), and HGTs can function as host-specific fingerprints, as indicated by the significantly higher HGT similarity of intra-personal temporal samples than inter-personal samples (P-value = 4.3e-303). Crucially, HGTs showed potential contributions to colorectal cancer (CRC) and acute diarrhoea, as evidenced by the enrichment of the butyrate metabolism pathway (P-value = 3.8e-17) and the shigellosis pathway (P-value = 5.9e-13) in the respective associated HGTs. Furthermore, differential HGTs demonstrated promise as biomarkers for predicting various diseases. Integrating HGTs into a CRC prediction model achieved an AUC of 0.87.
Affiliation(s)
- Shuai Wang
  - City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
- Yiqi Jiang
  - City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
- Lijia Che
  - City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
- Ruo Han Wang
  - City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
- Shuai Cheng Li
  - City University of Hong Kong Shenzhen Research Institute, Shenzhen, China
  - Department of Computer Science, City University of Hong Kong, Kowloon, Hong Kong
6
Firtina C, Soysal M, Lindegger J, Mutlu O. RawHash2: mapping raw nanopore signals using hash-based seeding and adaptive quantization. Bioinformatics 2024; 40:btae478. [PMID: 39078113 PMCID: PMC11333567 DOI: 10.1093/bioinformatics/btae478] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/03/2023] [Revised: 07/04/2024] [Accepted: 07/29/2024] [Indexed: 07/31/2024] Open
Abstract
SUMMARY Raw nanopore signals can be analyzed while they are being generated, a process known as real-time analysis. Real-time analysis of raw signals is essential to utilize the unique features that nanopore sequencing provides, enabling the early stopping of the sequencing of a read or the entire sequencing run based on the analysis. The state-of-the-art mechanism, RawHash, offers the first hash-based efficient and accurate similarity identification between raw signals and a reference genome by quickly matching their hash values. In this work, we introduce RawHash2, which provides major improvements over RawHash, including more sensitive quantization and chaining algorithms, weighted mapping decisions, frequency filters to reduce ambiguous seed hits, minimizers for hash-based sketching, and support for the R10.4 flow cell version and POD5 and SLOW5 file formats. Compared to RawHash, RawHash2 provides better F1 accuracy (on average by 10.57% and up to 20.25%) and better throughput (on average by 4.0× and up to 9.9×). AVAILABILITY AND IMPLEMENTATION RawHash2 is available at https://github.com/CMU-SAFARI/RawHash. We also provide the scripts to fully reproduce our results on our GitHub page.
Affiliation(s)
- Can Firtina
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
- Melina Soysal
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
- Joël Lindegger
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zurich, Zurich 8092, Switzerland
7
Xu W, Hsu PK, Moshiri N, Yu S, Rosing T. HyperGen: Compact and Efficient Genome Sketching using Hyperdimensional Vectors. Bioinformatics 2024; 40:btae452. [PMID: 39012512 PMCID: PMC11281827 DOI: 10.1093/bioinformatics/btae452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/05/2024] [Revised: 07/09/2024] [Accepted: 07/12/2024] [Indexed: 07/17/2024] Open
Abstract
MOTIVATION Genomic distance estimation is a critical workload since exact computation for whole-genome similarity metrics such as Average Nucleotide Identity (ANI) incurs prohibitive runtime overhead. Genome sketching is a fast and memory-efficient solution to estimate ANI similarity by distilling representative k-mers from the original sequences. In this work, we present HyperGen, which improves accuracy, runtime performance, and memory efficiency for large-scale ANI estimation. Unlike existing genome sketching algorithms that convert large genome files into discrete k-mer hashes, HyperGen leverages emerging hyperdimensional computing (HDC) to encode genomes into quasi-orthogonal vectors (Hypervector, HV) in high-dimensional space. HV is compact and can preserve more information, allowing for accurate ANI estimation while reducing required sketch sizes. In particular, the HV sketch representation in HyperGen allows efficient ANI estimation using vector multiplication, which naturally benefits from highly optimized general matrix multiply (GEMM) routines. As a result, HyperGen enables efficient sketching and ANI estimation for massive genome collections. RESULTS We evaluate HyperGen's sketching and database search performance using several genome datasets at various scales. HyperGen is able to achieve comparable or superior ANI estimation error and linearity compared to other sketch-based counterparts. The measurement results show that HyperGen is one of the fastest tools for both genome sketching and database search. Meanwhile, HyperGen produces memory-efficient sketch files while ensuring high ANI estimation accuracy. AVAILABILITY A Rust implementation of HyperGen is freely available under the MIT license as an open-source software project at https://github.com/wh-xu/Hyper-Gen. The scripts to reproduce the experimental results can be accessed at https://github.com/wh-xu/experiment-hyper-gen.
Affiliation(s)
- Weihong Xu
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Po-Kai Hsu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Niema Moshiri
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
- Shimeng Yu
  - School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332, United States
- Tajana Rosing
  - Department of Computer Science and Engineering, University of California San Diego, La Jolla, CA 92093, United States
8
Karami M, Soltani Mohammadi A, Martin M, Ekim B, Shen W, Guo L, Xu M, Pibiri GE, Patro R, Sahlin K. Designing efficient randstrobes for sequence similarity analyses. Bioinformatics 2024; 40:btae187. [PMID: 38579261 PMCID: PMC11034988 DOI: 10.1093/bioinformatics/btae187] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/18/2023] [Revised: 02/23/2024] [Accepted: 04/04/2024] [Indexed: 04/07/2024] Open
Abstract
MOTIVATION Substrings of length k, commonly referred to as k-mers, play a vital role in sequence analysis. However, k-mers are limited to exact matches between sequences, which has led to alternative constructs. We recently introduced a class of new constructs, strobemers, that can match across substitutions and smaller insertions and deletions. Randstrobes, the most sensitive strobemer proposed in Sahlin (Effective sequence similarity detection with strobemers. Genome Res 2021a;31:2080-94. https://doi.org/10.1101/gr.275648.121), have been used in several bioinformatics applications such as read classification, short-read mapping, and read overlap detection. Recently, we showed that the more pseudo-random the behavior of the construction (measured in entropy), the more efficient the seeds for sequence similarity analysis. The level of pseudo-randomness depends on the construction operators, but no study has investigated the efficacy. RESULTS In this study, we introduce novel construction methods, including a Binary Search Tree-based approach that improves time complexity over previous methods. To our knowledge, we are also the first to address biases in construction and design three metrics for measuring bias. Our evaluation shows that our methods have favorable speed and sampling uniformity compared to existing approaches. Lastly, guided by our results, we change the seed construction in strobealign, a short-read mapper, and find that the results change substantially. We suggest combining the two results to improve strobealign's accuracy for the shortest reads in our evaluated datasets. Our evaluation highlights sampling biases that can occur and provides guidance on which operators to use when implementing randstrobes. AVAILABILITY AND IMPLEMENTATION All methods and evaluation benchmarks are available in a public GitHub repository at https://github.com/Moein-Karami/RandStrobes. The scripts for running the strobealign analysis are found at https://github.com/NBISweden/strobealign-evaluation.
Affiliation(s)
- Moein Karami
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden
- Aryan Soltani Mohammadi
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden
- Marcel Martin
  - Department of Biochemistry and Biophysics, National Bioinformatics Infrastructure Sweden, Science for Life Laboratory, Stockholm University, Solna SE-17121, Sweden
- Barış Ekim
  - Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, MA 02139, United States
  - Broad Institute of MIT and Harvard, Cambridge, MA 02142, United States
- Wei Shen
  - Department of Infectious Diseases, Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Institute for Viral Hepatitis, The Second Affiliated Hospital of Chongqing Medical University, Chongqing 400010, China
- Giulio Ermanno Pibiri
  - Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice 30172, Italy
  - ISTI-CNR, Pisa 56124, Italy
- Rob Patro
  - Department of Computer Science and Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, United States
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, Stockholm 106 91, Sweden
9
Greenberg G, Ravi AN, Shomorony I. LexicHash: sequence similarity estimation via lexicographic comparison of hashes. Bioinformatics 2023; 39:btad652. [PMID: 37878809 PMCID: PMC10628434 DOI: 10.1093/bioinformatics/btad652] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/18/2023] [Revised: 10/11/2023] [Accepted: 10/23/2023] [Indexed: 10/27/2023] Open
Abstract
MOTIVATION Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. RESULTS In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how "lexicographically similar" the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision-recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm, and circumventing the seemingly fundamental O(n^2) scaling associated with pairwise similarity search. AVAILABILITY AND IMPLEMENTATION LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
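The standard MinHash procedure summarized in this abstract (hash every k-mer, store the minimum per hash function, count matching min-hashes) can be sketched as follows. This illustrates plain MinHash only, not LexicHash. The function names are hypothetical, and simulating distinct hash functions by prefixing a seed to a BLAKE2b input is an assumption made for the example.

```python
import hashlib


def _hash(kmer, seed):
    """64-bit hash of a k-mer; each seed stands in for a distinct hash function (assumption)."""
    data = f"{seed}:{kmer}".encode()
    return int.from_bytes(hashlib.blake2b(data, digest_size=8).digest(), "big")


def minhash_sketch(seq, k, num_hashes):
    """Return one min-hash per hash function over all k-mers of seq."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    kmer_set = {seq[i:i + k] for i in range(len(seq) - k + 1)}
    return [min(_hash(km, seed) for km in kmer_set) for seed in range(num_hashes)]


def estimate_similarity(sketch_a, sketch_b):
    """Fraction of hash functions whose min-hashes match between two sketches."""
    if len(sketch_a) != len(sketch_b):
        raise ValueError("sketches must use the same number of hash functions")
    matches = sum(a == b for a, b in zip(sketch_a, sketch_b))
    return matches / len(sketch_a)


if __name__ == "__main__":
    read_1 = "ACGTACGTTGCAACGT"
    read_2 = "ACGTACGTTGCATTTT"
    s1 = minhash_sketch(read_1, k=4, num_hashes=64)
    s2 = minhash_sketch(read_2, k=4, num_hashes=64)
    print(estimate_similarity(s1, s2))
```

The match fraction estimates the Jaccard similarity of the two k-mer sets. The tradeoff noted in the abstract shows up here directly: raising `k` makes chance matches rarer but also makes a single sequencing error destroy more shared k-mers.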
Affiliation(s)
- Grant Greenberg
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Aditya Narayan Ravi
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Ilan Shomorony
  - Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
10
Kille B, Garrison E, Treangen TJ, Phillippy AM. Minmers are a generalization of minimizers that enable unbiased local Jaccard estimation. Bioinformatics 2023; 39:btad512. [PMID: 37603771 PMCID: PMC10505501 DOI: 10.1093/bioinformatics/btad512] [Citation(s) in RCA: 13] [Impact Index Per Article: 6.5] [Received: 04/18/2023] [Revised: 07/19/2023] [Accepted: 08/18/2023] [Indexed: 08/23/2023] Open
Abstract
MOTIVATION The Jaccard similarity on k-mer sets has been shown to be a convenient proxy for sequence identity. By avoiding expensive base-level alignments and comparing reduced sequence representations, tools such as MashMap can scale to massive numbers of pairwise comparisons while still providing useful similarity estimates. However, due to their reliance on minimizer winnowing, previous versions of MashMap were shown to be biased and inconsistent estimators of Jaccard similarity. This directly impacts downstream tools that rely on the accuracy of these estimates. RESULTS To address this, we propose the minmer winnowing scheme, which generalizes the minimizer scheme by use of a rolling minhash with multiple sampled k-mers per window. We show both theoretically and empirically that minmers yield an unbiased estimator of local Jaccard similarity, and we implement this scheme in an updated version of MashMap. The minmer-based implementation is over 10 times faster than the minimizer-based version under the default ANI threshold, making it well-suited for large-scale comparative genomics applications. AVAILABILITY AND IMPLEMENTATION MashMap3 is available at https://github.com/marbl/MashMap.
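The relationship between minimizer winnowing and sampling multiple k-mers per window can be sketched with a naive window scan. This is a simplified illustration under stated assumptions, not the MashMap3 algorithm: the function name `sample_positions` and the parameters `w` (k-mers per window) and `s` (k-mers sampled per window) are hypothetical, the scan re-sorts every window instead of maintaining a rolling minhash, and repeated k-mers within a window are not deduplicated.

```python
import hashlib


def _hash(kmer):
    """Deterministic 64-bit hash of a k-mer (BLAKE2b chosen for the example)."""
    return int.from_bytes(hashlib.blake2b(kmer.encode(), digest_size=8).digest(), "big")


def sample_positions(seq, k, w, s=1):
    """Return sorted start positions of k-mers sampled by windowed selection.

    Each window holds w consecutive k-mers, and the s k-mers with the
    smallest hashes are kept from each window. With s=1 this is ordinary
    minimizer winnowing; s>1 keeps several k-mers per window, in the spirit
    of the minmer scheme described in the abstract.
    """
    hashed = [(_hash(seq[i:i + k]), i) for i in range(len(seq) - k + 1)]
    sampled = set()
    for start in range(len(hashed) - w + 1):
        window = hashed[start:start + w]
        # Ties are broken by position because tuples sort as (hash, position).
        for _, pos in sorted(window)[:s]:
            sampled.add(pos)
    return sorted(sampled)


if __name__ == "__main__":
    seq = "ACGTTGCAACGTAGCTAGCT"
    print(sample_positions(seq, k=4, w=5, s=1))  # minimizers
    print(sample_positions(seq, k=4, w=5, s=2))  # two samples per window
```

Every position chosen with `s=1` is also chosen with any larger `s`, which is the sense in which the multi-sample scheme generalizes minimizers. If the sequence yields fewer than `w` k-mers, no window exists and the function returns an empty list.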
Affiliation(s)
- Bryce Kille
  - Department of Computer Science, Rice University, Houston, TX, United States
- Erik Garrison
  - Department of Genetics, Genomics and Informatics, University of Tennessee Health Science Center, Memphis, TN, United States
- Todd J Treangen
  - Department of Computer Science, Rice University, Houston, TX, United States
- Adam M Phillippy
  - Genome Informatics Section, Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, United States
11
Ayad LAK, Chikhi R, Pissis SP. Seedability: optimizing alignment parameters for sensitive sequence comparison. Bioinformatics Advances 2023; 3:vbad108. [PMID: 37621456 PMCID: PMC10444664 DOI: 10.1093/bioadv/vbad108] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 08/02/2023] [Accepted: 08/10/2023] [Indexed: 08/26/2023]
Abstract
Motivation Most sequence alignment techniques make use of exact k-mer hits, called seeds, as anchors to optimize alignment speed. A large number of bioinformatics tools employing seed-based alignment techniques, such as Minimap2, use a single value of k per sequencing technology, without a strong guarantee that this is the best possible value. Given the ubiquity of sequence alignment, identifying values of k that lead to more sensitive alignments is thus an important task. To aid this, we present Seedability, a seed-based alignment framework designed for estimating an optimal seed k-mer length (as well as a minimal number of shared seeds) based on a given alignment identity threshold. In particular, we were motivated to make Minimap2 more sensitive in the pairwise alignment of short sequences. Results The experimental results herein show improved alignments of short and divergent sequences when using the parameter values determined by Seedability in comparison to the default values of Minimap2. We also show several cases of pairs of real divergent sequences, where the default parameter values of Minimap2 yield no output alignments, but the values output by Seedability produce plausible alignments. Availability and implementation https://github.com/lorrainea/Seedability (distributed under GPL v3.0).
Affiliation(s)
- Lorraine A K Ayad
  - Department of Computer Science, Brunel University London, London UB8 3PH, UK
- Rayan Chikhi
  - G5 Sequence Bioinformatics, Institut Pasteur, Université Paris Cité, 75015 Paris, France
- Solon P Pissis
  - Networks & Optimization, CWI, 1098 XG Amsterdam, The Netherlands
  - Department of Computer Science, Vrije Universiteit, 1081 HV Amsterdam, The Netherlands
12
|
Ekim B, Sahlin K, Medvedev P, Berger B, Chikhi R. Efficient mapping of accurate long reads in minimizer space with mapquik. Genome Res 2023; 33:1188-1197. [PMID: 37399256 PMCID: PMC10538364 DOI: 10.1101/gr.277679.123] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 01/08/2023] [Accepted: 06/26/2023] [Indexed: 07/05/2023]
Abstract
DNA sequencing data continue to progress toward longer reads with increasingly lower sequencing error rates. We focus on the critical problem of mapping, or aligning, low-divergence sequences from long reads (e.g., Pacific Biosciences [PacBio] HiFi) to a reference genome, which poses challenges in terms of accuracy and computational resources when using cutting-edge read mapping approaches that are designed for all types of alignments. A natural idea would be to optimize efficiency with longer seeds to reduce the probability of extraneous matches; however, contiguous exact seeds quickly reach a sensitivity limit. We introduce mapquik, a novel strategy that creates accurate longer seeds by anchoring alignments through matches of k consecutively sampled minimizers (k-min-mers) and only indexing k-min-mers that occur once in the reference genome, thereby unlocking ultrafast mapping while retaining high sensitivity. We show that mapquik significantly accelerates the seeding and chaining steps, which are fundamental bottlenecks to read mapping, for both the human and maize genomes with [Formula: see text] sensitivity and near-perfect specificity. On the human genome, for both real and simulated reads, mapquik achieves a [Formula: see text] speedup over the state-of-the-art tool minimap2, and on the maize genome, mapquik achieves a [Formula: see text] speedup over minimap2, making mapquik the fastest mapper to date. These accelerations are enabled not only by minimizer-space seeding but also by a novel heuristic [Formula: see text] pseudochaining algorithm, which improves upon the long-standing [Formula: see text] bound. Minimizer-space computation builds the foundation for achieving real-time analysis of long-read sequencing data.
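The k-min-mer idea described in this abstract can be sketched in a few lines. The following is a minimal illustration, not mapquik's implementation: it assumes minimizers are sampled by keeping every l-mer whose hash falls below a density threshold, uses CRC32 as a stand-in hash, and ignores reverse complements. All function names and default parameters are hypothetical.

```python
import zlib
from collections import Counter


def sample_minimizers(seq, l=5, density=0.5):
    """Return (position, l-mer) pairs whose hash falls below a density cutoff.

    Assumption: density-based sampling with CRC32 as a stand-in hash.
    """
    threshold = int(density * 2**32)
    picked = []
    for i in range(len(seq) - l + 1):
        lmer = seq[i:i + l]
        if zlib.crc32(lmer.encode()) < threshold:
            picked.append((i, lmer))
    return picked


def k_min_mers(minimizers, k=3):
    """Slide a window over k consecutive minimizers; each window is one seed."""
    seeds = []
    for j in range(len(minimizers) - k + 1):
        window = minimizers[j:j + k]
        key = tuple(lmer for _, lmer in window)
        seeds.append((key, window[0][0]))  # (seed key, start position)
    return seeds


def unique_index(seeds):
    """Index only the seeds that occur exactly once in the reference."""
    counts = Counter(key for key, _ in seeds)
    return {key: pos for key, pos in seeds if counts[key] == 1}
```

A reference index under these assumptions would be built as `unique_index(k_min_mers(sample_minimizers(reference)))`; a read is then seeded the same way and its k-min-mers are looked up in the dictionary. Dropping repeated seeds keeps every hit unambiguous, at the cost of leaving repetitive regions without anchors.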
Affiliation(s)
- Bariş Ekim
  - Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, SE-106 91 Stockholm, Sweden
- Paul Medvedev
  - Department of Computer Science and Engineering, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
  - Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory (CSAIL), Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
  - Department of Mathematics, Massachusetts Institute of Technology (MIT), Cambridge, Massachusetts 02139, USA
- Rayan Chikhi
  - Department of Computational Biology, Institut Pasteur, 75015 Paris, France
|
13
|
Maier BD, Sahlin K. Entropy predicts sensitivity of pseudorandom seeds. Genome Res 2023; 33:1162-1174. [PMID: 37217253 PMCID: PMC10538493 DOI: 10.1101/gr.277645.123] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 01/02/2023] [Accepted: 05/04/2023] [Indexed: 05/24/2023]
Abstract
Seed design is important for sequence similarity search applications such as read mapping and average nucleotide identity (ANI) estimation. Although k-mers and spaced k-mers are likely the most well-known and used seeds, sensitivity suffers at high error rates, particularly when indels are present. Recently, we developed a pseudorandom seeding construct, strobemers, which was empirically shown to have high sensitivity even at high indel rates. However, that study did not explain why. In this study, we propose a model to estimate the entropy of a seed and find that seeds with high entropy, according to our model, in most cases have high match sensitivity. Our discovered seed randomness-sensitivity relationship explains why some seeds perform better than others, and the relationship provides a framework for designing even more sensitive seeds. We also present three new strobemer seed constructs: mixedstrobes, altstrobes, and multistrobes. We use both simulated and biological data to show that our new seed constructs improve sequence-matching sensitivity relative to other strobemers. We show that the three new seed constructs are useful for read mapping and ANI estimation. For read mapping, we implement strobemers into minimap2 and observe 30% faster alignment time and 0.2% higher accuracy than using k-mers when mapping reads at high error rates. As for ANI estimation, we find that higher entropy seeds have a higher rank correlation between estimated and true ANI.
Affiliation(s)
- Kristoffer Sahlin
  - Department of Mathematics, Stockholm University, 106 91 Stockholm, Sweden
|
14
|
Firtina C, Mansouri Ghiasi N, Lindegger J, Singh G, Cavlak MB, Mao H, Mutlu O. RawHash: enabling fast and accurate real-time analysis of raw nanopore signals for large genomes. Bioinformatics 2023; 39:i297-i307. [PMID: 37387139 PMCID: PMC10311405 DOI: 10.1093/bioinformatics/btad272] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Indexed: 07/01/2023] Open
Abstract
Nanopore sequencers generate electrical raw signals in real-time while sequencing long genomic strands. These raw signals can be analyzed as they are generated, providing an opportunity for real-time genome analysis. An important feature of nanopore sequencing, Read Until, can eject strands from sequencers without fully sequencing them, which provides opportunities to computationally reduce the sequencing time and cost. However, existing works utilizing Read Until either (i) require powerful computational resources that may not be available for portable sequencers or (ii) lack scalability for large genomes, rendering them inaccurate or ineffective. We propose RawHash, the first mechanism that can accurately and efficiently perform real-time analysis of nanopore raw signals for large genomes using a hash-based similarity search. To enable this, RawHash ensures the signals corresponding to the same DNA content lead to the same hash value, regardless of the slight variations in these signals. RawHash achieves an accurate hash-based similarity search via an effective quantization of the raw signals such that signals corresponding to the same DNA content have the same quantized value and, subsequently, the same hash value. We evaluate RawHash on three applications: (i) read mapping, (ii) relative abundance estimation, and (iii) contamination analysis. Our evaluations show that RawHash is the only tool that can provide high accuracy and high throughput for analyzing large genomes in real-time. When compared to the state-of-the-art techniques, UNCALLED and Sigmap, RawHash provides (i) 25.8× and 3.4× better average throughput and (ii) significantly better accuracy for large genomes, respectively. Source code is available at https://github.com/CMU-SAFARI/RawHash.
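The quantize-then-hash idea in this abstract can be illustrated with a small sketch. It is not RawHash's implementation: it assumes the raw signal has already been segmented into normalized event values, uses uniform buckets over a fixed range, and packs a few consecutive bucket indices into one integer key. The function names, the range, the bucket count, and the window size `n` are all hypothetical.

```python
def quantize(value, lo=-3.0, hi=3.0, buckets=16):
    """Map a normalized event value to a coarse bucket index in [0, buckets)."""
    clipped = min(max(value, lo), hi)
    width = (hi - lo) / buckets
    return min(int((clipped - lo) / width), buckets - 1)


def hash_events(events, n=3, buckets=16):
    """Pack each run of n consecutive quantized events into one integer key."""
    q = [quantize(e, buckets=buckets) for e in events]
    keys = []
    for i in range(len(q) - n + 1):
        key = 0
        for v in q[i:i + n]:
            key = key * buckets + v  # base-`buckets` positional encoding
        keys.append(key)
    return keys


# Two slightly different readings of the same (made-up) signal.
signal_a = [0.10, -1.20, 2.05, 0.60]
signal_b = [0.15, -1.15, 2.10, 0.55]
print(hash_events(signal_a) == hash_events(signal_b))
```

Because small fluctuations fall into the same bucket, both signals produce identical keys and can be matched with exact hash lookups. In this simplified form, values that sit close to a bucket boundary can still land in different buckets, so coarser buckets trade specificity for noise tolerance.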
Affiliation(s)
- Can Firtina
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
- Nika Mansouri Ghiasi
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
- Joel Lindegger
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
- Gagandeep Singh
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
- Meryem Banu Cavlak
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
- Haiyu Mao
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
- Onur Mutlu
  - Department of Information Technology and Electrical Engineering, ETH Zurich, 8092 Zurich, Switzerland
|
15
|
Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023; 24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 24] [Impact Index Per Article: 12.0] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings (http://bcazaux.polytech-lille.net/Minimap2/).
Affiliation(s)
- Kristoffer Sahlin
  - Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden
- Thomas Baudeau
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Bastien Cazaux
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
- Camille Marchet
  - Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
|