1
|
Břinda K, Lima L, Pignotti S, Quinones-Olvera N, Salikhov K, Chikhi R, Kucherov G, Iqbal Z, Baym M. Efficient and robust search of microbial genomes via phylogenetic compression. Nat Methods 2025; 22:692-697. [PMID: 40205174 DOI: 10.1038/s41592-025-02625-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2023] [Accepted: 02/12/2025] [Indexed: 04/11/2025]
Abstract
Comprehensive collections approaching millions of sequenced genomes have become central information sources in the life sciences. However, the rapid growth of these collections has made it effectively impossible to search these data using tools such as the Basic Local Alignment Search Tool (BLAST) and its successors. Here, we present a technique called phylogenetic compression, which uses evolutionary history to guide compression and efficiently search large collections of microbial genomes using existing algorithms and data structures. We show that, when applied to modern diverse collections approaching millions of genomes, lossless phylogenetic compression improves the compression ratios of assemblies, de Bruijn graphs and k-mer indexes by one to two orders of magnitude. Additionally, we develop a pipeline for a BLAST-like search over these phylogeny-compressed reference data, and demonstrate it can align genes, plasmids or entire sequencing experiments against all sequenced bacteria until 2019 on ordinary desktop computers within a few hours. Phylogenetic compression has broad applications in computational biology and may provide a fundamental design principle for future genomics infrastructure.
Collapse
Affiliation(s)
- Karel Břinda
- Inria, Irisa, Univ. Rennes, Rennes, France.
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| | | | - Simone Pignotti
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA
- LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
| | | | - Kamil Salikhov
- LIGM, CNRS, Univ. Gustave Eiffel, Marne-la-Vallée, France
| | - Rayan Chikhi
- Institut Pasteur, Univ. Paris Cité, G5 Sequence Bioinformatics, Paris, France
| | | | - Zamin Iqbal
- EMBL-EBI, Hinxton, UK
- Milner Centre for Evolution, University of Bath, Bath, UK
| | - Michael Baym
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.
| |
Collapse
|
2
|
Alser M, Eudine J, Mutlu O. Taming large-scale genomic analyses via sparsified genomics. Nat Commun 2025; 16:876. [PMID: 39837860 PMCID: PMC11751491 DOI: 10.1038/s41467-024-55762-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2023] [Accepted: 12/20/2024] [Indexed: 01/23/2025] Open
Abstract
Searching for similar genomic sequences is an essential and fundamental step in biomedical research. State-of-the-art computational methods performing such comparisons fail to cope with the exponential growth of genomic sequencing data. We introduce the concept of sparsified genomics where we systematically exclude a large number of bases from genomic sequences and enable faster and memory-efficient processing of the sparsified, shorter genomic sequences, while providing comparable accuracy to processing non-sparsified sequences. Sparsified genomics provides benefits to many genomic analyses and has broad applicability. Sparsifying genomic sequences accelerates the state-of-the-art read mapper (minimap2) by 2.57-5.38x, 1.13-2.78x, and 3.52-6.28x using real Illumina, HiFi, and ONT reads, respectively, while providing comparable memory footprint, 2x smaller index size, and more correctly detected variations compared to minimap2. Sparsifying genomic sequences makes containment search through very large genomes and large databases 72.7-75.88x (1.62-1.9x when indexing is preprocessed) faster and 723.3x more storage-efficient than searching through non-sparsified genomic sequences (with CMash and KMC3). Sparsifying genomic sequences enables robust microbiome discovery by providing 54.15-61.88x (1.58-1.71x when indexing is preprocessed) faster and 720x more storage-efficient taxonomic profiling of metagenomic samples over the state-of-the-art tool (Metalign).
Collapse
Affiliation(s)
- Mohammed Alser
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland.
- Department of Computer Science, Georgia State University, Atlanta, GA, USA.
- Department of Clinical Pharmacy, University of Southern California, LA, CA, USA.
| | - Julien Eudine
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
| | - Onur Mutlu
- Department of Information Technology and Electrical Engineering, ETH Zürich, Zurich, Switzerland
| |
Collapse
|
3
|
Mian E, Petrucci E, Pizzi C, Comin M. MISSH: Fast Hashing of Multiple Spaced Seeds. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2024; 21:2330-2339. [PMID: 39320990 DOI: 10.1109/tcbb.2024.3467368] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/27/2024]
Abstract
Alignment-free analysis of sequences has revolutionized the high-throughput processing of sequencing data within numerous bioinformatics pipelines. Hashing -mers represents a common function across various alignment-free applications, serving as a crucial tool for indexing, querying, and rapid similarity searching. More recently, spaced seeds, a specialized pattern that accommodates errors or mutations, have become a standard choice over traditional -mers. Spaced seeds offer enhanced sensitivity in many applications when compared to -mers. However, it's important to note that hashing spaced seeds significantly increases computational time. Furthermore, if multiple spaced seeds are employed, accuracy can be further improved, albeit at the expense of longer processing times. This paper addresses the challenge of efficiently hashing multiple spaced seeds. The proposed algorithms leverage the similarity of adjacent spaced seed hash values within an input sequence, allowing for the swift computation of subsequent hashes. Our experimental results, conducted across various tests, demonstrate a remarkable performance improvement over previously suggested algorithms, with potential speedups of up to 20 times. Additionally, we apply these efficient spaced seed hashing algorithms to a metagenomic application, specifically the classification of reads using Clark-S (Ounit and Lonardi, 2016). Our findings reveal a substantial speedup, effectively mitigating the slowdown caused by the utilization of multiple spaced seeds.
Collapse
|
4
|
Zhao J, Both JP, Rodriguez-R LM, Konstantinidis KT. GSearch: ultra-fast and scalable genome search by combining K-mer hashing with hierarchical navigable small world graphs. Nucleic Acids Res 2024; 52:e74. [PMID: 39011878 PMCID: PMC11381346 DOI: 10.1093/nar/gkae609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2023] [Revised: 06/20/2024] [Accepted: 06/27/2024] [Indexed: 07/17/2024] Open
Abstract
Genome search and/or classification typically involves finding the best-match database (reference) genomes and has become increasingly challenging due to the growing number of available database genomes and the fact that traditional methods do not scale well with large databases. By combining k-mer hashing-based probabilistic data structures (i.e. ProbMinHash, SuperMinHash, Densified MinHash and SetSketch) to estimate genomic distance, with a graph based nearest neighbor search algorithm (Hierarchical Navigable Small World Graphs, or HNSW), we created a new data structure and developed an associated computer program, GSearch, that is orders of magnitude faster than alternative tools while maintaining high accuracy and low memory usage. For example, GSearch can search 8000 query genomes against all available microbial or viral genomes for their best matches (n = ∼318 000 or ∼3 000 000, respectively) within a few minutes on a personal laptop, using ∼6 GB of memory (2.5 GB via SetSketch). Notably, GSearch has an O(log(N)) time complexity and will scale well with billions of genomes based on a database splitting strategy. Further, GSearch implements a three-step search strategy depending on the degree of novelty of the query genomes to maximize specificity and sensitivity. Therefore, GSearch solves a major bottleneck of microbiome studies that require genome search and/or classification.
Collapse
Affiliation(s)
- Jianshu Zhao
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
| | | | - Luis M Rodriguez-R
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
- Department of Microbiology, University of Innsbruck, Innsbruck, Austria
- Digital Science Center (DiSC), University of Innsbruck, Innsbruck, Austria
| | - Konstantinos T Konstantinidis
- Center for Bioinformatics and Computational Genomics, Georgia Institute of Technology, Atlanta, GA, USA
- School of Biological Sciences, Georgia Institute of Technology, Atlanta, GA, USA
- School of Civil and Environmental Engineering, Georgia Institute of Technology, Atlanta, GA, USA
| |
Collapse
|
5
|
Marini S, Barquero A, Wadhwani AA, Bian J, Ruiz J, Boucher C, Prosperi M. OCTOPUS: Disk-based, Multiplatform, Mobile-friendly Metagenomics Classifier. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.03.15.585215. [PMID: 38559026 PMCID: PMC10979967 DOI: 10.1101/2024.03.15.585215] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 04/04/2024]
Abstract
Portable genomic sequencers such as Oxford Nanopore's MinION enable real-time applications in clinical and environmental health. However, there is a bottleneck in the downstream analytics when bioinformatics pipelines are unavailable, e.g., when cloud processing is unreachable due to absence of Internet connection, or only low-end computing devices can be carried on site. Here we present a platform-friendly software for portable metagenomic analysis of Nanopore data, the Oligomer-based Classifier of Taxonomic Operational and Pan-genome Units via Singletons (OCTOPUS). OCTOPUS is written in Java, reimplements several features of the popular Kraken2 and KrakenUniq software, with original components for improving metagenomics classification on incomplete/sampled reference databases, making it ideal for running on smartphones or tablets. OCTOPUS obtains sensitivity and precision comparable to Kraken2, while dramatically decreasing (4- to 16-fold) the false positive rate, and yielding high correlation on real-word data. OCTOPUS is available along with customized databases at https://github.com/DataIntellSystLab/OCTOPUS and https://github.com/Ruiz-HCI-Lab/OctopusMobile.
Collapse
Affiliation(s)
- Simone Marini
- Department of Epidemiology, University of Florida, Gainesville, USA
- Emerging Pathogens Institute, University of Florida, Gainesville, USA
| | - Alexander Barquero
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Anisha Ashok Wadhwani
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, University of Florida, USA
| | - Jaime Ruiz
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, USA
| | - Mattia Prosperi
- Department of Epidemiology, University of Florida, Gainesville, USA
| |
Collapse
|
6
|
Zheng H, Marçais G, Kingsford C. Creating and Using Minimizer Sketches in Computational Genomics. J Comput Biol 2023; 30:1251-1276. [PMID: 37646787 PMCID: PMC11082048 DOI: 10.1089/cmb.2023.0094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 09/01/2023] Open
Abstract
Processing large data sets has become an essential part of computational genomics. Greatly increased availability of sequence data from multiple sources has fueled breakthroughs in genomics and related fields but has led to computational challenges processing large sequencing experiments. The minimizer sketch is a popular method for sequence sketching that underlies core steps in computational genomics such as read mapping, sequence assembling, k-mer counting, and more. In most applications, minimizer sketches are constructed using one of few classical approaches. More recently, efforts have been put into building minimizer sketches with desirable properties compared with the classical constructions. In this survey, we review the history of the minimizer sketch, the theories developed around the concept, and the plethora of applications taking advantage of such sketches. We aim to provide the readers a comprehensive picture of the research landscape involving minimizer sketches, in anticipation of better fusion of theory and application in the future.
Collapse
Affiliation(s)
- Hongyu Zheng
- Computer Science Department, Princeton University, Princeton, New Jersey, USA
| | - Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
| |
Collapse
|
7
|
Sahlin K, Baudeau T, Cazaux B, Marchet C. A survey of mapping algorithms in the long-reads era. Genome Biol 2023; 24:133. [PMID: 37264447 PMCID: PMC10236595 DOI: 10.1186/s13059-023-02972-3] [Citation(s) in RCA: 21] [Impact Index Per Article: 10.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/03/2022] [Accepted: 05/12/2023] [Indexed: 06/03/2023] Open
Abstract
It has been over a decade since the first publication of a method dedicated entirely to mapping long-reads. The distinctive characteristics of long reads resulted in methods moving from the seed-and-extend framework used for short reads to a seed-and-chain framework due to the seed abundance in each read. The main novelties are based on alternative seed constructs or chaining formulations. Dozens of tools now exist, whose heuristics have evolved considerably. We provide an overview of the methods used in long-read mappers. Since they are driven by implementation-specific parameters, we develop an original visualization tool to understand the parameter settings ( http://bcazaux.polytech-lille.net/Minimap2/ ).
Collapse
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 106 91, Stockholm, Sweden.
| | - Thomas Baudeau
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Bastien Cazaux
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France
| | - Camille Marchet
- Univ. Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000, Lille, France.
| |
Collapse
|
8
|
Cavattoni M, Comin M. ClassGraph: Improving Metagenomic Read Classification with Overlap Graphs. J Comput Biol 2023. [PMID: 37023405 DOI: 10.1089/cmb.2022.0208] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/08/2023] Open
Abstract
ABSTRACT Current technologies allow the sequencing of microbial communities directly from the environment without prior culturing. One of the major problems when analyzing a microbial sample is to taxonomically annotate its reads to identify the species it contains. Most methods that are currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of sensitivity (the actual number of classified reads), the performance is often poor. One reason is that the reads in a sample can be very different from the corresponding reference genomes; for example, viral genomes are usually highly mutated. To address this issue, in this article, we propose ClassGraph, a new taxonomic classification method that makes use of the read overlap graph and applies a label propagation algorithm to refine the results of existing tools. We evaluated its performance on simulated and real datasets with several taxonomic classification tools, and the results showed an improved sensitivity and F-measure, while maintaining high precision. ClassGraph is capable of improving the classification accuracy, especially in difficult cases such as virus and real datasets, where traditional tools can classify <40% of reads.
Collapse
Affiliation(s)
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
9
|
Burks DJ, Pusadkar V, Azad RK. POSMM: an efficient alignment-free metagenomic profiler that complements alignment-based profiling. ENVIRONMENTAL MICROBIOME 2023; 18:16. [PMID: 36890583 PMCID: PMC9993663 DOI: 10.1186/s40793-023-00476-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 07/14/2022] [Accepted: 02/25/2023] [Indexed: 06/18/2023]
Abstract
We present here POSMM (pronounced 'Possum'), Python-Optimized Standard Markov Model classifier, which is a new incarnation of the Markov model approach to metagenomic sequence analysis. Built on the top of a rapid Markov model based classification algorithm SMM, POSMM reintroduces high sensitivity associated with alignment-free taxonomic classifiers to probe whole genome or metagenome datasets of increasingly prohibitive sizes. Logistic regression models generated and optimized using the Python sklearn library, transform Markov model probabilities to scores suitable for thresholding. Featuring a dynamic database-free approach, models are generated directly from genome fasta files per run, making POSMM a valuable accompaniment to many other programs. By combining POSMM with ultrafast classifiers such as Kraken2, their complementary strengths can be leveraged to produce higher overall accuracy in metagenomic sequence classification than by either as a standalone classifier. POSMM is a user-friendly and highly adaptable tool designed for broad use by the metagenome scientific community.
Collapse
Affiliation(s)
- David J Burks
- Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
| | - Vaidehi Pusadkar
- Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA
| | - Rajeev K Azad
- Department of Biological Sciences and BioDiscovery Institute, University of North Texas, Denton, TX, 76203, USA.
- Department of Mathematics, University of North Texas, Denton, TX, 76203, USA.
| |
Collapse
|
10
|
Shen W, Xiang H, Huang T, Tang H, Peng M, Cai D, Hu P, Ren H. KMCP: accurate metagenomic profiling of both prokaryotic and viral populations by pseudo-mapping. Bioinformatics 2023; 39:btac845. [PMID: 36579886 PMCID: PMC9828150 DOI: 10.1093/bioinformatics/btac845] [Citation(s) in RCA: 19] [Impact Index Per Article: 9.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/11/2022] [Revised: 12/17/2022] [Accepted: 12/28/2022] [Indexed: 12/30/2022] Open
Abstract
MOTIVATION The growing number of microbial reference genomes enables the improvement of metagenomic profiling accuracy but also imposes greater requirements on the indexing efficiency, database size and runtime of taxonomic profilers. Additionally, most profilers focus mainly on bacterial, archaeal and fungal populations, while less attention is paid to viral communities. RESULTS We present KMCP (K-mer-based Metagenomic Classification and Profiling), a novel k-mer-based metagenomic profiling tool that utilizes genome coverage information by splitting the reference genomes into chunks and stores k-mers in a modified and optimized Compact Bit-Sliced Signature Index for fast alignment-free sequence searching. KMCP combines k-mer similarity and genome coverage information to reduce the false positive rate of k-mer-based taxonomic classification and profiling methods. Benchmarking results based on simulated and real data demonstrate that KMCP, despite a longer running time than all other methods, not only allows the accurate taxonomic profiling of prokaryotic and viral populations but also provides more confident pathogen detection in clinical samples of low depth. AVAILABILITY AND IMPLEMENTATION The software is open-source under the MIT license and available at https://github.com/shenwei356/kmcp. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Wei Shen
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Hongyan Xiang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Tianquan Huang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Hui Tang
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Mingli Peng
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Dachuan Cai
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Peng Hu
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| | - Hong Ren
- Key Laboratory of Molecular Biology for Infectious Diseases (Ministry of Education), Department of Infectious Diseases, Institute for Viral Hepatitis, The Second Affiliated Hospital, Chongqing Medical University, Chongqing 400010, China
| |
Collapse
|
11
|
Zhu K, Schäffer AA, Robinson W, Xu J, Ruppin E, Ergun AF, Ye Y, Sahinalp SC. Strain level microbial detection and quantification with applications to single cell metagenomics. Nat Commun 2022; 13:6430. [PMID: 36307411 PMCID: PMC9616933 DOI: 10.1038/s41467-022-33869-7] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/23/2021] [Accepted: 10/04/2022] [Indexed: 12/25/2022] Open
Abstract
Computational identification and quantification of distinct microbes from high throughput sequencing data is crucial for our understanding of human health. Existing methods either use accurate but computationally expensive alignment-based approaches or less accurate but computationally fast alignment-free approaches, which often fail to correctly assign reads to genomes. Here we introduce CAMMiQ, a combinatorial optimization framework to identify and quantify distinct genomes (specified by a database) in a metagenomic dataset. As a key methodological innovation, CAMMiQ uses substrings of variable length and those that appear in two genomes in the database, as opposed to the commonly used fixed-length, unique substrings. These substrings allow to accurately decouple mixtures of highly similar genomes resulting in higher accuracy than the leading alternatives, without requiring additional computational resources, as demonstrated on commonly used benchmarking datasets. Importantly, we show that CAMMiQ can distinguish closely related bacterial strains in simulated metagenomic and real single-cell metatranscriptomic data.
Collapse
Affiliation(s)
- Kaiyuan Zhu
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Department of Computer Science & Engineering, UC San Diego, La Jolla, CA, USA
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Alejandro A Schäffer
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Welles Robinson
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- Surgery Branch, Center for Cancer Research, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Junyan Xu
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - Eytan Ruppin
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
| | - A Funda Ergun
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Yuzhen Ye
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA.
- Department of Computer Science, Indiana University, Bloomington, IN, USA.
| |
Collapse
|
12
|
Liu S, Koslicki D. CMash: fast, multi-resolution estimation of k-mer-based Jaccard and containment indices. Bioinformatics 2022; 38:i28-i35. [PMID: 35758788 PMCID: PMC9235470 DOI: 10.1093/bioinformatics/btac237] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open
Abstract
Motivation K-mer-based methods are used ubiquitously in the field of computational biology. However, determining the optimal value of k for a specific application often remains heuristic. Simply reconstructing a new k-mer set with another k-mer size is computationally expensive, especially in metagenomic analysis where datasets are large. Here, we introduce a hashing-based technique that leverages a kind of bottom-m sketch as well as a k-mer ternary search tree (KTST) to obtain k-mer-based similarity estimates for a range of k values. By truncating k-mers stored in a pre-built KTST with a large k=kmax value, we can simultaneously obtain k-mer-based estimates for all k values up to kmax. This truncation approach circumvents the reconstruction of new k-mer sets when changing k values, making analysis more time and space-efficient. Results We derived the theoretical expression of the bias factor due to truncation. And we showed that the biases are negligible in practice: when using a KTST to estimate the containment index between a RefSeq-based microbial reference database and simulated metagenome data for 10 values of k, the running time was close to 10× faster compared to a classic MinHash approach while using less than one-fifth the space to store the data structure. Availability and implementation A python implementation of this method, CMash, is available at https://github.com/dkoslicki/CMash. The reproduction of all experiments presented herein can be accessed via https://github.com/KoslickiLab/CMASH-reproducibles. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shaopeng Liu
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA
| | - David Koslicki
- Huck Institutes of Life Sciences, Pennsylvania State University, State College, PA 16801, USA.,Department of Computer Science and Engineering, Pennsylvania State University, State College, PA 16801, USA.,Department of Biology, Pennsylvania State University, State College, PA 16801, USA
| |
Collapse
|
13
|
Tang D, Li Y, Tan D, Fu J, Tang Y, Lin J, Zhao R, Du H, Zhao Z. KCOSS: an ultra-fast k-mer counter for assembled genome analysis. Bioinformatics 2022; 38:933-940. [PMID: 34849595 DOI: 10.1093/bioinformatics/btab797] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2021] [Revised: 10/13/2021] [Accepted: 11/19/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION The k-mer frequency in whole genome sequences provides researchers with an insightful perspective on genomic complexity, comparative genomics, metagenomics and phylogeny. The current k-mer counting tools are typically slow, and they require large memory and hard disk for assembled genome analysis. RESULTS We propose a novel and ultra-fast k-mer counting algorithm, KCOSS, to fulfill k-mer counting mainly for assembled genomes with segmented Bloom filter, lock-free queue, lock-free thread pool and cuckoo hash table. We optimize running time and memory consumption by recycling memory blocks, merging multiple consecutive first-occurrence k-mers into C-read, and writing a set of C-reads to disk asynchronously. KCOSS was comparatively tested with Jellyfish2, CHTKC and KMC3 on seven assembled genomes and three sequencing datasets in running time, memory consumption, and hard disk occupation. The experimental results show that KCOSS counts k-mer with less memory and disk while having a shorter running time on assembled genomes. KCOSS can be used to calculate the k-mer frequency not only for assembled genomes but also for sequencing data. AVAILABILITYAND IMPLEMENTATION The KCOSS software is implemented in C++. It is freely available on GitHub: https://github.com/kcoss-2021/KCOSS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Deyou Tang
- School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China.,Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA
| | - Yucheng Li
- School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Daqiang Tan
- School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Juan Fu
- School of Medicine, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Yelei Tang
- School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Jiabin Lin
- School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Rong Zhao
- School of Software Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Hongli Du
- School of Biology and Biological Engineering, South China University of Technology, Guangzhou, Guangdong 510006, China
| | - Zhongming Zhao
- Center for Precision Health, School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.,Human Genetics Center, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX 77030, USA.,MD Anderson Cancer Center UTHealth Graduate School of Biomedical Sciences, Houston, TX 77030, USA
| |
Collapse
|
14
|
Storato D, Comin M. K2Mem: Discovering Discriminative K-mers From Sequencing Data for Metagenomic Reads Classification. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2022; 19:220-229. [PMID: 34606462 DOI: 10.1109/tcbb.2021.3117406] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/13/2023]
Abstract
The major problem when analyzing a metagenomic sample is to taxonomically annotate its reads to identify the species they contain. Most of the methods currently available focus on the classification of reads using a set of reference genomes and their k-mers. While in terms of precision these methods have reached percentages of correctness close to perfection, in terms of recall (the actual number of classified reads) the performances fall at around 50%. One of the reasons is the fact that the sequences in a sample can be very different from the corresponding reference genome, e.g., viral genomes are highly mutated. To address this issue, in this paper we study the problem of metagenomic reads classification by improving the reference k-mers library with novel discriminative k-mers from the input sequencing reads. We evaluated the performance in different conditions against several other tools and the results showed an improved F-measure, especially when close reference genomes are not available. Availability: https://github.com.
Collapse
|
15
|
Siekaniec G, Roux E, Lemane T, Guédon E, Nicolas J. Identification of isolated or mixed strains from long reads: a challenge met on Streptococcus thermophilus using a MinION sequencer. Microb Genom 2021; 7. [PMID: 34812718 PMCID: PMC8743539 DOI: 10.1099/mgen.0.000654] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
This study aimed to provide efficient recognition of bacterial strains on personal computers from MinION (Nanopore) long read data. Thanks to the fall in sequencing costs, the identification of bacteria can now proceed by whole genome sequencing. MinION is a fast, but highly error-prone sequencing device and it is a challenge to successfully identify the strain content of unknown simple or complex microbial samples. It is heavily constrained by memory management and fast access to the read and genome fragments. Our strategy involves three steps: indexing of known genomic sequences for a given or several bacterial species; a request process to assign a read to a strain by matching it to the closest reference genomes; and a final step looking for a minimum set of strains that best explains the observed reads. We have applied our method, called ORI, on 77 strains of Streptococcus thermophilus. We worked on several genomic distances and obtained a detailed classification of the strains, together with a criterion that allows merging of what we termed 'sibling' strains, only separated by a few mutations. Overall, isolated strains can be safely recognized from MinION data. For mixtures of several non-sibling strains, results depend on strain abundance.
Collapse
Affiliation(s)
- Grégoire Siekaniec
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- INRAE, Institut Agro, STLO, F-35000, Rennes, France
| | - Emeline Roux
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- CALBINOTOX (Composés ALimentaire BIofonctionnalités et risques NeuTOXiques) EA7488 Université de Lorraine, France
| | - Téo Lemane
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
| | - Eric Guédon
- INRAE, Institut Agro, STLO, F-35000, Rennes, France
- *Correspondence: Eric Guédon,
| | - Jacques Nicolas
- Univ Rennes, INRIA, Campus de Beaulieu 35042 Rennes cedex, Rennes, France
- *Correspondence: Jacques Nicolas,
| |
Collapse
|
16
|
Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021; 31:2080-2094. [PMID: 34667119 PMCID: PMC8559714 DOI: 10.1101/gr.275648.121] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/13/2021] [Accepted: 08/20/2021] [Indexed: 01/08/2023]
Abstract
k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step, a pairing or grouping of k-mers is performed. Such techniques produce many redundant k-mer matches owing to the size of k Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts such as sequence clustering and alignment scenarios.
Collapse
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 10691 Stockholm, Sweden
| |
Collapse
|
17
|
Blanke M, Morgenstern B. App-SpaM: phylogenetic placement of short reads without sequence alignment. BIOINFORMATICS ADVANCES 2021; 1:vbab027. [PMID: 36700102 PMCID: PMC9710606 DOI: 10.1093/bioadv/vbab027] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/31/2021] [Revised: 09/27/2021] [Accepted: 10/11/2021] [Indexed: 01/28/2023]
Abstract
Motivation Phylogenetic placement is the task of placing a query sequence of unknown taxonomic origin into a given phylogenetic tree of a set of reference sequences. A major field of application of such methods is, for example, the taxonomic identification of reads in metabarcoding or metagenomic studies. Several approaches to phylogenetic placement have been proposed in recent years. The most accurate of them requires a multiple sequence alignment of the references as input. However, calculating multiple alignments is not only time-consuming but also limits the applicability of these approaches. Results Herein, we propose Alignment-free phylogenetic placement algorithm based on Spaced-word Matches (App-SpaM), an efficient algorithm for the phylogenetic placement of short sequencing reads on a tree of a set of reference sequences. App-SpaM produces results of high quality that are on a par with the best available approaches to phylogenetic placement, while our software is two orders of magnitude faster than these existing methods. Our approach neither requires a multiple alignment of the reference sequences nor alignments of the queries to the references. This enables App-SpaM to perform phylogenetic placement on a broad variety of datasets. Availability and implementation The source code of App-SpaM is freely available on Github at https://github.com/matthiasblanke/App-SpaM together with detailed instructions for installation and settings. App-SpaM is furthermore available as a Conda-package on the Bioconda channel. Contact matthias.blanke@biologie.uni-goettingen.de. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
- Matthias Blanke
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany
- International Max Planck Research School for Genome Science, Göttingen 37077, Germany
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany
- Campus-Institute Data Science (CIDAS), Göttingen 37077, Germany
| |
Collapse
|
18
|
Huang L, Hong B, Yang W, Wang L, Yu R. Snipe: highly sensitive pathogen detection from metagenomic sequencing data. Brief Bioinform 2021; 22:6210456. [PMID: 33822895 DOI: 10.1093/bib/bbab064] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2020] [Revised: 01/05/2021] [Accepted: 02/08/2021] [Indexed: 01/24/2023] Open
Abstract
Metagenomics data provide rich information for the detection of foodborne pathogens from food and environmental samples that are mixed with complex background bacteria strains. While pathogen detection from metagenomic sequencing data has become an activity of increasing interest, shotgun sequencing of uncultured food samples typically produces data that contain reads from many different organisms, making accurate strain typing a challenging task. Particularly, as many pathogens may contain a common set of genes that are highly similar to those from normal bacteria in food samples, traditional strain-level abundance profiling approaches do not perform well at detecting pathogens of very low abundance levels. To overcome this limitation, we propose an abundance correction method based on species-specific genomic regions to achieve high sensitivity and high specificity in target pathogen detection at low abundance.
Collapse
Affiliation(s)
- Lihong Huang
- School of Informatics of Xiamen University, China
| | - Bin Hong
- Department of Computer Science, School of Informatics of Xiamen University, China
| | | | - Liansheng Wang
- Department of Computer Science, School of Informatics of Xiamen University, China
| | - Rongshan Yu
- Department of Computer Science, School of Informatics of Xiamen University, China
| |
Collapse
|
19
|
Rossier V, Warwick Vesztrocy A, Robinson-Rechavi M, Dessimoz C. OMAmer: tree-driven and alignment-free protein assignment to subfamilies outperforms closest sequence approaches. Bioinformatics 2021; 37:2866-2873. [PMID: 33787851 PMCID: PMC8479680 DOI: 10.1093/bioinformatics/btab219] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/06/2020] [Revised: 02/18/2021] [Accepted: 03/30/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION Assigning new sequences to known protein families and subfamilies is a prerequisite for many functional, comparative and evolutionary genomics analyses. Such assignment is commonly achieved by looking for the closest sequence in a reference database, using a method such as BLAST. However, ignoring the gene phylogeny can be misleading because a query sequence does not necessarily belong to the same subfamily as its closest sequence. For example, a hemoglobin which branched out prior to the hemoglobin alpha/beta duplication could be closest to a hemoglobin alpha or beta sequence, whereas it is neither. To overcome this problem, phylogeny-driven tools have emerged but rely on gene trees, whose inference is computationally expensive. RESULTS Here, we first show that in multiple animal and plant datasets, 18-62% of assignments by closest sequence are misassigned, typically to an over-specific subfamily. Then, we introduce OMAmer, a novel alignment-free protein subfamily assignment method, which limits over-specific subfamily assignments and is suited to phylogenomic databases with thousands of genomes. OMAmer is based on an innovative method using evolutionarily informed k-mers for alignment-free mapping to ancestral protein subfamilies. Whilst able to reject non-homologous family-level assignments, we show that OMAmer provides better and quicker subfamily-level assignments than approaches relying on the closest sequence, whether inferred exactly by Smith-Waterman or by the fast heuristic DIAMOND. AVAILABILITYAND IMPLEMENTATION OMAmer is available from the Python Package Index (as omamer), with the source code and a precomputed database available at https://github.com/DessimozLab/omamer. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Victor Rossier
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland,Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland,SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Alex Warwick Vesztrocy
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland,Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland,SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Marc Robinson-Rechavi
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland,Department of Ecology and Evolution, University of Lausanne, 1015 Lausanne, Switzerland,To whom correspondence should be addressed. or
| | - Christophe Dessimoz
- Department of Computational Biology, University of Lausanne, 1015 Lausanne, Switzerland,Center for Integrative Genomics, University of Lausanne, 1015 Lausanne, Switzerland,SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland,Department of Genetics, Evolution, and Environment, University College London, London, WC1E 6BT, UK,Department of Computer Science, University College London, London, WC1E 6BT, UK,To whom correspondence should be addressed. or
| |
Collapse
|
20
|
Pornputtapong N, Acheampong DA, Patumcharoenpol P, Jenjaroenpun P, Wongsurawat T, Jun SR, Yongkiettrakul S, Chokesajjawatee N, Nookaew I. KITSUNE: A Tool for Identifying Empirically Optimal K-mer Length for Alignment-Free Phylogenomic Analysis. Front Bioeng Biotechnol 2020; 8:556413. [PMID: 33072720 PMCID: PMC7538862 DOI: 10.3389/fbioe.2020.556413] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2020] [Accepted: 08/24/2020] [Indexed: 12/22/2022] Open
Abstract
Genomic DNA is the best “unique identifier” for organisms. Alignment-free phylogenomic analysis, simple, fast, and efficient method to compare genome sequences, relies on looking at the distribution of small DNA sequence of a particular length, referred to as k-mer. The k-mer approach has been explored as a basis for sequence analysis applications, including assembly, phylogenetic tree inference, and classification. Although this approach is not novel, selecting the appropriate k-mer length to obtain the optimal resolution is rather arbitrary. However, it is a very important parameter for achieving the appropriate resolution for genome/sequence distances to infer biologically meaningful phylogenetic relationships. Thus, there is a need for a systematic approach to identify the appropriate k-mer from whole-genome sequences. We present K-mer–length Iterative Selection for UNbiased Ecophylogenomics (KITSUNE), a tool for assessing the empirically optimal k-mer length of any given set of genomes of interest for phylogenomic analysis via a three-step approach based on (1) cumulative relative entropy (CRE), (2) average number of common features (ACF), and (3) observed common features (OCF). Using KITSUNE, we demonstrated the feasibility and reliability of these measurements to obtain empirically optimal k-mer lengths of 11, 17, and ∼34 from large genome datasets of viruses, bacteria, and fungi, respectively. Moreover, we demonstrated a feature of KITSUNE for accurate species identification for the two de novo assembled bacterial genomes derived from error-prone long-reads sequences, and for a published yeast genome. In addition, KITSUNE was used to identify the shortest species-specific k-mer accurately identifying viruses. KITSUNE is freely available at https://github.com/natapol/kitsune.
Collapse
Affiliation(s)
- Natapol Pornputtapong
- Department of Biochemistry and Microbiology, Faculty of Pharmaceutical Sciences, and Research Unit of DNA Barcoding of Thai Medicinal Plants, Chulalongkorn University, Bangkok, Thailand
| | - Daniel A Acheampong
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States.,Joint Graduate Program in Bioinformatics, University of Arkansas at Little Rock and University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Preecha Patumcharoenpol
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Piroon Jenjaroenpun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Thidathip Wongsurawat
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Se-Ran Jun
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| | - Suganya Yongkiettrakul
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Nipa Chokesajjawatee
- National Center for Genetic Engineering and Biotechnology, National Science and Technology Development Agency, Pathum Thani, Thailand
| | - Intawat Nookaew
- Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR, United States
| |
Collapse
|
21
|
Mismatch-tolerant, alignment-free sequence classification using multiple spaced seeds and multiindex Bloom filters. Proc Natl Acad Sci U S A 2020; 117:16961-16968. [PMID: 32641514 PMCID: PMC7382288 DOI: 10.1073/pnas.1903436117] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
Abstract
Alignment-free classification tools have enabled high-throughput processing of sequencing data in many bioinformatics analysis pipelines primarily due to their computational efficiency. Originally k-mer based, such tools often lack sensitivity when faced with sequencing errors and polymorphisms. In response, some tools have been augmented with spaced seeds, which are capable of tolerating mismatches. However, spaced seeds have seen little practical use in classification because they bring increased computational and memory costs compared to methods that use k-mers. These limitations have also caused the design and length of practical spaced seeds to be constrained, since storing spaced seeds can be costly. To address these challenges, we have designed a probabilistic data structure called a multiindex Bloom Filter (miBF), which can store multiple spaced seed sequences with a low memory cost that remains static regardless of seed length or seed design. We formalize how to minimize the false-positive rate of miBFs when classifying sequences from multiple targets or references. Available within BioBloom Tools, we illustrate the utility of miBF in two use cases: read-binning for targeted assembly, and taxonomic read assignment. In our benchmarks, an analysis pipeline based on miBF shows higher sensitivity and specificity for read-binning than sequence alignment-based methods, also executing in less time. Similarly, for taxonomic classification, miBF enables higher sensitivity than a conventional spaced seed-based approach, while using half the memory and an order of magnitude less computational time.
Collapse
|
22
|
Filion GJ, Cortini R, Zorita E. Calibrating Seed-Based Heuristics to Map Short Reads With Sesame. Front Genet 2020; 11:572. [PMID: 32670351 PMCID: PMC7331467 DOI: 10.3389/fgene.2020.00572] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/11/2019] [Accepted: 05/11/2020] [Indexed: 11/13/2022] Open
Abstract
The increasing throughput of DNA sequencing technologies creates a need for faster algorithms. The fate of most reads is to be mapped to a reference sequence, typically a genome. Modern mappers rely on heuristics to gain speed at a reasonable cost for accuracy. In the seeding heuristic, short matches between the reads and the genome are used to narrow the search to a set of candidate locations. Several seeding variants used in modern mappers show good empirical performance but they are difficult to calibrate or to optimize for lack of theoretical results. Here we develop a theory to estimate the probability that the correct location of a read is filtered out during seeding, resulting in mapping errors. We describe the properties of simple exact seeds, skip seeds and MEM seeds (Maximal Exact Match seeds). The main innovation of this work is to use concepts from analytic combinatorics to represent reads as abstract sequences, and to specify their generative function to estimate the probabilities of interest. We provide several algorithms, which together give a workable solution for the problem of calibrating seeding heuristics for short reads. We also provide a C implementation of these algorithms in a library called Sesame. These results can improve current mapping algorithms and lay the foundation of a general strategy to tackle sequence alignment problems. The Sesame library is open source and available for download at https://github.com/gui11aume/sesame.
Collapse
Affiliation(s)
- Guillaume J. Filion
- Center for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
- University Pompeu Fabra (UPF), Barcelona, Spain
| | - Ruggero Cortini
- Center for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| | - Eduard Zorita
- Center for Genomic Regulation (CRG), The Barcelona Institute of Science and Technology, Barcelona, Spain
| |
Collapse
|
23
|
Comin M, Di Camillo B, Pizzi C, Vandin F. Comparison of microbiome samples: methods and computational challenges. Brief Bioinform 2020; 22:88-95. [PMID: 32577746 DOI: 10.1093/bib/bbaa121] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/22/2019] [Revised: 05/09/2020] [Accepted: 05/18/2020] [Indexed: 12/14/2022] Open
Abstract
The study of microbial communities crucially relies on the comparison of metagenomic next-generation sequencing data sets, for which several methods have been designed in recent years. Here, we review three key challenges in the comparison of such data sets: species identification and quantification, the efficient computation of distances between metagenomic samples and the identification of metagenomic features associated with a phenotype such as disease status. We present current solutions for such challenges, considering both reference-based methods relying on a database of reference genomes and reference-free methods working directly on all sequencing reads from the samples.
Collapse
|
24
|
Linard B, Swenson K, Pardi F. Rapid alignment-free phylogenetic identification of metagenomic sequences. Bioinformatics 2020; 35:3303-3312. [PMID: 30698645 DOI: 10.1093/bioinformatics/btz068] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2018] [Revised: 01/18/2019] [Accepted: 01/29/2019] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Taxonomic classification is at the core of environmental DNA analysis. When a phylogenetic tree can be built as a prior hypothesis to such classification, phylogenetic placement (PP) provides the most informative type of classification because each query sequence is assigned to its putative origin in the tree. This is useful whenever precision is sought (e.g. in diagnostics). However, likelihood-based PP algorithms struggle to scale with the ever-increasing throughput of DNA sequencing. RESULTS We have developed RAPPAS (Rapid Alignment-free Phylogenetic Placement via Ancestral Sequences) which uses an alignment-free approach, removing the hurdle of query sequence alignment as a preliminary step to PP. Our approach relies on the precomputation of a database of k-mers that may be present with non-negligible probability in relatives of the reference sequences. The placement is performed by inspecting the stored phylogenetic origins of the k-mers in the query, and their probabilities. The database can be reused for the analysis of several different metagenomes. Experiments show that the first implementation of RAPPAS is already faster than competing likelihood-based PP algorithms, while keeping similar accuracy for short reads. RAPPAS scales PP for the era of routine metagenomic diagnostics. AVAILABILITY AND IMPLEMENTATION Program and sources freely available for download at https://github.com/blinard-BIOINFO/RAPPAS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Benjamin Linard
- LIRMM, University of Montpellier, CNRS, Montpellier, France.,ISEM, University of Montpellier, CNRS, IRD, EPHE, CIRAD, INRAP, Montpellier, France.,AGAP, University of Montpellier, CIRAD, INRA, Montpellier Supagro, Montpellier, France
| | - Krister Swenson
- LIRMM, University of Montpellier, CNRS, Montpellier, France.,Institut de Biologie Computationnelle, Montpellier, France
| | - Fabio Pardi
- LIRMM, University of Montpellier, CNRS, Montpellier, France.,Institut de Biologie Computationnelle, Montpellier, France
| |
Collapse
|
25
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences-i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor-can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
Collapse
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | | | - Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
| |
Collapse
|
26
|
Rachtman E, Balaban M, Bafna V, Mirarab S. The impact of contaminants on the accuracy of genome skimming and the effectiveness of exclusion read filters. Mol Ecol Resour 2020; 20. [PMID: 31943790 DOI: 10.1111/1755-0998.13135] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/18/2019] [Revised: 12/22/2019] [Accepted: 01/05/2020] [Indexed: 11/27/2022]
Abstract
The ability to detect the identity of a sample obtained from its environment is a cornerstone of molecular ecological research. Thanks to the falling price of shotgun sequencing, genome skimming, the acquisition of short reads spread across the genome at low coverage, is emerging as an alternative to traditional barcoding. By obtaining far more data across the whole genome, skimming has the promise to increase the precision of sample identification beyond traditional barcoding while keeping the costs manageable. While methods for assembly-free sample identification based on genome skims are now available, little is known about how these methods react to the presence of DNA from organisms other than the target species. In this paper, we show that the accuracy of distances computed between a pair of genome skims based on k-mer similarity can degrade dramatically if the skims include contaminant reads; i.e., any reads originating from other organisms. We establish a theoretical model of the impact of contamination. We then suggest and evaluate a solution to the contamination problem: Query reads in a genome skim against an extensive database of possible contaminants (e.g., all microbial organisms) and filter out any read that matches. We evaluate the effectiveness of this strategy when implemented using Kraken-II, in detailed analyses. Our results show substantial improvements in accuracy as a result of filtering but also point to limitations, including a need for relatively close matches in the contaminant database.
Collapse
Affiliation(s)
- Eleonora Rachtman
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA, USA
| | - Metin Balaban
- Bioinformatics and Systems Biology Graduate Program, UC San Diego, CA, USA
| | - Vineet Bafna
- Department of Computer Science and Engineering, UC San Diego, CA, USA
| | - Siavash Mirarab
- Department of Electrical and Computer Engineering, UC San Diego, CA, USA
| |
Collapse
|
27
|
Petrucci E, Noé L, Pizzi C, Comin M. Iterative Spaced Seed Hashing: Closing the Gap Between Spaced Seed Hashing and k-mer Hashing. J Comput Biol 2020; 27:223-233. [PMID: 31800307 DOI: 10.1089/cmb.2019.0298] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
Alignment-free classification of sequences has enabled high-throughput processing of sequencing data in many bioinformatics pipelines. Much work has been done to speed up the indexing of k-mers through hash-table and other data structures. These efforts have led to very fast indexes, but because they are k-mer based, they often lack sensitivity due to sequencing errors or polymorphisms. Spaced seeds are a special type of pattern that accounts for errors or mutations. They allow to improve the sensitivity and they are now routinely used instead of k-mers in many applications. The major drawback of spaced seeds is that they cannot be efficiently hashed and thus their usage increases substantially the computational time. In this article we address the problem of efficient spaced seed hashing. We propose an iterative algorithm that combines multiple spaced seed hashes by exploiting the similarity of adjacent hash values to efficiently compute the next hash. We report a series of experiments on HTS reads hashing, with several spaced seeds. Our algorithm can compute the hashing values of spaced seeds with a speedup in range of [3.5 × -7 × ], outperforming previous methods. Software and data sets are available at Iterative Spaced Seed Hashing.
Collapse
Affiliation(s)
- Enrico Petrucci
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Laurent Noé
- CRIStAL UMR9189, Universit de Lille, Lille, France
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
28
|
Pellegrina L, Pizzi C, Vandin F. Fast Approximation of Frequent k-Mers and Applications to Metagenomics. J Comput Biol 2019; 27:534-549. [PMID: 31891535 DOI: 10.1089/cmb.2019.0314] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/23/2023] Open
Abstract
Estimating the abundances of all k-mers in a set of biological sequences is a fundamental and challenging problem with many applications in biological analysis. Although several methods have been designed for the exact or approximate solution of this problem, they all require to process the entire data set, which can be extremely expensive for high-throughput sequencing data sets. Although in some applications it is crucial to estimate all k-mers and their abundances, in other situations it may be sufficient to report only frequent k-mers, which appear with relatively high frequency in a data set. This is the case, for example, in the computation of k-mers' abundance-based distances among data sets of reads, commonly used in metagenomic analyses. In this study, we develop, analyze, and test a sampling-based approach, called Sampling Algorithm for K-mErs approxIMAtion (SAKEIMA), to approximate the frequent k-mers and their frequencies in a high-throughput sequencing data set while providing rigorous guarantees on the quality of the approximation. SAKEIMA employs an advanced sampling scheme and we show how the characterization of the Vapnik-Chervonenkis dimension, a core concept from statistical learning theory, of a properly defined set of functions leads to practical bounds on the sample size required for a rigorous approximation. Our experimental evaluation shows that SAKEIMA allows to rigorously approximate frequent k-mers by processing only a fraction of a data set and that the frequencies estimated by SAKEIMA lead to accurate estimates of k-mer-based distances between high-throughput sequencing data sets. Overall, SAKEIMA is an efficient and rigorous tool to estimate k-mers' abundances providing significant speedups in the analysis of large sequencing data sets.
Collapse
Affiliation(s)
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, Padova, Italy
| | - Fabio Vandin
- Department of Information Engineering, University of Padova, Padova, Italy
| |
Collapse
|
29
|
Ranjard L, Wong TKF, Rodrigo AG. Effective machine-learning assembly for next-generation amplicon sequencing with very low coverage. BMC Bioinformatics 2019; 20:654. [PMID: 31829137 PMCID: PMC6907241 DOI: 10.1186/s12859-019-3287-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/03/2019] [Accepted: 11/20/2019] [Indexed: 01/20/2023] Open
Abstract
BACKGROUND In short-read DNA sequencing experiments, the read coverage is a key parameter to successfully assemble the reads and reconstruct the sequence of the input DNA. When coverage is very low, the original sequence reconstruction from the reads can be difficult because of the occurrence of uncovered gaps. Reference guided assembly can then improve these assemblies. However, when the available reference is phylogenetically distant from the sequencing reads, the mapping rate of the reads can be extremely low. Some recent improvements in read mapping approaches aim at modifying the reference according to the reads dynamically. Such approaches can significantly improve the alignment rate of the reads onto distant references but the processing of insertions and deletions remains challenging. RESULTS Here, we introduce a new algorithm to update the reference sequence according to previously aligned reads. Substitutions, insertions and deletions are performed in the reference sequence dynamically. We evaluate this approach to assemble a western-grey kangaroo mitochondrial amplicon. Our results show that more reads can be aligned and that this method produces assemblies of length comparable to the truth while limiting error rate when classic approaches fail to recover the correct length. Finally, we discuss how the core algorithm of this method could be improved and combined with other approaches to analyse larger genomic sequences. CONCLUSIONS We introduced an algorithm to perform dynamic alignment of reads on a distant reference. We showed that such approach can improve the reconstruction of an amplicon compared to classically used bioinformatic pipelines. Although not portable to genomic scale in the current form, we suggested several improvements to be investigated to make this method more flexible and allow dynamic alignment to be used for large genome assemblies.
Collapse
Affiliation(s)
- Louis Ranjard
- The Research School of Biology, The Australian National University, Canberra, Australia
| | - Thomas K. F. Wong
- The Research School of Biology, The Australian National University, Canberra, Australia
| | - Allen G. Rodrigo
- The Research School of Biology, The Australian National University, Canberra, Australia
| |
Collapse
|
30
|
Abstract
Although Kraken's k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
Collapse
Affiliation(s)
- Derrick E Wood
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Lu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
| |
Collapse
|
31
|
Wood DE, Lu J, Langmead B. Improved metagenomic analysis with Kraken 2. Genome Biol 2019; 20:257. [PMID: 31779668 PMCID: PMC6883579 DOI: 10.1186/s13059-019-1891-0] [Citation(s) in RCA: 3218] [Impact Index Per Article: 536.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/20/2019] [Accepted: 11/18/2019] [Indexed: 02/06/2023] Open
Abstract
Although Kraken’s k-mer-based approach provides a fast taxonomic classification of metagenomic sequence data, its large memory requirements can be limiting for some applications. Kraken 2 improves upon Kraken 1 by reducing memory usage by 85%, allowing greater amounts of reference genomic data to be used, while maintaining high accuracy and increasing speed fivefold. Kraken 2 also introduces a translated search mode, providing increased sensitivity in viral metagenomics analysis.
Collapse
Affiliation(s)
- Derrick E Wood
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.,Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA
| | - Jennifer Lu
- Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.,Department of Biomedical Engineering, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
| | - Ben Langmead
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA. .,Center for Computational Biology, Johns Hopkins University, Baltimore, MD, USA.
| |
Collapse
|
32
|
Kawulok J, Kawulok M, Deorowicz S. Environmental metagenome classification for constructing a microbiome fingerprint. Biol Direct 2019; 14:20. [PMID: 31722729 PMCID: PMC6854650 DOI: 10.1186/s13062-019-0251-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/19/2018] [Accepted: 10/02/2019] [Indexed: 12/22/2022] Open
Abstract
BACKGROUND Nowadays, not only are single genomes commonly analyzed, but also metagenomes, which are sets of, DNA fragments (reads) derived from microbes living in a given environment. Metagenome analysis is aimed at extracting crucial information on the organisms that have left their traces in an investigated environmental sample.In this study we focus on the MetaSUB Forensics Challenge (organized within the CAMDA 2018 conference) which consists in predicting the geographical origin of metagenomic samples. Contrary to the existing methods for environmental classification that are based on taxonomic or functional classification, we rely on the similarity between a sample and the reference database computed at a reads level. RESULTS We report the results of our extensive experimental study to investigate the behavior of our method and its sensitivity to different parameters. In our tests, we have followed the protocol of the MetaSUB Challenge, which allowed us to compare the obtained results with the solutions based on taxonomic and functional classification. CONCLUSIONS The results reported in the paper indicate that our method is competitive with those based on taxonomic classification. Importantly, by measuring the similarity at the reads level, we avoid the necessity of using large databases with annotated gene sequences. Hence our main finding is that environmental classification of metagenomic data can be proceeded without using large databases required for taxonomic or functional classification. REVIEWERS This article was reviewed by Eran Elhaik, Alexandra Bettina Graf, Chengsheng Zhu, and Andre Kahles.
Collapse
Affiliation(s)
- Jolanta Kawulok
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Michal Kawulok
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| | - Sebastian Deorowicz
- Institute of Informatics, Silesian University of Technology, Gliwice, Poland
| |
Collapse
|
33
|
Luo Y, Yu YW, Zeng J, Berger B, Peng J. Metagenomic binning through low-density hashing. Bioinformatics 2019; 35:219-226. [PMID: 30010790 PMCID: PMC6330020 DOI: 10.1093/bioinformatics/bty611] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/08/2018] [Accepted: 07/10/2018] [Indexed: 02/05/2023] Open
Abstract
Motivation Vastly greater quantities of microbial genome data are being generated where environmental samples mix together the DNA from many different species. Here, we present Opal for metagenomic binning, the task of identifying the origin species of DNA sequencing reads. We introduce ‘low-density’ locality sensitive hashing to bioinformatics, with the addition of Gallager codes for even coverage, enabling quick and accurate metagenomic binning. Results On public benchmarks, Opal halves the error on precision/recall (F1-score) as compared with both alignment-based and alignment-free methods for species classification. We demonstrate even more marked improvement at higher taxonomic levels, allowing for the discovery of novel lineages. Furthermore, the innovation of low-density, even-coverage hashing should itself prove an essential methodological advance as it enables the application of machine learning to other bioinformatic challenges. Availability and implementation Full source code and datasets are available at http://opal.csail.mit.edu and https://github.com/yunwilliamyu/opal. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Yunan Luo
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA
| | - Yun William Yu
- Department of Biomedical Informatics, Harvard Medical School, Boston, MA, USA.,Department of Mathematics and Computer Science and AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Jianyang Zeng
- Machine Learning and Computational Biology Group, Institute for Interdisciplinary Information Sciences, Tsinghua University, Beijing, China and
| | - Bonnie Berger
- Department of Mathematics and Computer Science and AI Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
| | - Jian Peng
- Department of Computer Science, University of Illinois at Urbana-Champaign, Champaign, IL, USA
| |
Collapse
|
34
|
Leimeister CA, Dencker T, Morgenstern B. Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points. Bioinformatics 2019; 35:211-218. [PMID: 29992260 PMCID: PMC6330006 DOI: 10.1093/bioinformatics/bty592] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/23/2017] [Accepted: 07/09/2018] [Indexed: 01/30/2023] Open
Abstract
Motivation Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods. Results In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points. Availability and implementation http://spacedanchor.gobics.de. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics
| | - Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics.,Center for Computational Sciences, University of Goettingen, Goettingen, Germany
| |
Collapse
|
35
|
Shajii A, Numanagić I, Baghdadi R, Berger B, Amarasinghe S. Seq: A High-Performance Language for Bioinformatics. PROCEEDINGS OF THE ACM ON PROGRAMMING LANGUAGES 2019; 3:125. [PMID: 35775031 PMCID: PMC9241673 DOI: 10.1145/3360551] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The scope and scale of biological data are increasing at an exponential rate, as technologies like next-generation sequencing are becoming radically cheaper and more prevalent. Over the last two decades, the cost of sequencing a genome has dropped from $100 million to nearly $100-a factor of over 106-and the amount of data to be analyzed has increased proportionally. Yet, as Moore's Law continues to slow, computational biologists can no longer rely on computing hardware to compensate for the ever-increasing size of biological datasets. In a field where many researchers are primarily focused on biological analysis over computational optimization, the unfortunate solution to this problem is often to simply buy larger and faster machines. Here, we introduce Seq, the first language tailored specifically to bioinformatics, which marries the ease and productivity of Python with C-like performance. Seq starts with a subset of Python-and is in many cases a drop-in replacement-yet also incorporates novel bioinformatics- and computational genomics-oriented data types, language constructs and optimizations. Seq enables users to write high-level, Pythonic code without having to worry about low-level or domain-specific optimizations, and allows for the seamless expression of the algorithms, idioms and patterns found in many genomics or bioinformatics applications. We evaluated Seq on several standard computational genomics tasks like reverse complementation, k-mer manipulation, sequence pattern matching and large genomic index queries. On equivalent CPython code, Seq attains a performance improvement of up to two orders of magnitude, and a 160× improvement once domain-specific language features and optimizations are used. With parallelism, we demonstrate up to a 650× improvement. Compared to optimized C++ code, which is already difficult for most biologists to produce, Seq frequently attains up to a 2× improvement, and with shorter, cleaner code. Thus, Seq opens the door to an age of democratization of highly-optimized bioinformatics software.
Collapse
Affiliation(s)
- Ariya Shajii
- MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
| | | | | | - Bonnie Berger
- MIT CSAIL, 77 Massachusetts Ave, Cambridge, MA, 02139, USA
| | | |
Collapse
|
36
|
Breitwieser FP, Lu J, Salzberg SL. A review of methods and databases for metagenomic classification and assembly. Brief Bioinform 2019; 20:1125-1136. [PMID: 29028872 PMCID: PMC6781581 DOI: 10.1093/bib/bbx120] [Citation(s) in RCA: 294] [Impact Index Per Article: 49.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/07/2017] [Revised: 08/22/2017] [Indexed: 12/13/2022] Open
Abstract
Microbiome research has grown rapidly over the past decade, with a proliferation of new methods that seek to make sense of large, complex data sets. Here, we survey two of the primary types of methods for analyzing microbiome data: read classification and metagenomic assembly, and we review some of the challenges facing these methods. All of the methods rely on public genome databases, and we also discuss the content of these databases and how their quality has a direct impact on our ability to interpret a microbiome sample.
Collapse
Affiliation(s)
| | | | - Steven L Salzberg
- Corresponding author: Steven L. Salzberg, Center for Computational Biology, Johns Hopkins University, 1900 E. Monument St., Baltimore, MD, 21205, USA. E-mail:
| |
Collapse
|
37
|
Kucherov G. Evolution of biosequence search algorithms: a brief survey. Bioinformatics 2019; 35:3547-3552. [DOI: 10.1093/bioinformatics/btz272] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2018] [Revised: 04/01/2019] [Accepted: 04/11/2019] [Indexed: 11/14/2022] Open
Abstract
Abstract
Motivation
Although modern high-throughput biomolecular technologies produce various types of data, biosequence data remain at the core of bioinformatic analyses. However, computational techniques for dealing with this data evolved dramatically.
Results
In this bird’s-eye review, we overview the evolution of main algorithmic techniques for comparing and searching biological sequences. We highlight key algorithmic ideas emerged in response to several interconnected factors: shifts of biological analytical paradigm, advent of new sequencing technologies and a substantial increase in size of the available data. We discuss the expansion of alignment-free techniques coming to replace alignment-based algorithms in large-scale analyses. We further emphasize recently emerged and growing applications of sketching methods which support comparison of massive datasets, such as metagenomics samples. Finally, we focus on the transition to population genomics and outline associated algorithmic challenges.
Collapse
Affiliation(s)
- Gregory Kucherov
- CNRS and LIGM/University of Paris-Est, Marne-la-Vallée, France
- SkolTech, Moscow, Russia
| |
Collapse
|
38
|
Meher PK, Sahu TK, Gahoi S, Tomar R, Rao AR. funbarRF: DNA barcode-based fungal species prediction using multiclass Random Forest supervised learning model. BMC Genet 2019; 20:2. [PMID: 30616524 PMCID: PMC6323839 DOI: 10.1186/s12863-018-0710-z] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2018] [Accepted: 12/26/2018] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Identification of unknown fungal species aids to the conservation of fungal diversity. As many fungal species cannot be cultured, morphological identification of those species is almost impossible. But, DNA barcoding technique can be employed for identification of such species. For fungal taxonomy prediction, the ITS (internal transcribed spacer) region of rDNA (ribosomal DNA) is used as barcode. Though the computational prediction of fungal species has become feasible with the availability of huge volume of barcode sequences in public domain, prediction of fungal species is challenging due to high degree of variability among ITS regions within species. RESULTS A Random Forest (RF)-based predictor was built for identification of unknown fungal species. The reference and query sequences were mapped onto numeric features based on gapped base pair compositions, and then used as training and test sets respectively for prediction of fungal species using RF. More than 85% accuracy was found when 4 sequences per species in the reference set were utilized; whereas it was seen to be stabilized at ~88% if ≥7 sequence per species in the reference set were used for training of the model. The proposed model achieved comparable accuracy, while evaluated against existing methods through cross-validation procedure. The proposed model also outperformed several existing models used for identification of different species other than fungi. CONCLUSIONS An online prediction server "funbarRF" is established at http://cabgrid.res.in:8080/funbarrf/ for fungal species identification. Besides, an R-package funbarRF ( https://cran.r-project.org/web/packages/funbarRF/ ) is also available for prediction using high throughput sequence data. The effort put in this work will certainly supplement the future endeavors in the direction of fungal taxonomy assignments based on DNA barcode.
Collapse
Affiliation(s)
- Prabina Kumar Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Tanmaya Kumar Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| | - Ruchi Tomar
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
- Department of Bioinformatics, Janta Vedic College, Baraut, Baghpat, Uttar Pradesh 250611 India
| | - Atmakuri Ramakrishna Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, 110012 India
| |
Collapse
|
39
|
Pellegrina L, Pizzi C, Vandin F. Fast Approximation of Frequent k-mers and Applications to Metagenomics. LECTURE NOTES IN COMPUTER SCIENCE 2019. [DOI: 10.1007/978-3-030-17083-7_13] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/06/2023]
|
40
|
Abstract
Background Spaced-seeds, i.e. patterns in which some fixed positions are allowed to be wild-cards, play a crucial role in several bioinformatics applications involving substrings counting and indexing, by often providing better sensitivity with respect to k-mers based approaches. K-mers based approaches are usually fast, being based on efficient hashing and indexing that exploits the large overlap between consecutive k-mers. Spaced-seeds hashing is not as straightforward, and it is usually computed from scratch for each position in the input sequence. Recently, the FSH (Fast Spaced seed Hashing) approach was proposed to improve the time required for computation of the spaced seed hashing of DNA sequences with a speed-up of about 1.5 with respect to standard hashing computation. Results In this work we propose a novel algorithm, Fast Indexing for Spaced seed Hashing (FISH), based on the indexing of small blocks that can be combined to obtain the hashing of spaced-seeds of any length. The method exploits the fast computation of the hashing of runs of consecutive 1 in the spaced seeds, that basically correspond to k-mer of the length of the run. Conclusions We run several experiments, on NGS data from simulated and synthetic metagenomic experiments, to assess the time required for the computation of the hashing for each position in each read with respect to several spaced seeds. In our experiments, FISH can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.9x to 6.03x, depending on the structure of the spaced seeds. Electronic supplementary material The online version of this article (10.1186/s12859-018-2415-8) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Samuele Girotto
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy.
| | - Cinzia Pizzi
- Department of Information Engineering, University of Padova, via Gradenigo 6/A, Padova, Italy.
| |
Collapse
|
41
|
Mallet L, Bitard-Feildel T, Cerutti F, Chiapello H. PhylOligo: a package to identify contaminant or untargeted organism sequences in genome assemblies. Bioinformatics 2018. [PMID: 28637232 PMCID: PMC5860033 DOI: 10.1093/bioinformatics/btx396] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Motivation Genome sequencing projects sometimes uncover more organisms than expected, especially for complex and/or non-model organisms. It is therefore useful to develop software to identify mix of organisms from genome sequence assemblies. Results Here we present PhylOligo, a new package including tools to explore, identify and extract organism-specific sequences in a genome assembly using the analysis of their DNA compositional characteristics. Availability and implementation The tools are written in Python3 and R under the GPLv3 Licence and can be found at https://github.com/itsmeludo/Phyloligo/. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Ludovic Mallet
- INRA UR875, Unité Mathématiques et Informatique Appliquées de Toulouse (MIAT), Auzeville, 31326 Castanet-Tolosan, France
| | - Tristan Bitard-Feildel
- CNRS UMR7590, Sorbonne Universités, Université Pierre et Marie Curie - Paris 6, MNHN, IRD - IUC, Paris, France
| | - Franck Cerutti
- INRA UR875, Unité Mathématiques et Informatique Appliquées de Toulouse (MIAT), Auzeville, 31326 Castanet-Tolosan, France
| | - Hélène Chiapello
- INRA UR875, Unité Mathématiques et Informatique Appliquées de Toulouse (MIAT), Auzeville, 31326 Castanet-Tolosan, France
| |
Collapse
|
42
|
Zhang J, Guo J, Zhang M, Yu X, Yu X, Guo W, Zeng T, Chen L. Efficient Mining Multi-mers in a Variety of Biological Sequences. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2018; 17:949-958. [PMID: 29993642 DOI: 10.1109/tcbb.2018.2828313] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Counting the occurrence frequency of each -mer in a biological sequence is a preliminary yet important step in many bioinformatics applications. However, most -mer counting algorithms rely on a given k to produce single-length -mers, which is inefficient for sequence analysis for different k. Moreover, existing -mer counters focus more on DNA and RNA sequences and less on protein ones. In practice, the analysis of -mers in protein sequences can provide substantial biological insights in structure, function and evolution. To this end, an efficient algorithm, called MulMer (Multiple-Mer mining), is proposed to mine -mers of various lengths termed multi-mers via inverted-index technique, which is orders of magnitude faster than the conventional forward-index methods. Moreover, to the best of our knowledge, MulMer is the first able to mine multi-mers in a variety of sequences, including DNARNA and protein sequences.
Collapse
|
43
|
Girotto S, Comin M, Pizzi C. FSH: fast spaced seed hashing exploiting adjacent hashes. Algorithms Mol Biol 2018; 13:8. [PMID: 29588651 PMCID: PMC5863468 DOI: 10.1186/s13015-018-0125-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2017] [Accepted: 03/12/2018] [Indexed: 01/27/2023] Open
Abstract
Background Patterns with wildcards in specified positions, namely spaced seeds, are increasingly used instead of k-mers in many bioinformatics applications that require indexing, querying and rapid similarity search, as they can provide better sensitivity. Many of these applications require to compute the hashing of each position in the input sequences with respect to the given spaced seed, or to multiple spaced seeds. While the hashing of k-mers can be rapidly computed by exploiting the large overlap between consecutive k-mers, spaced seeds hashing is usually computed from scratch for each position in the input sequence, thus resulting in slower processing. Results The method proposed in this paper, fast spaced-seed hashing (FSH), exploits the similarity of the hash values of spaced seeds computed at adjacent positions in the input sequence. In our experiments we compute the hash for each positions of metagenomics reads from several datasets, with respect to different spaced seeds. We also propose a generalized version of the algorithm for the simultaneous computation of multiple spaced seeds hashing. In the experiments, our algorithm can compute the hashing values of spaced seeds with a speedup, with respect to the traditional approach, between 1.6\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\times$$\end{document}× to 5.3\documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\times$$\end{document}×, depending on the structure of the spaced seed. Conclusions Spaced seed hashing is a routine task for several bioinformatics application. FSH allows to perform this task efficiently and raise the question of whether other hashing can be exploited to further improve the speed up. This has the potential of major impact in the field, making spaced seed applications not only accurate, but also faster and more efficient. Availability The software FSH is freely available for academic use at: https://bitbucket.org/samu661/fsh/overview.
Collapse
|
44
|
Meher PK, Sahu TK, Gahoi S, Rao AR. ir-HSP: Improved Recognition of Heat Shock Proteins, Their Families and Sub-types Based On g-Spaced Di-peptide Features and Support Vector Machine. Front Genet 2018; 8:235. [PMID: 29379521 PMCID: PMC5770798 DOI: 10.3389/fgene.2017.00235] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/05/2017] [Accepted: 12/27/2017] [Indexed: 12/24/2022] Open
Abstract
Heat shock proteins (HSPs) play a pivotal role in cell growth and variability. Since conventional approaches are expensive and voluminous protein sequence information is available in the post-genomic era, development of an automated and accurate computational tool is highly desirable for prediction of HSPs, their families and sub-types. Thus, we propose a computational approach for reliable prediction of all these components in a single framework and with higher accuracy as well. The proposed approach achieved an overall accuracy of ~84% in predicting HSPs, ~97% in predicting six different families of HSPs, and ~94% in predicting four types of DnaJ proteins, with bench mark datasets. The developed approach also achieved higher accuracy as compared to most of the existing approaches. For easy prediction of HSPs by experimental scientists, a user friendly web server ir-HSP is made freely accessible at http://cabgrid.res.in:8080/ir-hsp. The ir-HSP was further evaluated for proteome-wide identification of HSPs by using proteome datasets of eight different species, and ~50% of the predicted HSPs in each species were found to be annotated with InterPro HSP families/domains. Thus, the developed computational method is expected to supplement the currently available approaches for prediction of HSPs, to the extent of their families and sub-types.
Collapse
Affiliation(s)
- Prabina K Meher
- Division of Statistical Genetics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Tanmaya K Sahu
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Shachi Gahoi
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| | - Atmakuri R Rao
- Centre for Agricultural Bioinformatics, ICAR-Indian Agricultural Statistics Research Institute, New Delhi, India
| |
Collapse
|
45
|
Noé L. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds. Algorithms Mol Biol 2017; 12:1. [PMID: 28289437 PMCID: PMC5310094 DOI: 10.1186/s13015-017-0092-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/19/2016] [Accepted: 01/30/2017] [Indexed: 12/02/2022] Open
Abstract
Background Spaced seeds, also named gapped q-grams, gapped k-mers, spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity. Results We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
Collapse
|
46
|
Siegwald L, Touzet H, Lemoine Y, Hot D, Audebert C, Caboche S. Assessment of Common and Emerging Bioinformatics Pipelines for Targeted Metagenomics. PLoS One 2017; 12:e0169563. [PMID: 28052134 PMCID: PMC5215245 DOI: 10.1371/journal.pone.0169563] [Citation(s) in RCA: 42] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/01/2016] [Accepted: 12/19/2016] [Indexed: 11/19/2022] Open
Abstract
Targeted metagenomics, also known as metagenetics, is a high-throughput sequencing application focusing on a nucleotide target in a microbiome to describe its taxonomic content. A wide range of bioinformatics pipelines are available to analyze sequencing outputs, and the choice of an appropriate tool is crucial and not trivial. No standard evaluation method exists for estimating the accuracy of a pipeline for targeted metagenomics analyses. This article proposes an evaluation protocol containing real and simulated targeted metagenomics datasets, and adequate metrics allowing us to study the impact of different variables on the biological interpretation of results. This protocol was used to compare six different bioinformatics pipelines in the basic user context: Three common ones (mothur, QIIME and BMP) based on a clustering-first approach and three emerging ones (Kraken, CLARK and One Codex) using an assignment-first approach. This study surprisingly reveals that the effect of sequencing errors has a bigger impact on the results that choosing different amplified regions. Moreover, increasing sequencing throughput increases richness overestimation, even more so for microbiota of high complexity. Finally, the choice of the reference database has a bigger impact on richness estimation for clustering-first pipelines, and on correct taxa identification for assignment-first pipelines. Using emerging assignment-first pipelines is a valid approach for targeted metagenomics analyses, with a quality of results comparable to popular clustering-first pipelines, even with an error-prone sequencing technology like Ion Torrent. However, those pipelines are highly sensitive to the quality of databases and their annotations, which makes clustering-first pipelines still the only reliable approach for studying microbiomes that are not well described.
Collapse
Affiliation(s)
- Léa Siegwald
- Gènes Diffusion, Douai, France
- CRIStAL (UMR CNRS 9189 University of Lille, Centre de Recherche en Informatique, Signal et Automatique de Lille) and Inria, Villeneuve d'Ascq, France
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, Lille, France
- PEGASE-Biosciences, Institut Pasteur de Lille, Lille, France
| | - Hélène Touzet
- CRIStAL (UMR CNRS 9189 University of Lille, Centre de Recherche en Informatique, Signal et Automatique de Lille) and Inria, Villeneuve d'Ascq, France
| | - Yves Lemoine
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, Lille, France
- PEGASE-Biosciences, Institut Pasteur de Lille, Lille, France
| | - David Hot
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, Lille, France
- PEGASE-Biosciences, Institut Pasteur de Lille, Lille, France
| | - Christophe Audebert
- Gènes Diffusion, Douai, France
- PEGASE-Biosciences, Institut Pasteur de Lille, Lille, France
| | - Ségolène Caboche
- Univ. Lille, CNRS, Inserm, CHU Lille, Institut Pasteur de Lille, U1019 - UMR 8204 - CIIL - Centre d'Infection et d'Immunité de Lille, Lille, France
- PEGASE-Biosciences, Institut Pasteur de Lille, Lille, France
- * E-mail:
| |
Collapse
|
47
|
Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B. rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison. PLoS Comput Biol 2016; 12:e1005107. [PMID: 27760124 PMCID: PMC5070788 DOI: 10.1371/journal.pcbi.1005107] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/16/2016] [Accepted: 08/11/2016] [Indexed: 12/05/2022] Open
Abstract
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
Collapse
Affiliation(s)
- Lars Hahn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
| | | | - Rachid Ounit
- University of California, Riverside, Department of Computer Science and Engineering, Riverside, California, United States of America
| | - Stefano Lonardi
- University of California, Riverside, Department of Computer Science and Engineering, Riverside, California, United States of America
| | - Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- University of Göttingen, Center for Computational Sciences, Göttingen, Germany
| |
Collapse
|