1
|
Song L, Langmead B. Centrifuger: lossless compression of microbial genomes for efficient and accurate metagenomic sequence classification. Genome Biol 2024; 25:106. [PMID: 38664753 PMCID: PMC11046777 DOI: 10.1186/s13059-024-03244-4] [Citation(s) in RCA: 5] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2023] [Accepted: 04/10/2024] [Indexed: 04/28/2024] Open
Abstract
Centrifuger is an efficient taxonomic classification method that compares sequencing reads against a microbial genome database. In Centrifuger, the Burrows-Wheeler transformed genome sequences are losslessly compressed using a novel scheme called run-block compression. Run-block compression achieves sublinear space complexity and is effective at compressing diverse microbial databases like RefSeq while supporting fast rank queries. Combining this compression method with other strategies for compacting the Ferragina-Manzini (FM) index, Centrifuger reduces the memory footprint by half compared to other FM-index-based approaches. Furthermore, the lossless compression and the unconstrained match length help Centrifuger achieve greater accuracy than competing methods at lower taxonomic levels.
Collapse
Affiliation(s)
- Li Song
- Department of Biomedical Data Science, Dartmouth College, Hanover, NH, USA.
- Department of Computer Science, Dartmouth College, Hanover, NH, USA.
- Department of Microbiology and Immunology, Dartmouth College, Hanover, NH, USA.
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA
| |
Collapse
|
2
|
Schulz T, Parmigiani L, Rempel A, Stoye J. Methods for Pangenomic Core Detection. Methods Mol Biol 2024; 2802:73-106. [PMID: 38819557 DOI: 10.1007/978-1-0716-3838-5_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/01/2024]
Abstract
Computational pangenomics deals with the joint analysis of all genomic sequences of a species. It has already been successfully applied to various tasks in many research areas. Further advances in DNA sequencing technologies constantly let more and more genomic sequences become available for many species, leading to an increasing attractiveness of pangenomic studies. At the same time, larger datasets also pose new challenges for data structures and algorithms that are needed to handle the data. Efficient methods oftentimes make use of the concept of k-mers.Core detection is a common way of analyzing a pangenome. The pangenome's core is defined as the subset of genomic information shared among all individual members. Classically, it is not only determined on the abstract level of genes but can also be described on the sequence level.In this chapter, we provide an overview of k-mer-based methods in the context of pangenomics studies. We first revisit existing software solutions for k-mer counting and k-mer set representation. Afterward, we describe the usage of two k-mer-based approaches, Pangrowth and Corer, for pangenomic core detection.
Collapse
Affiliation(s)
- Tizian Schulz
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Luca Parmigiani
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Andreas Rempel
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology, Bielefeld University, Bielefeld, Germany.
| |
Collapse
|
3
|
Depuydt L, Renders L, Abeel T, Fostier J. Pan-genome de Bruijn graph using the bidirectional FM-index. BMC Bioinformatics 2023; 24:400. [PMID: 37884897 PMCID: PMC10605969 DOI: 10.1186/s12859-023-05531-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Accepted: 10/12/2023] [Indexed: 10/28/2023] Open
Abstract
BACKGROUND Pan-genome graphs are gaining importance in the field of bioinformatics as data structures to represent and jointly analyze multiple genomes. Compacted de Bruijn graphs are inherently suited for this purpose, as their graph topology naturally reveals similarity and divergence within the pan-genome. Most state-of-the-art pan-genome graphs are represented explicitly in terms of nodes and edges. Recently, an alternative, implicit graph representation was proposed that builds directly upon the unidirectional FM-index. As such, a memory-efficient graph data structure is obtained that inherits the FM-index' backward search functionality. However, this representation suffers from a number of shortcomings in terms of functionality and algorithmic performance. RESULTS We present a data structure for a pan-genome, compacted de Bruijn graph that aims to address these shortcomings. It is built on the bidirectional FM-index, extending the ability of its unidirectional counterpart to navigate and search the graph in both directions. All basic graph navigation steps can be performed in constant time. Based on these features, we implement subgraph visualization as well as lossless approximate pattern matching to the graph using search schemes. We demonstrate that we can retrieve all occurrences corresponding to a read within a certain edit distance in a very efficient manner. Through a case study, we show the potential of exploiting the information embedded in the graph's topology through visualization and sequence alignment. CONCLUSIONS We propose a memory-efficient representation of the pan-genome graph that supports subgraph visualization and lossless approximate pattern matching of reads against the graph using search schemes. The C++ source code of our software, called Nexus, is available at https://github.com/biointec/nexus under AGPL-3.0 license.
Collapse
Affiliation(s)
- Lore Depuydt
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium.
| | - Luca Renders
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium
| | - Thomas Abeel
- Delft Bioinformatics Lab, Delft University of Technology, 2628 XE, Delft, The Netherlands
- Infectious Disease and Microbiome Program, Broad Institute of MIT and Harvard, Cambridge, MA, 02142, USA
| | - Jan Fostier
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, 9052, Ghent, Belgium.
| |
Collapse
|
4
|
Chao KH, Chen PW, Seshia SA, Langmead B. WGT: Tools and algorithms for recognizing, visualizing, and generating Wheeler graphs. iScience 2023; 26:107402. [PMID: 37575187 PMCID: PMC10415921 DOI: 10.1016/j.isci.2023.107402] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2023] [Revised: 06/29/2023] [Accepted: 07/12/2023] [Indexed: 08/15/2023] Open
Abstract
A Wheeler graph represents a collection of strings in a way that is particularly easy to index and query. Such a graph is a practical choice for representing a graph-shaped pangenome, and it is the foundation for current graph-based pangenome indexes. However, there are no practical tools to visualize or to check graphs that may have the Wheeler properties. Here, we present Wheelie, an algorithm that combines a renaming heuristic with a permutation solver (Wheelie-PR) or a Satisfiability Modulo Theory (SMT) solver (Wheelie-SMT) to check whether a given graph has the Wheeler properties, a problem that is NP-complete in general. Wheelie can check a variety of random and real-world graphs in far less time than any algorithm proposed to date. It can check a graph with 1,000s of nodes in seconds. We implement these algorithms together with complementary visualization tools in the WGT toolkit, available as open source software at https://github.com/Kuanhao-Chao/Wheeler_Graph_Toolkit.
Collapse
Affiliation(s)
- Kuan-Hao Chao
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| | - Pei-Wei Chen
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Sanjit A. Seshia
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Ben Langmead
- Department of Computer Science, Johns Hopkins University, Baltimore, MD 21218, USA
| |
Collapse
|
5
|
Schmidt S, Alanko JN. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time. Algorithms Mol Biol 2023; 18:5. [PMID: 37403080 DOI: 10.1186/s13015-023-00227-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2023] [Accepted: 05/13/2023] [Indexed: 07/06/2023] Open
Abstract
A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses a Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
Collapse
Affiliation(s)
- Sebastian Schmidt
- Department of Computer Science, University of Helsinki, Helsinki, Finland.
| | - Jarno N Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Institute of Biology, National University of Sciences, Kiel, Germany
| |
Collapse
|
6
|
Marchet C, Limasset A. Scalable sequence database search using partitioned aggregated Bloom comb trees. Bioinformatics 2023; 39:i252-i259. [PMID: 37387170 DOI: 10.1093/bioinformatics/btad225] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION The Sequence Read Archive public database has reached 45 petabytes of raw sequences and doubles its nucleotide content every 2 years. Although BLAST-like methods can routinely search for a sequence in a small collection of genomes, making searchable immense public resources accessible is beyond the reach of alignment-based strategies. In recent years, abundant literature tackled the task of finding a sequence in extensive sequence collections using k-mer-based strategies. At present, the most scalable methods are approximate membership query data structures that combine the ability to query small signatures or variants while being scalable to collections up to 10 000 eukaryotic samples. Results. Here, we present PAC, a novel approximate membership query data structure for querying collections of sequence datasets. PAC index construction works in a streaming fashion without any disk footprint besides the index itself. It shows a 3-6 fold improvement in construction time compared to other compressed methods for comparable index size. A PAC query can need single random access and be performed in constant time in favorable instances. Using limited computation resources, we built PAC for very large collections. They include 32 000 human RNA-seq samples in 5 days, the entire GenBank bacterial genome collection in a single day for an index size of 3.5 TB. The latter is, to our knowledge, the largest sequence collection ever indexed using an approximate membership query structure. We also showed that PAC's ability to query 500 000 transcript sequences in less than an hour. AVAILABILITY AND IMPLEMENTATION PAC's open-source software is available at https://github.com/Malfoy/PAC.
Collapse
Affiliation(s)
- Camille Marchet
- University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
| | - Antoine Limasset
- University of Lille, CNRS, Centrale Lille, UMR 9189 CRIStAL, F-59000 Lille, France
| |
Collapse
|
7
|
Alanko JN, Vuohtoniemi J, Mäklin T, Puglisi SJ. Themisto: a scalable colored k-mer index for sensitive pseudoalignment against hundreds of thousands of bacterial genomes. Bioinformatics 2023; 39:i260-i269. [PMID: 37387143 DOI: 10.1093/bioinformatics/btad233] [Citation(s) in RCA: 17] [Impact Index Per Article: 8.5] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 07/01/2023] Open
Abstract
MOTIVATION Huge datasets containing whole-genome sequences of bacterial strains are now commonplace and represent a rich and important resource for modern genomic epidemiology and metagenomics. In order to efficiently make use of these datasets, efficient indexing data structures-that are both scalable and provide rapid query throughput-are paramount. RESULTS Here, we present Themisto, a scalable colored k-mer index designed for large collections of microbial reference genomes, that works for both short and long read data. Themisto indexes 179 thousand Salmonella enterica genomes in 9 h. The resulting index takes 142 gigabytes. In comparison, the best competing tools Metagraph and Bifrost were only able to index 11 000 genomes in the same time. In pseudoalignment, these other tools were either an order of magnitude slower than Themisto, or used an order of magnitude more memory. Themisto also offers superior pseudoalignment quality, achieving a higher recall than previous methods on Nanopore read sets. AVAILABILITY AND IMPLEMENTATION Themisto is available and documented as a C++ package at https://github.com/algbio/themisto available under the GPLv2 license.
Collapse
Affiliation(s)
- Jarno N Alanko
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| | - Jaakko Vuohtoniemi
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| | - Tommi Mäklin
- Department of Mathematics and Statistics, University of Helsinki, Helsinki 00014, Finland
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, Helsinki 00014, Finland
| |
Collapse
|
8
|
Noll N, Molari M, Shaw LP, Neher RA. PanGraph: scalable bacterial pan-genome graph construction. Microb Genom 2023; 9:mgen001034. [PMID: 37278719 PMCID: PMC10327495 DOI: 10.1099/mgen.0.001034] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/13/2022] [Accepted: 04/14/2023] [Indexed: 06/07/2023] Open
Abstract
The genomic diversity of microbes is commonly parameterized as SNPs relative to a reference genome of a well-characterized, but arbitrary, isolate. However, any reference genome contains only a fraction of the microbial pangenome, the total set of genes observed in a given species. Reference-based approaches are thus blind to the dynamics of the accessory genome, as well as variation within gene order and copy number. With the widespread usage of long-read sequencing, the number of high-quality, complete genome assemblies has increased dramatically. In addition to pangenomic approaches that focus on the variation in the sets of genes present in different genomes, complete assemblies allow investigations of the evolution of genome structure and gene order. This latter problem, however, is computationally demanding with few tools available that shed light on these dynamics. Here, we present PanGraph, a Julia-based library and command line interface for aligning whole genomes into a graph. Each genome is represented as a path along vertices, which in turn encapsulate homologous multiple sequence alignments. The resultant data structure succinctly summarizes population-level nucleotide and structural polymorphisms and can be exported into several common formats for either downstream analysis or immediate visualization.
Collapse
Affiliation(s)
- Nicholas Noll
- Kavli Institute for Theoretical Physics, University of California, Santa Barbara, CA, USA
| | - Marco Molari
- Swiss Institute of Bioinformatics, Basel, Switzerland
- Biozentrum, University of Basel, Basel, Switzerland
| | - Liam P. Shaw
- Department of Biology, University of Oxford, Oxford, UK
| | - Richard A. Neher
- Swiss Institute of Bioinformatics, Basel, Switzerland
- Biozentrum, University of Basel, Basel, Switzerland
| |
Collapse
|
9
|
Hasegawa N, Shimizu K. Efficient Colored de Bruijn Graph for Indexing Reads. J Comput Biol 2023. [PMID: 37115583 DOI: 10.1089/cmb.2022.0259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 04/29/2023] Open
Abstract
The colored de Bruijn graph is a variation of the de Bruijn graph that has recently been utilized for indexing sequencing reads. Although state-of-the-art methods have achieved small index sizes, they produce many read-incoherent paths that tend to cover the same regions in the source genome sequence. To solve this problem, we propose an accurate coloring method that can reduce the generation of read-incoherent paths by utilizing different colors for a single read depending on the position in the read, which reduces ambiguous coloring in cases where a node has two successors, and both of the successors have the same color. To avoid having to memorize the order of the colors, we utilize a hash function to generate and reproduce the series of colors from the initial color and then apply a Bloom filter for storing the colors to reduce the index size. Experimental results using simulated data and real data demonstrate that our method reduces the occurrence of read-incoherent paths from 149,556 to only 2 and 5596 to 0 respectively. Moreover, the depths of coverage for the reconstructed reads are equal to those for the input reads for the simulated data, whereas the previous method decreases the depth of coverage at many positions in the source genome. Our method achieves quite a high accuracy with a comparable construction time, peak memory size, and index size to the previous method.
Collapse
Affiliation(s)
- Nozomi Hasegawa
- The Department of Computer Science and Engineering, Waseda University, Shinjuku-ku, Japan
| | - Kana Shimizu
- The Department of Computer Science and Engineering, Waseda University, Shinjuku-ku, Japan
- National Institute of Advanced Industrial Science and Technology, Koto-ku, Japan
| |
Collapse
|
10
|
Lu TY, Smaruj PN, Fudenberg G, Mancuso N, Chaisson MJP. The motif composition of variable number tandem repeats impacts gene expression. Genome Res 2023; 33:511-524. [PMID: 37037626 PMCID: PMC10234305 DOI: 10.1101/gr.276768.122] [Citation(s) in RCA: 18] [Impact Index Per Article: 9.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2022] [Accepted: 03/29/2023] [Indexed: 04/12/2023]
Abstract
Understanding the impact of DNA variation on human traits is a fundamental question in human genetics. Variable number tandem repeats (VNTRs) make up ∼3% of the human genome but are often excluded from association analysis owing to poor read mappability or divergent repeat content. Although methods exist to estimate VNTR length from short-read data, it is known that VNTRs vary in both length and repeat (motif) composition. Here, we use a repeat-pangenome graph (RPGG) constructed on 35 haplotype-resolved assemblies to detect variation in both VNTR length and repeat composition. We align population-scale data from the Genotype-Tissue Expression (GTEx) Consortium to examine how variations in sequence composition may be linked to expression, including cases independent of overall VNTR length. We find that 9422 out of 39,125 VNTRs are associated with nearby gene expression through motif variations, of which only 23.4% are accessible from length. Fine-mapping identifies 174 genes to be likely driven by variation in certain VNTR motifs and not overall length. We highlight two genes, CACNA1C and RNF213, that have expression associated with motif variation, showing the utility of RPGG analysis as a new approach for trait association in multiallelic and highly variable loci.
Collapse
Affiliation(s)
- Tsung-Yu Lu
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
| | - Paulina N Smaruj
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
| | - Geoffrey Fudenberg
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
| | - Nicholas Mancuso
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA
- Center for Genetic Epidemiology, Department of Population and Public Health Sciences, Keck School of Medicine, University of Southern California, Los Angeles, California 90033, USA
| | - Mark J P Chaisson
- Department of Quantitative and Computational Biology, University of Southern California, Los Angeles, California 90089, USA;
- The Genomic and Epigenomic Regulation Program, USC Norris Cancer Center, University of Southern California, Los Angeles, California 90033, USA
| |
Collapse
|
11
|
Schmidt S, Alanko JN. Eulertigs: minimum plain text representation of k-mer sets without repetitions in linear time. RESEARCH SQUARE 2023:rs.3.rs-2581995. [PMID: 36824947 PMCID: PMC9949180 DOI: 10.21203/rs.3.rs-2581995/v1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/17/2023]
Abstract
A fundamental operation in computational genomics is to reduce the input sequences to their constituent k-mers. For maximum performance of downstream applications it is important to store the k-mers in small space, while keeping the representation easy and efficient to use (i.e. without k-mer repetitions and in plain text). Recently, heuristics were presented to compute a near-minimum such representation. We present an algorithm to compute a minimum representation in optimal (linear) time and use it to evaluate the existing heuristics. Our algorithm first constructs the de Bruijn graph in linear time and then uses a Eulerian-cycle-based algorithm to compute the minimum representation, in time linear in the size of the output.
Collapse
Affiliation(s)
- Sebastian Schmidt
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Jarno N. Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Institute of Biology, National University of Sciences, Kiel, Germany
| |
Collapse
|
12
|
Chromosome-scale haplotype-resolved pangenomics. Trends Genet 2022; 38:1103-1107. [PMID: 35817620 DOI: 10.1016/j.tig.2022.06.011] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/21/2022] [Revised: 06/14/2022] [Accepted: 06/16/2022] [Indexed: 01/24/2023]
Abstract
Complete pangenomics is crucial for understanding genetic diversity and evolution across the tree of life. Chromosome-scale, haplotype-resolved pangenomics allows complex structural variations, long-range interactions, and associated functions to be discerned in species populations. We explore the need for high-resolution pangenomes, discuss computational strategies for their development, and describe applications in biodiversity and human health.
Collapse
|
13
|
Karasikov M, Mustafa H, Rätsch G, Kahles A. Lossless indexing with counting de Bruijn graphs. Genome Res 2022; 32:1754-1764. [PMID: 35609994 PMCID: PMC9528980 DOI: 10.1101/gr.276607.122] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/17/2022] [Accepted: 05/05/2022] [Indexed: 11/25/2022]
Abstract
Sequencing data are rapidly accumulating in public repositories. Making this resource accessible for interactive analysis at scale requires efficient approaches for its storage and indexing. There have recently been remarkable advances in building compressed representations of annotated (or colored) de Bruijn graphs for efficiently indexing k-mer sets. However, approaches for representing quantitative attributes such as gene expression or genome positions in a general manner have remained underexplored. In this work, we propose counting de Bruijn graphs, a notion generalizing annotated de Bruijn graphs by supplementing each node-label relation with one or many attributes (e.g., a k-mer count or its positions). Counting de Bruijn graphs index k-mer abundances from 2652 human RNA-seq samples in over eightfold smaller representations compared with state-of-the-art bioinformatics tools and is faster to construct and query. Furthermore, counting de Bruijn graphs with positional annotations losslessly represent entire reads in indexes on average 27% smaller than the input compressed with gzip for human Illumina RNA-seq and 57% smaller for Pacific Biosciences (PacBio) HiFi sequencing of viral samples. A complete searchable index of all viral PacBio SMRT reads from NCBI's Sequence Read Archive (SRA) (152,884 samples, 875 Gbp) comprises only 178 GB. Finally, on the full RefSeq collection, we generate a lossless and fully queryable index that is 4.6-fold smaller than the MegaBLAST index. The techniques proposed in this work naturally complement existing methods and tools using de Bruijn graphs, and significantly broaden their applicability: from indexing k-mer counts and genome positions to implementing novel sequence alignment algorithms on top of highly compressed graph-based sequence indexes.
Collapse
Affiliation(s)
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Harun Mustafa
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
- Department of Biology at ETH Zurich, 8093 Zurich, Switzerland
- ETH AI Center, ETH Zurich, 8092 Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, 8092 Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, 8091 Zurich, Switzerland
- Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
Collapse
|
14
|
Calabrese FM, Ameur H, Nikoloudaki O, Celano G, Vacca M, Junior WJFL, Manzari C, Vertè F, Di Cagno R, Pesole G, De Angelis M, Gobbetti M. Metabolic framework of spontaneous and synthetic sourdough metacommunities to reveal microbial players responsible for resilience and performance. MICROBIOME 2022; 10:148. [PMID: 36104726 PMCID: PMC9472446 DOI: 10.1186/s40168-022-01301-3] [Citation(s) in RCA: 15] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 04/01/2022] [Accepted: 05/29/2022] [Indexed: 06/15/2023]
Abstract
BACKGROUND In nature, microbial communities undergo changes in composition that threaten their resiliency. Here, we interrogated sourdough, a natural cereal-fermenting metacommunity, as a dynamic ecosystem in which players are subjected to continuous environmental and spatiotemporal stimuli. RESULTS The inspection of spontaneous sourdough metagenomes and transcriptomes revealed dominant, subdominant and satellite players that are engaged in different functional pathways. The highest microbial richness was associated with the highest number of gene copies per pathway. Based on meta-omics data collected from 8 spontaneous sourdoughs and their identified microbiota, we de novo reconstructed a synthetic microbial community SDG. We also reconstructed SMC-SD43 from scratch using the microbial composition of its spontaneous sourdough equivalent for comparison. The KEGG number of dominant players in the SDG was not affected by depletion of a single player, whereas the subdominant and satellite species fluctuated, revealing unique contributions. Compared to SMC-SD43, SDG exhibited broader transcriptome redundancy. The invariant volatilome profile of SDG after in situ long-term back slopping revealed its stability. In contrast, SMC-SD43 lost many taxon members. Dominant, subdominant and satellite players together ensured gene and transcript redundancy. CONCLUSIONS Our study demonstrates how, by starting from spontaneous sourdoughs and reconstructing these communities synthetically, it was possible to unravel the metabolic contributions of individual players. For resilience and good performance, the sourdough metacommunity must include dominant, subdominant and satellite players, which together ensure gene and transcript redundancy. Overall, our study changes the paradigm and introduces theoretical foundations for directing food fermentations. Video Abstract.
Collapse
Affiliation(s)
| | - Hana Ameur
- Faculty of Science and Technology, Libera Università Di Bolzano, Piazza Università 5, 39100, Bolzano, Italy
| | - Olga Nikoloudaki
- Faculty of Science and Technology, Libera Università Di Bolzano, Piazza Università 5, 39100, Bolzano, Italy
| | - Giuseppe Celano
- Department of Soil, Plant and Food Science, University of Bari Aldo Moro, Bari, Italy
| | - Mirco Vacca
- Department of Soil, Plant and Food Science, University of Bari Aldo Moro, Bari, Italy
| | - Wilson JFLemos Junior
- Faculty of Science and Technology, Libera Università Di Bolzano, Piazza Università 5, 39100, Bolzano, Italy
| | - Caterina Manzari
- Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari Aldo Moro, Bari, Italy
| | - Fabienne Vertè
- Puratos NV, Industrialaan 25, 1702, Groot-Bijgaarden, Belgium
| | - Raffaella Di Cagno
- Faculty of Science and Technology, Libera Università Di Bolzano, Piazza Università 5, 39100, Bolzano, Italy.
| | - Graziano Pesole
- Department of Biosciences, Biotechnology and Biopharmaceutics, University of Bari Aldo Moro, Bari, Italy
| | - Maria De Angelis
- Department of Soil, Plant and Food Science, University of Bari Aldo Moro, Bari, Italy
| | - Marco Gobbetti
- Faculty of Science and Technology, Libera Università Di Bolzano, Piazza Università 5, 39100, Bolzano, Italy
| |
Collapse
|
15
|
Almodaresi F, Khan J, Madaminov S, Ferdman M, Johnson R, Pandey P, Patro R. An incrementally updatable and scalable system for large-scale sequence search using the Bentley-Saxe transformation. Bioinformatics 2022; 38:3155-3163. [PMID: 35325039 PMCID: PMC9191210 DOI: 10.1093/bioinformatics/btac142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2021] [Revised: 01/10/2022] [Accepted: 03/22/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION In the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes-Mantis, VariMerge and Bifrost-that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. RESULTS In this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates, called Dynamic Mantis. We demonstrate Dynamic Mantis's scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost's indexes and about half as big as VariMerge's indexes. AVAILABILITY AND IMPLEMENTATION Dynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Jamshed Khan
- Department of Computer Science, University of Maryland, USA
| | | | | | | | | | - Rob Patro
- To whom correspondence should be addressed.
| |
Collapse
|
16
|
Dufault‐Thompson K, Jiang X. Applications of de Bruijn graphs in microbiome research. IMETA 2022; 1:e4. [PMID: 38867733 PMCID: PMC10989854 DOI: 10.1002/imt2.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/20/2021] [Revised: 01/24/2022] [Accepted: 01/24/2022] [Indexed: 06/14/2024]
Abstract
High-throughput sequencing has become an increasingly central component of microbiome research. The development of de Bruijn graph-based methods for assembling high-throughput sequencing data has been an important part of the broader adoption of sequencing as part of biological studies. Recent advances in the construction and representation of de Bruijn graphs have led to new approaches that utilize the de Bruijn graph data structure to aid in different biological analyses. One type of application of these methods has been in alternative approaches to the assembly of sequencing data like gene-targeted assembly, where only gene sequences are assembled out of larger metagenomes, and differential assembly, where sequences that are differentially present between two samples are assembled. de Bruijn graphs have also been applied for comparative genomics where they can be used to represent large sets of multiple genomes or metagenomes where structural features in the graphs can be used to identify variants, indels, and homologous regions in sequences. These de Bruijn graph-based representations of sequencing data have even begun to be applied to whole sequencing databases for large-scale searches and experiment discovery. de Bruijn graphs have played a central role in how high-throughput sequencing data is worked with, and the rapid development of new tools that rely on these data structures suggests that they will continue to play an important role in biology in the future.
Collapse
Affiliation(s)
- Keith Dufault‐Thompson
- Intramural Research ProgramNational Library of Medicine, National Institutes of HealthBethesdaMarylandUSA
| | - Xiaofang Jiang
- Intramural Research ProgramNational Library of Medicine, National Institutes of HealthBethesdaMarylandUSA
| |
Collapse
|
17
|
Krannich T, White WTJ, Niehus S, Holley G, Halldórsson BV, Kehr B. Population-scale detection of non-reference sequence variants using colored de Bruijn graphs. Bioinformatics 2021; 38:604-611. [PMID: 34726732 PMCID: PMC8756200 DOI: 10.1093/bioinformatics/btab749] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2021] [Revised: 09/27/2021] [Accepted: 10/28/2021] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION With the increasing throughput of sequencing technologies, structural variant (SV) detection has become possible across tens of thousands of genomes. Non-reference sequence (NRS) variants have drawn less attention compared with other types of SVs due to the computational complexity of detecting them. When using short-read data, the detection of NRS variants inevitably involves a de novo assembly which requires high-quality sequence data at high coverage. Previous studies have demonstrated how sequence data of multiple genomes can be combined for the reliable detection of NRS variants. However, the algorithms proposed in these studies have limited scalability to larger sets of genomes. RESULTS We introduce PopIns2, a tool to discover and characterize NRS variants in many genomes, which scales to considerably larger numbers of genomes than its predecessor PopIns. In this article, we briefly outline the PopIns2 workflow and highlight our novel algorithmic contributions. We developed an entirely new approach for merging contig assemblies of unaligned reads from many genomes into a single set of NRS using a colored de Bruijn graph. Our tests on simulated data indicate that the new merging algorithm ranks among the best approaches in terms of quality and reliability and that PopIns2 shows the best precision for a growing number of genomes processed. Results on the Polaris Diversity Cohort and a set of 1000 Icelandic human genomes demonstrate unmatched scalability for the application on population-scale datasets. AVAILABILITY AND IMPLEMENTATION The source code of PopIns2 is available from https://github.com/kehrlab/PopIns2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | | | - Sebastian Niehus
- Regensburg Center for Interventional Immunology (RCI), 93053 Regensburg, Germany
| | | | - Bjarni V Halldórsson
- deCODE Genetics, Reykjavík 102, Iceland,Department of Engineering, School of Technology, Reykjavík University, Reykjavík 102, Iceland
| | - Birte Kehr
- To whom correspondence should be addressed. or
| |
Collapse
|
18
|
Marchet C, Kerbiriou M, Limasset A. BLight: efficient exact associative structure for k-mers. Bioinformatics 2021; 37:2858-2865. [PMID: 33821954 DOI: 10.1093/bioinformatics/btab217] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/29/2020] [Revised: 02/18/2021] [Accepted: 04/01/2021] [Indexed: 02/02/2023] Open
Abstract
MOTIVATION A plethora of methods and applications share the fundamental need to associate information to words for high-throughput sequence analysis. Doing so for billions of k-mers is commonly a scalability problem, as exact associative indexes can be memory expensive. Recent works take advantage of overlaps between k-mers to leverage this challenge. Yet, existing data structures are either unable to associate information to k-mers or are not lightweight enough. RESULTS We present BLight, a static and exact data structure able to associate unique identifiers to k-mers and determine their membership in a set without false positive that scales to huge k-mer sets with a low memory cost. This index combines an extremely compact representation along with very fast queries. Besides, its construction is efficient and needs no additional memory. Our implementation achieves to index the k-mers from the human genome using 8 GB of RAM (23 bits per k-mer) within 10 min and the k-mers from the large axolotl genome using 63 GB of memory (27 bits per k-mer) within 76 min. Furthermore, while being memory efficient, the index provides a very high throughput: 1.4 million queries per second on a single CPU or 16.1 million using 12 cores. Finally, we also present how BLight can practically represent metagenomic and transcriptomic sequencing data to highlight its wide applicative range. AVAILABILITY AND IMPLEMENTATION We wrote the BLight index as an open source C++ library under the AGPL3 license available at github.com/Malfoy/BLight. It is designed as a user-friendly library and comes along with code usage samples.
Collapse
Affiliation(s)
- Camille Marchet
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| | - Mael Kerbiriou
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| | - Antoine Limasset
- University of Lille, CRIStAL CNRS, UMR 9189 - F-59000 Lille, France
| |
Collapse
|
19
|
Alipanahi B, Kuhnle A, Puglisi SJ, Salmela L, Boucher C. Succinct dynamic de Bruijn graphs. Bioinformatics 2021; 37:1946-1952. [PMID: 32462192 DOI: 10.1093/bioinformatics/btaa546] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
MOTIVATION The de Bruijn graph is one of the fundamental data structures for analysis of high throughput sequencing data. In order to be applicable to population-scale studies, it is essential to build and store the graph in a space- and time-efficient manner. In addition, due to the ever-changing nature of population studies, it has become essential to update the graph after construction, e.g. add and remove nodes and edges. Although there has been substantial effort on making the construction and storage of the graph efficient, there is a limited amount of work in building the graph in an efficient and mutable manner. Hence, most space efficient data structures require complete reconstruction of the graph in order to add or remove edges or nodes. RESULTS In this article, we present DynamicBOSS, a succinct representation of the de Bruijn graph that allows for an unlimited number of additions and deletions of nodes and edges. We compare our method with other competing methods and demonstrate that DynamicBOSS is the only method that supports both addition and deletion and is applicable to very large samples (e.g. greater than 15 billion k-mers). Competing dynamic methods, e.g. FDBG cannot be constructed on large scale datasets, or cannot support both addition and deletion, e.g. BiFrost. AVAILABILITY AND IMPLEMENTATION DynamicBOSS is publicly available at https://github.com/baharpan/dynboss. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bahar Alipanahi
- Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Alan Kuhnle
- Department of Computer Science, Florida State University, Tallahassee, FL 32306, USA
| | - Simon J Puglisi
- Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki 00014, Finland
| | - Leena Salmela
- Department of Computer Science, Helsinki Institute for Information Technology, University of Helsinki, Helsinki 00014, Finland
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, College of Engineering, University of Florida, Gainesville, FL 32611, USA
| |
Collapse
|
20
|
Danciu D, Karasikov M, Mustafa H, Kahles A, Rätsch G. Topology-based sparsification of graph annotations. Bioinformatics 2021; 37:i169-i176. [PMID: 34252940 PMCID: PMC8346655 DOI: 10.1093/bioinformatics/btab330] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Accepted: 05/03/2021] [Indexed: 01/03/2023] Open
Abstract
Motivation Since the amount of published biological sequencing data is growing exponentially, efficient methods for storing and indexing this data are more needed than ever to truly benefit from this invaluable resource for biomedical research. Labeled de Bruijn graphs are a frequently-used approach for representing large sets of sequencing data. While significant progress has been made to succinctly represent the graph itself, efficient methods for storing labels on such graphs are still rapidly evolving. Results In this article, we present RowDiff, a new technique for compacting graph labels by leveraging expected similarities in annotations of vertices adjacent in the graph. RowDiff can be constructed in linear time relative to the number of vertices and labels in the graph, and in space proportional to the graph size. In addition, construction can be efficiently parallelized and distributed, making the technique applicable to graphs with trillions of nodes. RowDiff can be viewed as an intermediary sparsification step of the original annotation matrix and can thus naturally be combined with existing generic schemes for compressed binary matrices. Experiments on 10 000 RNA-seq datasets show that RowDiff combined with multi-BRWT results in a 30% reduction in annotation footprint over Mantis-MST, the previously known most compact annotation representation. Experiments on the sparser Fungi subset of the RefSeq collection show that applying RowDiff sparsification reduces the size of individual annotation columns stored as compressed bit vectors by an average factor of 42. When combining RowDiff with a multi-BRWT representation, the resulting annotation is 26 times smaller than Mantis-MST. Availability and implementation RowDiff is implemented in C++ within the MetaGraph framework. The source code and the data used in the experiments are publicly available at https://github.com/ratschlab/row_diff.
Collapse
Affiliation(s)
- Daniel Danciu
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland.,Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland.,Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland.,Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Harun Mustafa
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland.,Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland.,Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland.,Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland.,Swiss Institute of Bioinformatics, Zurich, Switzerland
| | - Gunnar Rätsch
- Department of Computer Science, Biomedical Informatics Group, ETH Zurich, Zurich, Switzerland.,Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland.,Swiss Institute of Bioinformatics, Zurich, Switzerland.,Department of Biology, ETH Zurich, Zurich, Switzerland
| |
Collapse
|
21
|
Khan J, Patro R. Cuttlefish: fast, parallel and low-memory compaction of de Bruijn graphs from large-scale genome collections. Bioinformatics 2021; 37:i177-i186. [PMID: 34252958 PMCID: PMC8275350 DOI: 10.1093/bioinformatics/btab309] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
Abstract
Motivation The construction of the compacted de Bruijn graph from collections of reference genomes is a task of increasing interest in genomic analyses. These graphs are increasingly used as sequence indices for short- and long-read alignment. Also, as we sequence and assemble a greater diversity of genomes, the colored compacted de Bruijn graph is being used more and more as the basis for efficient methods to perform comparative genomic analyses on these genomes. Therefore, time- and memory-efficient construction of the graph from reference sequences is an important problem. Results We introduce a new algorithm, implemented in the tool Cuttlefish, to construct the (colored) compacted de Bruijn graph from a collection of one or more genome references. Cuttlefish introduces a novel approach of modeling de Bruijn graph vertices as finite-state automata, and constrains these automata’s state-space to enable tracking their transitioning states with very low memory usage. Cuttlefish is also fast and highly parallelizable. Experimental results demonstrate that it scales much better than existing approaches, especially as the number and the scale of the input references grow. On a typical shared-memory machine, Cuttlefish constructed the graph for 100 human genomes in under 9 h, using ∼29 GB of memory. On 11 diverse conifer plant genomes, the compacted graph was constructed by Cuttlefish in under 9 h, using ∼84 GB of memory. The only other tool completing these tasks on the hardware took over 23 h using ∼126 GB of memory, and over 16 h using ∼289 GB of memory, respectively. Availability and implementation Cuttlefish is implemented in C++14, and is available under an open source license at https://github.com/COMBINE-lab/cuttlefish. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Jamshed Khan
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, MD 20742, USA
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, MD 20742, USA
- To whom correspondence should be addressed.
| |
Collapse
|
22
|
Alanko J, Alipanahi B, Settle J, Boucher C, Gagie T. Buffering updates enables efficient dynamic de Bruijn graphs. Comput Struct Biotechnol J 2021; 19:4067-4078. [PMID: 34377371 PMCID: PMC8326735 DOI: 10.1016/j.csbj.2021.06.047] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Revised: 06/29/2021] [Accepted: 06/29/2021] [Indexed: 12/24/2022] Open
Abstract
MOTIVATION The de Bruijn graph has become a ubiquitous graph model for biological data ever since its initial introduction in the late 1990s. It has been used for a variety of purposes including genome assembly (Zerbino and Birney, 2008; Bankevich et al., 2012; Peng et al., 2012), variant detection (Alipanahi et al., 2020b; Iqbal et al., 2012), and storage of assembled genomes (Chikhi et al., 2016). For this reason, there have been over a dozen methods for building and representing the de Bruijn graph and its variants in a space and time efficient manner. RESULTS With the exception of a few data structures (Muggli et al., 2019; Holley and Melsted, 2020; Crawford et al.,2018), compressed and compact de Bruijn graphs do not allow for the graph to be efficiently updated, meaning that data can be added or deleted. The most recent compressed dynamic de Bruijn graph (Alipanahi et al., 2020a), relies on dynamic bit vectors which are slow in theory and practice. To address this shortcoming, we present a compressed dynamic de Bruijn graph that removes the necessity of dynamic bit vectors by buffering data that should be added or removed from the graph. We implement our method, which we refer to as BufBOSS, and compare its performance to Bifrost, DynamicBOSS, and FDBG. Our experiments demonstrate that BufBOSS achieves attractive trade-offs compared to other tools in terms of time, memory and disk, and has the best deletion performance by an order of magnitude.
Collapse
Affiliation(s)
- Jarno Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Bahar Alipanahi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Jonathen Settle
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Travis Gagie
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| |
Collapse
|
23
|
Almodaresi F, Zakeri M, Patro R. Puffaligner : A Fast, Efficient, and Accurate Aligner Based on the Pufferfish Index. Bioinformatics 2021; 37:4048-4055. [PMID: 34117875 PMCID: PMC9502150 DOI: 10.1093/bioinformatics/btab408] [Citation(s) in RCA: 17] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2020] [Revised: 04/30/2021] [Accepted: 06/11/2021] [Indexed: 12/22/2022] Open
Abstract
Motivation Sequence alignment is one of the first steps in many modern genomic analyses, such as variant detection, transcript abundance estimation and metagenomic profiling. Unfortunately, it is often a computationally expensive procedure. As the quantity of data and wealth of different assays and applications continue to grow, the need for accurate and fast alignment tools that scale to large collections of reference sequences persists. Results In this article, we introduce PuffAligner, a fast, accurate and versatile aligner built on top of the Pufferfish index. PuffAligner is able to produce highly sensitive alignments, similar to those of Bowtie2, but much more quickly. While exhibiting similar speed to the ultrafast STAR aligner, PuffAligner requires considerably less memory to construct its index and align reads. PuffAligner strikes a desirable balance with respect to the time, space and accuracy tradeoffs made by different alignment tools and provides a promising foundation on which to test new alignment ideas over large collections of sequences. Availability and implementation All the data used for preparing the results of this paper can be found with 10.5281/zenodo.4902332. PuffAligner is a free and open-source software. It is implemented in C++14 and can be obtained from https://github.com/COMBINE-lab/pufferfish/tree/cigar-strings. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Mohsen Zakeri
- Computer Science Department, University of Maryland, College Park, USA
| | - Rob Patro
- Computer Science Department, University of Maryland, College Park, USA
| |
Collapse
|
24
|
Guo J, Pang E, Song H, Lin K. A tri-tuple coordinate system derived for fast and accurate analysis of the colored de Bruijn graph-based pangenomes. BMC Bioinformatics 2021; 22:282. [PMID: 34044757 PMCID: PMC8161984 DOI: 10.1186/s12859-021-04149-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2021] [Accepted: 04/25/2021] [Indexed: 11/25/2022] Open
Abstract
Background With the rapid development of accurate sequencing and assembly technologies, an increasing number of high-quality chromosome-level and haplotype-resolved assemblies of genomic sequences have been derived, from which there will be great opportunities for computational pangenomics. Although genome graphs are among the most useful models for pangenome representation, their structural complexity makes it difficult to present genome information intuitively, such as the linear reference genome. Thus, efficiently and accurately analyzing the genome graph spatial structure and coordinating the information remains a substantial challenge. Results We developed a new method, a colored superbubble (cSupB), that can overcome the complexity of graphs and organize a set of species- or population-specific haplotype sequences of interest. Based on this model, we propose a tri-tuple coordinate system that combines an offset value, topological structure and sample information. Additionally, cSupB provides a novel method that utilizes complete topological information and efficiently detects small indels (< 50 bp) for highly similar samples, which can be validated by simulated datasets. Moreover, we demonstrated that cSupB can adapt to the complex cycle structure. Conclusions Although the solution is made suitable for increasingly complex genome graphs by relaxing the constraint, the directed acyclic graph, the motif cSupB and the cSupB method can be extended to any colored directed acyclic graph. We anticipate that our method will facilitate the analysis of individual haplotype variants and population genomic diversity. We have developed a C + + program for implementing our method that is available at https://github.com/eggleader/cSupB. Supplementary information The online version contains supplementary material available at 10.1186/s12859-021-04149-w.
Collapse
Affiliation(s)
- Jindan Guo
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Erli Pang
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Hongtao Song
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China
| | - Kui Lin
- State Key Laboratory of Earth Surface Processes and Resource Ecology, Ministry of Education Key Laboratory for Biodiversity Science and Ecological Engineering, College of Life Sciences, Beijing Normal University, Beijing, China.
| |
Collapse
|
25
|
Alipanahi B, Muggli MD, Jundi M, Noyes NR, Boucher C. Metagenome SNP calling via read-colored de Bruijn graphs. Bioinformatics 2021; 36:5275-5281. [PMID: 32049324 DOI: 10.1093/bioinformatics/btaa081] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/16/2018] [Revised: 01/08/2020] [Accepted: 02/03/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Metagenomics refers to the study of complex samples containing of genetic contents of multiple individual organisms and, thus, has been used to elucidate the microbiome and resistome of a complex sample. The microbiome refers to all microbial organisms in a sample, and the resistome refers to all of the antimicrobial resistance (AMR) genes in pathogenic and non-pathogenic bacteria. Single-nucleotide polymorphisms (SNPs) can be effectively used to 'fingerprint' specific organisms and genes within the microbiome and resistome and trace their movement across various samples. However, to effectively use these SNPs for this traceability, a scalable and accurate metagenomics SNP caller is needed. Moreover, such an SNP caller should not be reliant on reference genomes since 95% of microbial species is unculturable, making the determination of a reference genome extremely challenging. In this article, we address this need. RESULTS We present LueVari, a reference-free SNP caller based on the read-colored de Bruijn graph, an extension of the traditional de Bruijn graph that allows repeated regions longer than the k-mer length and shorter than the read length to be identified unambiguously. LueVari is able to identify SNPs in both AMR genes and chromosomal DNA from shotgun metagenomics data with reliable sensitivity (between 91% and 99%) and precision (between 71% and 99%) as the performance of competing methods varies widely. Furthermore, we show that LueVari constructs sequences containing the variation, which span up to 97.8% of genes in datasets, which can be helpful in detecting distinct AMR genes in large metagenomic datasets. AVAILABILITY AND IMPLEMENTATION Code and datasets are publicly available at https://github.com/baharpan/cosmo/tree/LueVari. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Bahar Alipanahi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Martin D Muggli
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Musa Jundi
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Noelle R Noyes
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| | - Christina Boucher
- Department of Computer & Information Science & Engineering, University of Florida, Gainesville, FL 32611, USA
| |
Collapse
|
26
|
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021; 22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
Collapse
Affiliation(s)
- Karel Břinda
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
- Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
| | - Michael Baym
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
| | - Gregory Kucherov
- CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
Collapse
|
27
|
Shokrof M, Brown CT, Mansour TA. MQF and buffered MQF: quotient filters for efficient storage of k-mers with their counts and metadata. BMC Bioinformatics 2021; 22:71. [PMID: 33593271 PMCID: PMC7885209 DOI: 10.1186/s12859-021-03996-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2020] [Accepted: 02/04/2021] [Indexed: 11/30/2022] Open
Abstract
Background Specialized data structures are required for online algorithms to efficiently handle large sequencing datasets. The counting quotient filter (CQF), a compact hashtable, can efficiently store k-mers with a skewed distribution.
Result Here, we present the mixed-counters quotient filter (MQF) as a new variant of the CQF with novel counting and labeling systems. The new counting system adapts to a wider range of data distributions for increased space efficiency and is faster than the CQF for insertions and queries in most of the tested scenarios. A buffered version of the MQF can offload storage to disk, trading speed of insertions and queries for a significant memory reduction. The labeling system provides a flexible framework for assigning labels to member items while maintaining good data locality and a concise memory representation. These labels serve as a minimal perfect hash function but are ~ tenfold faster than BBhash, with no need to re-analyze the original data for further insertions or deletions. Conclusions The MQF is a flexible and efficient data structure that extends our ability to work with high throughput sequencing data.
Collapse
Affiliation(s)
- Moustafa Shokrof
- Department of Computer Science, University of California, Davis, CA, USA
| | - C Titus Brown
- Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, CA, USA
| | - Tamer A Mansour
- Department of Population Health and Reproduction, School of Veterinary Medicine, University of California, Davis, CA, USA. .,Department of Clinical Pathology, School of Medicine, University of Mansoura, Mansoura, Egypt.
| |
Collapse
|
28
|
Schulz T, Wittler R, Rahmann S, Hach F, Stoye J. Detecting High Scoring Local Alignments in Pangenome Graphs. Bioinformatics 2021; 37:2266-2274. [PMID: 33532821 PMCID: PMC8388040 DOI: 10.1093/bioinformatics/btab077] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/24/2020] [Revised: 12/02/2020] [Accepted: 01/29/2021] [Indexed: 11/23/2022] Open
Abstract
Motivation Increasing amounts of individual genomes sequenced per species motivate the usage of pangenomic approaches. Pangenomes may be represented as graphical structures, e.g. compacted colored de Bruijn graphs, which offer a low memory usage and facilitate reference-free sequence comparisons. While sequence-to-graph mapping to graphical pangenomes has been studied for some time, no local alignment search tool in the vein of BLAST has been proposed yet. Results We present a new heuristic method to find maximum scoring local alignments of a DNA query sequence to a pangenome represented as a compacted colored de Bruijn graph. Our approach additionally allows a comparison of similarity among sequences within the pangenome. We show that local alignment scores follow an exponential-tail distribution similar to BLAST scores, and we discuss how to estimate its parameters to separate local alignments representing sequence homology from spurious findings. An implementation of our method is presented, and its performance and usability are shown. Our approach scales sublinearly in running time and memory usage with respect to the number of genomes under consideration. This is an advantage over classical methods that do not make use of sequence similarity within the pangenome. Availability and implementation Source code and test data are available from https://gitlab.ub.uni-bielefeld.de/gi/plast. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Tizian Schulz
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, 33615, Germany.,Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, 33615, Germany.,Graduate School "Digital Infrastructure for the Life Sciences" (DILS), Bielefeld University, Bielefeld, 33615, Germany
| | - Roland Wittler
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, 33615, Germany.,Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, 33615, Germany
| | - Sven Rahmann
- Genome Informatics, Institute of Human Genetics, University Hospital Essen, University of Duisburg-Essen, Essen, 45122, Germany
| | - Faraz Hach
- Vancouver Prostate Centre, Vancouver, V6H 3Z6, Canada.,Department of Urologic Sciences, University of British Columbia, Vancouver, V6T 1Z4, Canada
| | - Jens Stoye
- Faculty of Technology and Center for Biotechnology (CeBiTec), Bielefeld University, Bielefeld, 33615, Germany.,Bielefeld Institute for Bioinformatics Infrastructure (BIBI), Bielefeld University, Bielefeld, 33615, Germany
| |
Collapse
|
29
|
Marchet C, Boucher C, Puglisi SJ, Medvedev P, Salson M, Chikhi R. Data structures based on k-mers for querying large collections of sequencing data sets. Genome Res 2021; 31:1-12. [PMID: 33328168 PMCID: PMC7849385 DOI: 10.1101/gr.260604.119] [Citation(s) in RCA: 50] [Impact Index Per Article: 12.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/29/2019] [Accepted: 09/14/2020] [Indexed: 12/19/2022]
Abstract
High-throughput sequencing data sets are usually deposited in public repositories (e.g., the European Nucleotide Archive) to ensure reproducibility. As the amount of data has reached petabyte scale, repositories do not allow one to perform online sequence searches, yet, such a feature would be highly useful to investigators. Toward this goal, in the last few years several computational approaches have been introduced to index and query large collections of data sets. Here, we propose an accessible survey of these approaches, which are generally based on representing data sets as sets of k-mers. We review their properties, introduce a classification, and present their general intuition. We summarize their performance and highlight their current strengths and limitations.
Collapse
Affiliation(s)
- Camille Marchet
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, Florida 32611, USA
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, FI-00014, Helsinki, Finland
| | - Paul Medvedev
- Department of Computer Science, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
- Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
| | - Mikaël Salson
- Université de Lille, CNRS, CRIStAL UMR 9189, F-59000 Lille, France
| | - Rayan Chikhi
- Institut Pasteur & CNRS, C3BI USR 3756, F-75015 Paris, France
| |
Collapse
|
30
|
Holley G, Melsted P. Bifrost: highly parallel construction and indexing of colored and compacted de Bruijn graphs. Genome Biol 2020; 21:249. [PMID: 32943081 PMCID: PMC7499882 DOI: 10.1186/s13059-020-02135-8] [Citation(s) in RCA: 95] [Impact Index Per Article: 19.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/13/2019] [Accepted: 08/06/2020] [Indexed: 02/07/2023] Open
Abstract
Memory consumption of de Bruijn graphs is often prohibitive. Most de Bruijn graph-based assemblers reduce the complexity by compacting paths into single vertices, but this is challenging as it requires the uncompacted de Bruijn graph to be available in memory. We present a parallel and memory-efficient algorithm enabling the direct construction of the compacted de Bruijn graph without producing the intermediate uncompacted graph. Bifrost features a broad range of functions, such as indexing, editing, and querying the graph, and includes a graph coloring method that maps each k-mer of the graph to the genomes it occurs in.Availability https://github.com/pmelsted/bifrost.
Collapse
Affiliation(s)
- Guillaume Holley
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland.
| | - Páll Melsted
- Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavík, Iceland
| |
Collapse
|
31
|
Garimella KV, Iqbal Z, Krause MA, Campino S, Kekre M, Drury E, Kwiatkowski D, Sá JM, Wellems TE, McVean G. Detection of simple and complex de novo mutations with multiple reference sequences. Genome Res 2020; 30:1154-1169. [PMID: 32817236 PMCID: PMC7462078 DOI: 10.1101/gr.255505.119] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/02/2019] [Accepted: 07/17/2020] [Indexed: 12/25/2022]
Abstract
The characterization of de novo mutations in regions of high sequence and structural diversity from whole-genome sequencing data remains highly challenging. Complex structural variants tend to arise in regions of high repetitiveness and low complexity, challenging both de novo assembly, in which short reads do not capture the long-range context required for resolution, and mapping approaches, in which improper alignment of reads to a reference genome that is highly diverged from that of the sample can lead to false or partial calls. Long-read technologies can potentially solve such problems but are currently unfeasible to use at scale. Here we present Corticall, a graph-based method that combines the advantages of multiple technologies and prior data sources to detect arbitrary classes of genetic variant. We construct multisample, colored de Bruijn graphs from short-read data for all samples, align long-read–derived haplotypes and multiple reference data sources to restore graph connectivity information, and call variants using graph path-finding algorithms and a model for simultaneous alignment and recombination. We validate and evaluate the approach using extensive simulations and use it to characterize the rate and spectrum of de novo mutation events in 119 progeny from four Plasmodium falciparum experimental crosses, using long-read data on the parents to inform reconstructions of the progeny and to detect several known and novel nonallelic homologous recombination events.
Collapse
Affiliation(s)
- Kiran V Garimella
- Data Sciences Platform, Broad Institute of MIT and Harvard, Cambridge, Massachusetts 02142, USA.,Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
| | - Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SD, United Kingdom
| | - Michael A Krause
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom.,Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Susana Campino
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Mihir Kekre
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Eleanor Drury
- The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Dominic Kwiatkowski
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom.,The Wellcome Trust Sanger Institute, Wellcome Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, United Kingdom
| | - Juliana M Sá
- Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Thomas E Wellems
- Laboratory of Malaria and Vector Research, National Institute of Allergy and Infectious Diseases, National Institutes of Health, Bethesda, Maryland 20892, USA
| | - Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, Oxfordshire, OX3 7BN, United Kingdom.,Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, Oxfordshire, OX3 7LF, United Kingdom
| |
Collapse
|
32
|
Marchet C, Iqbal Z, Gautheret D, Salson M, Chikhi R. REINDEER: efficient indexing of k-mer presence and abundance in sequencing datasets. Bioinformatics 2020; 36:i177-i185. [PMID: 32657392 PMCID: PMC7355249 DOI: 10.1093/bioinformatics/btaa487] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/24/2022] Open
Abstract
MOTIVATION In this work we present REINDEER, a novel computational method that performs indexing of sequences and records their abundances across a collection of datasets. To the best of our knowledge, other indexing methods have so far been unable to record abundances efficiently across large datasets. RESULTS We used REINDEER to index the abundances of sequences within 2585 human RNA-seq experiments in 45 h using only 56 GB of RAM. This makes REINDEER the first method able to record abundances at the scale of ∼4 billion distinct k-mers across 2585 datasets. REINDEER also supports exact presence/absence queries of k-mers. Briefly, REINDEER constructs the compacted de Bruijn graph of each dataset, then conceptually merges those de Bruijn graphs into a single global one. Then, REINDEER constructs and indexes monotigs, which in a nutshell are groups of k-mers of similar abundances. AVAILABILITY AND IMPLEMENTATION https://github.com/kamimrcht/REINDEER. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Camille Marchet
- CNRS, UMR 9189 – CRIStAL, Université de Lille, F-59000 Lille, France
| | - Zamin Iqbal
- European Bioinformatics Institute, Cambridge CB10 1SD, UK
| | - Daniel Gautheret
- CEA, CNRS, Institute for Integrative Biology of the Cell (I2BC), Université Paris-Saclay, Gif-sur-Yvette 91190, France
| | - Mikaël Salson
- CNRS, UMR 9189 – CRIStAL, Université de Lille, F-59000 Lille, France
| | - Rayan Chikhi
- Institut Pasteur, CNRS, C3BI – USR 3756, 75015 Paris, France
| |
Collapse
|
33
|
Minkin I, Medvedev P. Scalable Pairwise Whole-Genome Homology Mapping of Long Genomes with BubbZ. iScience 2020; 23:101224. [PMID: 32563153 PMCID: PMC7303978 DOI: 10.1016/j.isci.2020.101224] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/20/2020] [Revised: 05/25/2020] [Accepted: 05/28/2020] [Indexed: 11/17/2022] Open
Abstract
Pairwise whole-genome homology mapping is the problem of finding all pairs of homologous intervals between a pair of genomes. As the number of available whole genomes has been rising dramatically in the last few years, there has been a need for more scalable homology mappers. In this paper, we develop an algorithm (BubbZ) for computing whole-genome pairwise homology mappings, especially in the context of all-to-all comparison for multiple genomes. BubbZ is based on an algorithm for computing chains in compacted de Bruijn graphs. We evaluate BubbZ on simulated datasets, a dataset composed of 16 long mouse genomes, and a large dataset of 1,600 Salmonella genomes. We show up to approximately an order of magnitude speed improvement, compared with MashMap2 and Minimap2, while retaining similar accuracy.
Collapse
Affiliation(s)
- Ilia Minkin
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA.
| | - Paul Medvedev
- Department of Computer Science and Engineering, The Pennsylvania State University, University Park, PA 16802, USA; Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, PA 16802, USA; Center for Computational Biology and Bioinformatics, The Pennsylvania State University, University Park, PA 16802, USA.
| |
Collapse
|
34
|
Wittler R. Alignment- and reference-free phylogenomics with colored de Bruijn graphs. Algorithms Mol Biol 2020; 15:4. [PMID: 32280365 PMCID: PMC7137503 DOI: 10.1186/s13015-020-00164-3] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/15/2019] [Accepted: 03/21/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The increasing amount of available genome sequence data enables large-scale comparative studies. A common task is the inference of phylogenies-a challenging task if close reference sequences are not available, genome sequences are incompletely assembled, or the high number of genomes precludes multiple sequence alignment in reasonable time. RESULTS We present a new whole-genome based approach to infer phylogenies that is alignment- and reference-free. In contrast to other methods, it does not rely on pairwise comparisons to determine distances to infer edges in a tree. Instead, a colored de Bruijn graph is constructed, and information on common subsequences is extracted to infer phylogenetic splits. CONCLUSIONS The introduced new methodology for large-scale phylogenomics shows high potential. Application to different datasets confirms robustness of the approach. A comparison to other state-of-the-art whole-genome based methods indicates comparable or higher accuracy and efficiency.
Collapse
Affiliation(s)
- Roland Wittler
- Genome Informatics, Faculty of Technology, Bielefeld University, Bielefeld, Germany
- Center for Biotechnology, Bielefeld University, Bielefeld, Germany
- Center for Biotechnology, Bielefeld Institute for Bioinformatics Infrastructure, Bielefeld University, Bielefeld, Germany
| |
Collapse
|
35
|
Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. J Comput Biol 2020; 27:485-499. [PMID: 32176522 PMCID: PMC7185321 DOI: 10.1089/cmb.2019.0322] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).
Collapse
Affiliation(s)
- Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, Maryland
| | - Prashant Pandey
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
| | - Michael Ferdman
- Department of Computer Science, Stony Brook University, Stony Brook, New York
| | - Rob Johnson
- Department of Computer Science, Stony Brook University, Stony Brook, New York
- VMware Research, Palo Alto, California
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland
| |
Collapse
|
36
|
Karasikov M, Mustafa H, Joudaki A, Javadzadeh-no S, Rätsch G, Kahles A. Sparse Binary Relation Representations for Genome Graph Annotation. J Comput Biol 2020; 27:626-639. [PMID: 31891531 PMCID: PMC7185347 DOI: 10.1089/cmb.2019.0324] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/18/2023] Open
Abstract
High-throughput DNA sequencing data are accumulating in public repositories, and efficient approaches for storing and indexing such data are in high demand. In recent research, several graph data structures have been proposed to represent large sets of sequencing data and to allow for efficient querying of sequences. In particular, the concept of labeled de Bruijn graphs has been explored by several groups. Although there has been good progress toward representing the sequence graph in small space, methods for storing a set of labels on top of such graphs are still not sufficiently explored. It is also currently not clear how characteristics of the input data, such as the sparsity and correlations of labels, can help to inform the choice of method to compress the graph labeling. In this study, we present a new compression approach, Multi-binary relation wavelet tree (BRWT), which is adaptive to different kinds of input data. We show an up to 29% improvement in compression performance over the basic BRWT method, and up to a 68% improvement over the current state-of-the-art for de Bruijn graph label compression. To put our results into perspective, we present a systematic analysis of five different state-of-the-art annotation compression schemes, evaluate key metrics on both artificial and real-world data, and discuss how different data characteristics influence the compression performance. We show that the improvements of our new method can be robustly reproduced for different representative real-world data sets.
Collapse
Affiliation(s)
- Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- University Hospital Zurich, Zurich, Switzerland
| | - Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- University Hospital Zurich, Zurich, Switzerland
| | - Amir Joudaki
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- University Hospital Zurich, Zurich, Switzerland
| | | | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- University Hospital Zurich, Zurich, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, Lausanne, Switzerland
- University Hospital Zurich, Zurich, Switzerland
| |
Collapse
|
37
|
Abstract
Since the early days of the genome era, the scientific community has relied on a single 'reference' genome for each species, which is used as the basis for a wide range of genetic analyses, including studies of variation within and across species. As sequencing costs have dropped, thousands of new genomes have been sequenced, and scientists have come to realize that a single reference genome is inadequate for many purposes. By sampling a diverse set of individuals, one can begin to assemble a pan-genome: a collection of all the DNA sequences that occur in a species. Here we review efforts to create pan-genomes for a range of species, from bacteria to humans, and we further consider the computational methods that have been proposed in order to capture, interpret and compare pan-genome data. As scientists continue to survey and catalogue the genomic variation across human populations and begin to assemble a human pan-genome, these efforts will increase our power to connect variation to human diversity, disease and beyond.
Collapse
Affiliation(s)
- Rachel M Sherman
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA.
| | - Steven L Salzberg
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Center for Computational Biology, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Biomedical Engineering, Johns Hopkins School of Medicine, Baltimore, MD, USA
- Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
| |
Collapse
|
38
|
San JE, Baichoo S, Kanzi A, Moosa Y, Lessells R, Fonseca V, Mogaka J, Power R, de Oliveira T. Current Affairs of Microbial Genome-Wide Association Studies: Approaches, Bottlenecks and Analytical Pitfalls. Front Microbiol 2020; 10:3119. [PMID: 32082269 PMCID: PMC7002396 DOI: 10.3389/fmicb.2019.03119] [Citation(s) in RCA: 46] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2019] [Accepted: 12/24/2019] [Indexed: 12/12/2022] Open
Abstract
Microbial genome-wide association studies (mGWAS) are a new and exciting research field that is adapting human GWAS methods to understand how variations in microbial genomes affect host or pathogen phenotypes, such as drug resistance, virulence, host specificity and prognosis. Several computational tools and methods have been developed or adapted from human GWAS to facilitate the discovery of novel mutations and structural variations that are associated with the phenotypes of interest. However, no comprehensive, end-to-end, user-friendly tool is currently available. The development of a broadly applicable pipeline presents a real opportunity among computational biologists. Here, (i) we review the prominent and promising tools, (ii) discuss analytical pitfalls and bottlenecks in mGWAS, (iii) provide insights into the selection of appropriate tools, (iv) highlight the gaps that still need to be filled and how users and developers can work together to overcome these bottlenecks. Use of mGWAS research can inform drug repositioning decisions as well as accelerate the discovery and development of more effective vaccines and antimicrobials for pressing infectious diseases of global health significance, such as HIV, TB, influenza, and malaria.
Collapse
Affiliation(s)
- James Emmanuel San
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
| | - Shakuntala Baichoo
- Department of Digital Technologies, FoICDT, University of Mauritius, Réduit, Mauritius
| | - Aquillah Kanzi
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
| | - Yumna Moosa
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
| | - Richard Lessells
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
| | - Vagner Fonseca
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
- Laboratório de Genética Celular e Molecular, ICB, Universidade Federal de Minas Gerais, Belo Horizonte, Brazil
| | - John Mogaka
- Discipline of Public Health, University of Kwazulu-Natal, Durban, South Africa
| | - Robert Power
- St Edmund Hall, Oxford University, Oxford, United Kingdom
| | - Tulio de Oliveira
- Kwazulu-Natal Research and Innovation Sequencing Platform (KRISP), College of Health Sciences, University of KwaZulu-Natal, Durban, South Africa
- Department of Global Health, University of Washington, Seattle, WA, United States
| |
Collapse
|
39
|
Overlap graphs and de Bruijn graphs: data structures for de novo genome assembly in the big data era. QUANTITATIVE BIOLOGY 2019. [DOI: 10.1007/s40484-019-0181-x] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
40
|
Abstract
A wave of technologies transformed sequencing over a decade ago into the high-throughput era, demanding research in new computational methods to analyze these data. The applications of these sequencing technologies have continuously expanded since then. The RECOMB Satellite Workshop on Massively Parallel Sequencing (RECOMB-Seq) meeting, established in 2011, brings together leading researchers in computational genomics and genomic biology to discuss emerging frontiers in algorithm development for massively parallel sequencing data. The ninth edition of this workshop was held in Washington, DC, in George Washington University on May 3 and 4, 2019. There was an exploration of several traditional topics in sequence analysis, including genome assembly, sequence alignment, and data compression, and development of methods for new sequencing technologies, including linked reads and single-molecule long-read sequencing. Here we revisit these topics and discuss the current status and perspectives of sequencing technologies and analyses.
Collapse
Affiliation(s)
- Vikas Bansal
- Department of Pediatrics, School of Medicine, University of California, San Diego, La Jolla, CA, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| |
Collapse
|
41
|
Almodaresi F, Sarkar H, Srivastava A, Patro R. A space and time-efficient index for the compacted colored de Bruijn graph. Bioinformatics 2019; 34:i169-i177. [PMID: 29949982 PMCID: PMC6022659 DOI: 10.1093/bioinformatics/bty292] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023] Open
Abstract
Motivation Indexing reference sequences for search—both individual genomes and collections of genomes—is an important building block for many sequence analysis tasks. Much work has been dedicated to developing full-text indices for genomic sequences, based on data structures such as the suffix array, the BWT and the FM-index. However, the de Bruijn graph, commonly used for sequence assembly, has recently been gaining attention as an indexing data structure, due to its natural ability to represent multiple references using a graphical structure, and to collapse highly-repetitive sequence regions. Yet, much less attention has been given as to how to best index such a structure, such that queries can be performed efficiently and memory usage remains practical as the size and number of reference sequences being indexed grows large. Results We present a novel data structure for representing and indexing the compacted colored de Bruijn graph, which allows for efficient pattern matching and retrieval of the reference information associated with each k-mer. As the popularity of the de Bruijn graph as an index has increased over the past few years, so have the number of proposed representations of this structure. Existing structures typically fall into two categories; those that are hashing-based and provide very fast access to the underlying k-mer information, and those that are space-frugal and provide asymptotically efficient but practically slower pattern search. Our representation achieves a compromise between these two extremes. By building upon minimum perfect hashing and making use of succinct representations where applicable, our data structure provides practically fast lookup while greatly reducing the space compared to traditional hashing-based implementations. Further, we describe a sampling scheme for this index, which provides the ability to trade off query speed for a reduction in the index size. We believe this representation strikes a desirable balance between speed and space usage, and allows for fast search on large reference sequences. Finally, we describe an application of this index to the taxonomic read assignment problem. We show that by adopting, essentially, the approach of Kraken, but replacing k-mer presence with coverage by chains of consistent unique maximal matches, we can improve the space, speed and accuracy of taxonomic read assignment. Availability and implementation pufferfish is written in C++11, is open source, and is available at https://github.com/COMBINE-lab/pufferfish. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Fatemeh Almodaresi
- Department of Computer Science, Stony Brook University, Stonybrook, NY, USA
| | - Hirak Sarkar
- Department of Computer Science, Stony Brook University, Stonybrook, NY, USA
| | - Avi Srivastava
- Department of Computer Science, Stony Brook University, Stonybrook, NY, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stonybrook, NY, USA
| |
Collapse
|
42
|
Harris RS, Medvedev P. Improved representation of sequence bloom trees. Bioinformatics 2019; 36:721-727. [PMID: 31504157 PMCID: PMC8215923 DOI: 10.1093/bioinformatics/btz662] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/25/2019] [Revised: 08/15/2019] [Accepted: 08/20/2019] [Indexed: 01/31/2023] Open
Abstract
MOTIVATION Algorithmic solutions to index and search biological databases are a fundamental part of bioinformatics, providing underlying components to many end-user tools. Inexpensive next generation sequencing has filled publicly available databases such as the Sequence Read Archive beyond the capacity of traditional indexing methods. Recently, the Sequence Bloom Tree (SBT) and its derivatives were proposed as a way to efficiently index such data for queries about transcript presence. RESULTS We build on the SBT framework to construct the HowDe-SBT data structure, which uses a novel partitioning of information to reduce the construction and query time as well as the size of the index. Compared to previous SBT methods, on real RNA-seq data, HowDe-SBT can construct the index in less than 36% of the time and with 39% less space and can answer small-batch queries at least five times faster. We also develop a theoretical framework in which we can analyze and bound the space and query performance of HowDe-SBT compared to other SBT methods. AVAILABILITY AND IMPLEMENTATION HowDe-SBT is available as a free open source program on https://github.com/medvedevgroup/HowDeSBT. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
43
|
Abstract
MOTIVATION There exist several large genomic and metagenomic data collection efforts, including GenomeTrakr and MetaSub, which are routinely updated with new data. To analyze such datasets, memory-efficient methods to construct and store the colored de Bruijn graph were developed. Yet, a problem that has not been considered is constructing the colored de Bruijn graph in a scalable manner that allows new data to be added without reconstruction. This problem is important for large public datasets as scalability is needed but also the ability to update the construction is also needed. RESULTS We create a method for constructing the colored de Bruijn graph for large datasets that is based on partitioning the data into smaller datasets, building the colored de Bruijn graph using a FM-index based representation, and succinctly merging these representations to build a single graph. The last step, merging succinctly, is the algorithmic challenge which we solve in this article. We refer to the resulting method as VariMerge. This construction method also allows the graph to be updated with new data. We validate our approach and show it produces a three-fold reduction in working space when constructing a colored de Bruijn graph for 8000 strains. Lastly, we compare VariMerge to other competing methods-including Vari, Rainbowfish, Mantis, Bloom Filter Trie, the method of Almodaresi et al. and Multi-BRWT-and illustrate that VariMerge is the only method that is capable of building the colored de Bruijn graph for 16 000 strains in a manner that allows it to be updated. Competing methods either did not scale to this large of a dataset or do not allow for additions without reconstruction. AVAILABILITY AND IMPLEMENTATION VariMerge is available at https://github.com/cosmo-team/cosmo/tree/VARI-merge under GPLv3 license. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Martin D Muggli
- Department of Computer Science, Colorado State University, Fort Collins, CO, USA
| | - Bahar Alipanahi
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| | - Christina Boucher
- Department of Computer and Information Science and Engineering, University of Florida, Gainesville, FL, USA
| |
Collapse
|
44
|
Egidi L, Louza FA, Manzini G, Telles GP. External memory BWT and LCP computation for sequence collections with applications. Algorithms Mol Biol 2019; 14:6. [PMID: 30899322 PMCID: PMC6408864 DOI: 10.1186/s13015-019-0140-0] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/06/2018] [Accepted: 02/23/2019] [Indexed: 11/10/2022] Open
Abstract
Background Sequencing technologies produce larger and larger collections of biosequences that have to be stored in compressed indices supporting fast search operations. Many compressed indices are based on the Burrows–Wheeler Transform (BWT) and the longest common prefix (LCP) array. Because of the sheer size of the input it is important to build these data structures in external memory and time using in the best possible way the available RAM. Results We propose a space-efficient algorithm to compute the BWT and LCP array for a collection of sequences in the external or semi-external memory setting. Our algorithm splits the input collection into subcollections sufficiently small that it can compute their BWT in RAM using an optimal linear time algorithm. Next, it merges the partial BWTs in external or semi-external memory and in the process it also computes the LCP values. Our algorithm can be modified to output two additional arrays that, combined with the BWT and LCP array, provide simple, scan-based, external memory algorithms for three well known problems in bioinformatics: the computation of maximal repeats, the all pairs suffix–prefix overlaps, and the construction of succinct de Bruijn graphs. Conclusions We prove that our algorithm performs \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$${\mathcal {O}}(n\, \mathsf {maxlcp})$$\end{document}O(nmaxlcp) sequential I/Os, where n is the total length of the collection and \documentclass[12pt]{minimal}
\usepackage{amsmath}
\usepackage{wasysym}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage{amsbsy}
\usepackage{mathrsfs}
\usepackage{upgreek}
\setlength{\oddsidemargin}{-69pt}
\begin{document}$$\mathsf {maxlcp}$$\end{document}maxlcp is the maximum LCP value. The experimental results show that our algorithm is only slightly slower than the state of the art for short sequences but it is up to 40 times faster for longer sequences or when the available RAM is at least equal to the size of the input.
Collapse
|
45
|
Ultrafast search of all deposited bacterial and viral genomic data. Nat Biotechnol 2019; 37:152-159. [PMID: 30718882 PMCID: PMC6420049 DOI: 10.1038/s41587-018-0010-1] [Citation(s) in RCA: 71] [Impact Index Per Article: 11.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2017] [Accepted: 12/20/2018] [Indexed: 02/07/2023]
Abstract
Exponentially increasing amounts of unprocessed bacterial and viral genomic sequence data are stored in the global archives. The ability to query these data for sequence search-terms would facilitate both basic research and applications such as real-time genomic epidemiology and surveillance. However, this is not possible with current methods. To solve this problem, we combine knowledge of microbial population genomics with computational methods devised for web-search to produce a searchable data structure named Bitsliced Genomic Signature Index (BIGSI). We indexed the entire global corpus of 447,833 bacterial and viral whole genome sequence datasets using 4 orders of magnitude less storage than previous methods. We applied our BIGSI search function to rapidly find resistance genes MCR-1/2/3, determine the host-range of 2827 plasmids, and quantify antibiotic resistance in archived datasets. Our index can grow incrementally as new (unprocessed or assembled) sequence datasets are deposited and can scale to millions of datasets.
Collapse
|
46
|
Mustafa H, Schilken I, Karasikov M, Eickhoff C, Rätsch G, Kahles A. Dynamic compression schemes for graph coloring. Bioinformatics 2019; 35:407-414. [PMID: 30020403 PMCID: PMC6530811 DOI: 10.1093/bioinformatics/bty632] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2018] [Revised: 06/22/2018] [Accepted: 07/16/2018] [Indexed: 11/30/2022] Open
Abstract
Motivation Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Ingo Schilken
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Carsten Eickhoff
- Brown Center for Biomedical Informatics, Brown University, Providence, RI, USA
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
Collapse
|
47
|
Bolger AM, Poorter H, Dumschott K, Bolger ME, Arend D, Osorio S, Gundlach H, Mayer KFX, Lange M, Scholz U, Usadel B. Computational aspects underlying genome to phenome analysis in plants. THE PLANT JOURNAL : FOR CELL AND MOLECULAR BIOLOGY 2019; 97:182-198. [PMID: 30500991 PMCID: PMC6849790 DOI: 10.1111/tpj.14179] [Citation(s) in RCA: 37] [Impact Index Per Article: 6.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Revised: 11/06/2018] [Accepted: 11/16/2018] [Indexed: 05/18/2023]
Abstract
Recent advances in genomics technologies have greatly accelerated the progress in both fundamental plant science and applied breeding research. Concurrently, high-throughput plant phenotyping is becoming widely adopted in the plant community, promising to alleviate the phenotypic bottleneck. While these technological breakthroughs are significantly accelerating quantitative trait locus (QTL) and causal gene identification, challenges to enable even more sophisticated analyses remain. In particular, care needs to be taken to standardize, describe and conduct experiments robustly while relying on plant physiology expertise. In this article, we review the state of the art regarding genome assembly and the future potential of pangenomics in plant research. We also describe the necessity of standardizing and describing phenotypic studies using the Minimum Information About a Plant Phenotyping Experiment (MIAPPE) standard to enable the reuse and integration of phenotypic data. In addition, we show how deep phenotypic data might yield novel trait-trait correlations and review how to link phenotypic data to genomic data. Finally, we provide perspectives on the golden future of machine learning and their potential in linking phenotypes to genomic features.
Collapse
Affiliation(s)
- Anthony M. Bolger
- Institute for Biology I, BioSCRWTH Aachen UniversityWorringer Weg 352074AachenGermany
| | - Hendrik Poorter
- Forschungszentrum Jülich (FZJ) Institute of Bio‐ and Geosciences (IBG‐2) Plant SciencesWilhelm‐Johnen‐Straße52428JülichGermany
- Department of Biological SciencesMacquarie UniversityNorth RydeNSW2109Australia
| | - Kathryn Dumschott
- Institute for Biology I, BioSCRWTH Aachen UniversityWorringer Weg 352074AachenGermany
| | - Marie E. Bolger
- Forschungszentrum Jülich (FZJ) Institute of Bio‐ and Geosciences (IBG‐2) Plant SciencesWilhelm‐Johnen‐Straße52428JülichGermany
| | - Daniel Arend
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) GaterslebenCorrensstraße 306466SeelandGermany
| | - Sonia Osorio
- Department of Molecular Biology and BiochemistryInstituto de Hortofruticultura Subtropical y Mediterránea “La Mayora”Universidad de Málaga‐Consejo Superior de Investigaciones CientíficasCampus de Teatinos29071MálagaSpain
| | - Heidrun Gundlach
- Plant Genome and Systems Biology (PGSB)Helmholtz Zentrum München (HMGU)Ingolstädter Landstraße 185764NeuherbergGermany
| | - Klaus F. X. Mayer
- Plant Genome and Systems Biology (PGSB)Helmholtz Zentrum München (HMGU)Ingolstädter Landstraße 185764NeuherbergGermany
| | - Matthias Lange
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) GaterslebenCorrensstraße 306466SeelandGermany
| | - Uwe Scholz
- Leibniz Institute of Plant Genetics and Crop Plant Research (IPK) GaterslebenCorrensstraße 306466SeelandGermany
| | - Björn Usadel
- Institute for Biology I, BioSCRWTH Aachen UniversityWorringer Weg 352074AachenGermany
- Forschungszentrum Jülich (FZJ) Institute of Bio‐ and Geosciences (IBG‐2) Plant SciencesWilhelm‐Johnen‐Straße52428JülichGermany
| |
Collapse
|
48
|
Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index. Cell Syst 2018; 7:201-207.e4. [PMID: 29936185 PMCID: PMC10964368 DOI: 10.1016/j.cels.2018.05.021] [Citation(s) in RCA: 69] [Impact Index Per Article: 9.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2018] [Revised: 05/08/2018] [Accepted: 05/25/2018] [Indexed: 01/08/2023]
Abstract
Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.
Collapse
Affiliation(s)
- Prashant Pandey
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Fatemeh Almodaresi
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Michael A Bender
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Michael Ferdman
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Rob Johnson
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA; VMware Research, 3425 Hillview Ave, Palo Alto, CA 94304, USA
| | - Rob Patro
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA.
| |
Collapse
|
49
|
Turner I, Garimella KV, Iqbal Z, McVean G. Integrating long-range connectivity information into de Bruijn graphs. Bioinformatics 2018; 34:2556-2565. [PMID: 29554215 PMCID: PMC6061703 DOI: 10.1093/bioinformatics/bty157] [Citation(s) in RCA: 36] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/08/2017] [Revised: 11/25/2017] [Accepted: 03/14/2018] [Indexed: 12/27/2022] Open
Abstract
Motivation The de Bruijn graph is a simple and efficient data structure that is used in many areas of sequence analysis including genome assembly, read error correction and variant calling. The data structure has a single parameter k, is straightforward to implement and is tractable for large genomes with high sequencing depth. It also enables representation of multiple samples simultaneously to facilitate comparison. However, unlike the string graph, a de Bruijn graph does not retain long range information that is inherent in the read data. For this reason, applications that rely on de Bruijn graphs can produce sub-optimal results given their input data. Results We present a novel assembly graph data structure: the Linked de Bruijn Graph (LdBG). Constructed by adding annotations on top of a de Bruijn graph, it stores long range connectivity information through the graph. We show that with error-free data it is possible to losslessly store and recover sequence from a Linked de Bruijn graph. With assembly simulations we demonstrate that the LdBG data structure outperforms both our de Bruijn graph and the String Graph Assembler (SGA). Finally we apply the LdBG to Klebsiella pneumoniae short read data to make large (12 kbp) variant calls, which we validate using PacBio sequencing data, and to characterize the genomic context of drug-resistance genes. Availability and implementation Linked de Bruijn Graphs and associated algorithms are implemented as part of McCortex, which is available under the MIT license at https://github.com/mcveanlab/mccortex. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Isaac Turner
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
| | - Kiran V Garimella
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
| | - Zamin Iqbal
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- European Bioinformatics Institute (EMBL-EBI), Wellcome Genome Campus, Hinxton, UK
| | - Gil McVean
- Wellcome Trust Centre for Human Genetics, University of Oxford, Oxford, UK
- Big Data Institute, Li Ka Shing Centre for Health Information and Discovery, University of Oxford, Oxford, UK
| |
Collapse
|
50
|
Farruggia A, Gagie T, Navarro G, Puglisi SJ, Sirén J. Relative Suffix Trees. THE COMPUTER JOURNAL 2018; 61:773-788. [PMID: 29795706 PMCID: PMC5956352 DOI: 10.1093/comjnl/bxx108] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/12/2017] [Revised: 09/01/2017] [Indexed: 06/08/2023]
Abstract
Suffix trees are one of the most versatile data structures in stringology, with many applications in bioinformatics. Their main drawback is their size, which can be tens of times larger than the input sequence. Much effort has been put into reducing the space usage, leading ultimately to compressed suffix trees. These compressed data structures can efficiently simulate the suffix tree, while using space proportional to a compressed representation of the sequence. In this work, we take a new approach to compressed suffix trees for repetitive sequence collections, such as collections of individual genomes. We compress the suffix trees of individual sequences relative to the suffix tree of a reference sequence. These relative data structures provide competitive time/space trade-offs, being almost as small as the smallest compressed suffix trees for repetitive collections, and competitive in time with the largest and fastest compressed suffix trees.
Collapse
Affiliation(s)
- Andrea Farruggia
- Department of Computer Science, University of Pisa, Largo Bruno Pontecorvo 3, 56127 Pisa PI, Italy
| | - Travis Gagie
- CeBiB-Center for Biotechnology and Bioengineering, Santiago, Chile
- Escuela de Informática y Telecomunicaciones, Diego Portales University, Ejército 441, Santiago, Chile
| | - Gonzalo Navarro
- CeBiB-Center for Biotechnology and Bioengineering, Santiago, Chile
- Department of Computer Science, University of Chile, Beauchef 851, Santiago, Chile
| | - Simon J Puglisi
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Jouni Sirén
- Wellcome Trust Sanger Institute, Hinxton CB10 1SA, UK
| |
Collapse
|