1
|
Xie P, Guo Y, Teng Y, Zhou W, Yu Y. GeneMiner: A tool for extracting phylogenetic markers from next-generation sequencing data. Mol Ecol Resour 2024; 24:e13924. [PMID: 38197287 DOI: 10.1111/1755-0998.13924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/16/2023] [Revised: 12/14/2023] [Accepted: 12/21/2023] [Indexed: 01/11/2024]
Abstract
The advancement of next-generation sequencing (NGS) technologies has been revolutionary for the field of evolutionary biology. This technology has led to an abundance of available genomes and transcriptomes for researchers to mine. Specifically, researchers can mine for various types of molecular markers that are vital for phylogenetic, evolutionary and ecological studies. Numerous tools have been developed to extract these molecular markers from NGS data. However, due to an insufficient number of well-annotated reference genomes for non-model organisms, it remains challenging to obtain these markers accurately and efficiently. Here, we present GeneMiner, an improved and expanded version of our previous tool, Easy353. GeneMiner combines the reference-guided de Bruijn graph assembly with seed self-discovery and greedy extension. Additionally, it includes a verification step using a parameter-bootstrap method to reduce the pitfalls associated with using a relatively distant reference. Our results, using both experimental and simulation data, showed GeneMiner can accurately acquire phylogenetic molecular markers for plants using transcriptomic, genomic and other NGS data. GeneMiner is designed to be user-friendly, fast and memory-efficient. Further, it is compatible with Linux, Windows and macOS. All source codes are publicly available on GitHub (https://github.com/sculab/GeneMiner) and Gitee (https://gitee.com/sculab/GeneMiner) for easy accessibility and transparency.
Collapse
Affiliation(s)
- Pulin Xie
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
| | - Yongling Guo
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
| | - Yue Teng
- State Key Laboratory of Pathogen and Biosecurity, Beijing Institute of Microbiology and Epidemiology, Beijing, China
| | - Wenbin Zhou
- Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, North Carolina, USA
| | - Yan Yu
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, China
| |
Collapse
|
2
|
Pibiri GE. On weighted k-mer dictionaries. Algorithms Mol Biol 2023; 18:3. [PMID: 37328897 DOI: 10.1186/s13015-023-00226-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/30/2023] [Accepted: 05/13/2023] [Indexed: 06/18/2023] Open
Abstract
We consider the problem of representing a set of [Formula: see text]-mers and their abundance counts, or weights, in compressed space so that assessing membership and retrieving the weight of a [Formula: see text]-mer is efficient. The representation is called a weighted dictionary of [Formula: see text]-mers and finds application in numerous tasks in Bioinformatics that usually count [Formula: see text]-mers as a pre-processing step. In fact, [Formula: see text]-mer counting tools produce very large outputs that may result in a severe bottleneck for subsequent processing. In this work we extend the recently introduced SSHash dictionary (Pibiri in Bioinformatics 38:185-194, 2022) to also store compactly the weights of the [Formula: see text]-mers. From a technical perspective, we exploit the order of the [Formula: see text]-mers represented in SSHash to encode runs of weights, hence allowing much better compression than the empirical entropy of the weights. We study the problem of reducing the number of runs in the weights to improve compression even further and give an optimal algorithm for this problem. Lastly, we corroborate our findings with experiments on real-world datasets and comparison with competitive alternatives. Up to date, SSHash is the only [Formula: see text]-mer dictionary that is exact, weighted, associative, fast, and small.
Collapse
Affiliation(s)
- Giulio Ermanno Pibiri
- Department of Environmental Sciences, Informatics and Statistics (DAIS), Ca' Foscari University of Venice, Venice, Italy.
- ISTI-CNR, Pisa, Italy.
| |
Collapse
|
3
|
Lu Y, Ge C, Cai B, Xu Q, Kong R, Chang S. Antibody sequences assembly method based on weighted de Bruijn graph. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2023; 20:6174-6190. [PMID: 37161102 DOI: 10.3934/mbe.2023266] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/11/2023]
Abstract
With the development of next-generation protein sequencing technologies, sequence assembly algorithm has become a key technology for de novo sequencing process. At present, the existing methods can address the assembly of an unknown single protein chain. However, for monoclonal antibodies with light and heavy chains, the assembly is still an unsolved question. To address this problem, we propose a new assembly method, DBAS, which integrates the quality scores and sequence alignment scores from de novo sequencing peptides into a weighted de Bruijn graph to assemble the final protein sequences. The established method is used to assembling sequences from two datasets with mixed light and heavy chains from antibodies. The results show that the DBAS can assemble long antibody sequences for both mixed light and heavy chains and single chains. In addition, DBAS is able to distinguish the light and heavy chains by using BLAST sequence alignment. The results show that the algorithm has good performance for both target sequence coverage and contig assembly accuracy.
Collapse
Affiliation(s)
- Yi Lu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Cheng Ge
- Key Laboratory of Marine Drugs, Chinese Ministry of Education, School of Medicine and Pharmacy, Ocean University of China, Qingdao 266003, China
| | - Biao Cai
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Qing Xu
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Ren Kong
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| | - Shan Chang
- Institute of Bioinformatics and Medical Engineering, School of Electrical and Information Engineering, Jiangsu University of Technology, Changzhou 213001, China
| |
Collapse
|
4
|
Zhang Z, Xie P, Guo Y, Zhou W, Liu E, Yu Y. Easy353: A Tool to Get Angiosperms353 Genes for Phylogenomic Research. Mol Biol Evol 2022; 39:6862883. [PMID: 36458838 PMCID: PMC9757696 DOI: 10.1093/molbev/msac261] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/01/2022] [Revised: 10/28/2022] [Accepted: 11/29/2022] [Indexed: 12/04/2022] Open
Abstract
The Angiosperms353 gene set (AGS) consists of a set of 353 universal low-copy nuclear genes that were selected by examining more than 600 angiosperm species. These genes can be used for phylogenetic studies and population genetics at multiple taxonomic scales. However, current pipelines are not able to recover Angiosperms353 genes efficiently and accurately from high-throughput sequences. Here, we developed Easy353, a reference-guided assembly tool to recover the AGS from high-throughput sequencing (HTS) data (including genome skimming, RNA-seq, and target enrichment). Easy353 is an open-source user-friendly assembler for diverse types of high-throughput data. It has a graphical user interface and a command-line interface that is compatible with all widely-used computer systems. Evaluations, based on both simulated and empirical data, suggest that Easy353 yields low rates of assembly errors.
Collapse
Affiliation(s)
- Zhen Zhang
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan, 610065, P. R. China
| | - Pulin Xie
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan, 610065, P. R. China
| | - Yongling Guo
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan, 610065, P. R. China
| | - Wenbin Zhou
- Department of Biology, University of North Carolina at Chapel Hill, Chapel Hill, NC 27599, USA
| | - Enyan Liu
- Key Laboratory of Bio-Resources and Eco-Environment of Ministry of Education, College of Life Sciences, Sichuan University, Chengdu, Sichuan, 610065, P. R. China
| | - Yan Yu
- Corresponding author: E-mail:
| |
Collapse
|
5
|
Almodaresi F, Khan J, Madaminov S, Ferdman M, Johnson R, Pandey P, Patro R. An incrementally updatable and scalable system for large-scale sequence search using the Bentley-Saxe transformation. Bioinformatics 2022; 38:3155-3163. [PMID: 35325039 PMCID: PMC9191210 DOI: 10.1093/bioinformatics/btac142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/14/2021] [Revised: 01/10/2022] [Accepted: 03/22/2022] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION In the past few years, researchers have proposed numerous indexing schemes for searching large datasets of raw sequencing experiments. Most of these proposed indexes are approximate (i.e. with one-sided errors) in order to save space. Recently, researchers have published exact indexes-Mantis, VariMerge and Bifrost-that can serve as colored de Bruijn graph representations in addition to serving as k-mer indexes. This new type of index is promising because it has the potential to support more complex analyses than simple searches. However, in order to be useful as indexes for large and growing repositories of raw sequencing data, they must scale to thousands of experiments and support efficient insertion of new data. RESULTS In this paper, we show how to build a scalable and updatable exact raw sequence-search index. Specifically, we extend Mantis using the Bentley-Saxe transformation to support efficient updates, called Dynamic Mantis. We demonstrate Dynamic Mantis's scalability by constructing an index of ≈40K samples from SRA by adding samples one at a time to an initial index of 10K samples. Compared to VariMerge and Bifrost, Dynamic Mantis is more efficient in terms of index-construction time and memory, query time and memory and index size. In our benchmarks, VariMerge and Bifrost scaled to only 5K and 80 samples, respectively, while Dynamic Mantis scaled to more than 39K samples. Queries were over 24× faster in Mantis than in Bifrost (VariMerge does not immediately support general search queries we require). Dynamic Mantis indexes were about 2.5× smaller than Bifrost's indexes and about half as big as VariMerge's indexes. AVAILABILITY AND IMPLEMENTATION Dynamic Mantis implementation is available at https://github.com/splatlab/mantis/tree/mergeMSTs. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
| | - Jamshed Khan
- Department of Computer Science, University of Maryland, USA
| | | | | | | | | | - Rob Patro
- To whom correspondence should be addressed.
| |
Collapse
|
6
|
Dufault‐Thompson K, Jiang X. Applications of de Bruijn graphs in microbiome research. IMETA 2022; 1:e4. [PMID: 38867733 PMCID: PMC10989854 DOI: 10.1002/imt2.4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 12/20/2021] [Revised: 01/24/2022] [Accepted: 01/24/2022] [Indexed: 06/14/2024]
Abstract
High-throughput sequencing has become an increasingly central component of microbiome research. The development of de Bruijn graph-based methods for assembling high-throughput sequencing data has been an important part of the broader adoption of sequencing as part of biological studies. Recent advances in the construction and representation of de Bruijn graphs have led to new approaches that utilize the de Bruijn graph data structure to aid in different biological analyses. One type of application of these methods has been in alternative approaches to the assembly of sequencing data like gene-targeted assembly, where only gene sequences are assembled out of larger metagenomes, and differential assembly, where sequences that are differentially present between two samples are assembled. de Bruijn graphs have also been applied for comparative genomics where they can be used to represent large sets of multiple genomes or metagenomes where structural features in the graphs can be used to identify variants, indels, and homologous regions in sequences. These de Bruijn graph-based representations of sequencing data have even begun to be applied to whole sequencing databases for large-scale searches and experiment discovery. de Bruijn graphs have played a central role in how high-throughput sequencing data is worked with, and the rapid development of new tools that rely on these data structures suggests that they will continue to play an important role in biology in the future.
Collapse
Affiliation(s)
- Keith Dufault‐Thompson
- Intramural Research ProgramNational Library of Medicine, National Institutes of HealthBethesdaMarylandUSA
| | - Xiaofang Jiang
- Intramural Research ProgramNational Library of Medicine, National Institutes of HealthBethesdaMarylandUSA
| |
Collapse
|
7
|
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021; 22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
Collapse
Affiliation(s)
- Karel Břinda
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
- Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
| | - Michael Baym
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
| | - Gregory Kucherov
- CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
Collapse
|
8
|
Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
Collapse
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
| | - Paul Medevedev
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
- Department of Biochemistry and Molecular Biology, Penn State, University Park, State College, PA, USA
- Center for Computational Biology and Bioinformatics, Penn State, University Park, State College, PA, USA
| |
Collapse
|
9
|
Jiang P, Luo J, Wang Y, Deng P, Schmidt B, Tang X, Chen N, Wong L, Zhao L. kmcEx: memory-frugal and retrieval-efficient encoding of counted k-mers. Bioinformatics 2020; 35:4871-4878. [PMID: 31038666 DOI: 10.1093/bioinformatics/btz299] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/29/2018] [Revised: 04/02/2019] [Accepted: 04/19/2019] [Indexed: 12/25/2022] Open
Abstract
MOTIVATION K-mers along with their frequency have served as an elementary building block for error correction, repeat detection, multiple sequence alignment, genome assembly, etc., attracting intensive studies in k-mer counting. However, the output of k-mer counters itself is large; very often, it is too large to fit into main memory, leading to highly narrowed usability. RESULTS We introduce a novel idea of encoding k-mers as well as their frequency, achieving good memory saving and retrieval efficiency. Specifically, we propose a Bloom filter-like data structure to encode counted k-mers by coupled-bit arrays-one for k-mer representation and the other for frequency encoding. Experiments on five real datasets show that the average memory-saving ratio on all 31-mers is as high as 13.81 as compared with raw input, with 7 hash functions. At the same time, the retrieval time complexity is well controlled (effectively constant), and the false-positive rate is decreased by two orders of magnitude. AVAILABILITY AND IMPLEMENTATION The source codes of our algorithm are available at github.com/lzhLab/kmcEx. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Peng Jiang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
| | - Jie Luo
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
| | - Yiqi Wang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
| | - Pingji Deng
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
| | - Bertil Schmidt
- Institute of Computer Science, Johannes Gutenberg University Mainz, Mainz Germany
| | - Xiangjun Tang
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China
| | - Ningjiang Chen
- School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China
| | - Limsoon Wong
- School of Computing, National University of Singapore, Singapore, Singapore
| | - Liang Zhao
- Precision Medicine Research Center, Taihe Hospital, Hubei University of Medicine, Shiyan, Hubei, China.,School of Computing and Electronic Information, Guangxi University, Nanning, Guangxi, China
| |
Collapse
|
10
|
Almodaresi F, Pandey P, Ferdman M, Johnson R, Patro R. An Efficient, Scalable, and Exact Representation of High-Dimensional Color Information Enabled Using de Bruijn Graph Search. J Comput Biol 2020; 27:485-499. [PMID: 32176522 PMCID: PMC7185321 DOI: 10.1089/cmb.2019.0322] [Citation(s) in RCA: 17] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
The colored de Bruijn graph (cdbg) and its variants have become an important combinatorial structure used in numerous areas in genomics, such as population-level variation detection in metagenomic samples, large-scale sequence search, and cdbg-based reference sequence indices. As samples or genomes are added to the cdbg, the color information comes to dominate the space required to represent this data structure. In this article, we show how to represent the color information efficiently by adopting a hierarchical encoding that exploits correlations among color classes-patterns of color occurrence-present in the de Bruijn graph (dbg). A major challenge in deriving an efficient encoding of the color information that takes advantage of such correlations is determining which color classes are close to each other in the high-dimensional space of possible color patterns. We demonstrate that the dbg itself can be used as an efficient mechanism to search for approximate nearest neighbors in this space. While our approach reduces the encoding size of the color information even for relatively small cdbgs (hundreds of experiments), the gains are particularly consequential as the number of potential colors (i.e., samples or references) grows into thousands. We apply this encoding in the context of two different applications; the implicit cdbg used for a large-scale sequence search index, Mantis, as well as the encoding of color information used in population-level variation detection by tools such as Vari and Rainbowfish. Our results show significant improvements in the overall size and scalability of representation of the color information. In our experiment on 10,000 samples, we achieved >11 × better compression compared to Ramen, Ramen, Rao (RRR).
Collapse
Affiliation(s)
- Fatemeh Almodaresi
- Department of Computer Science, University of Maryland, College Park, Maryland
| | - Prashant Pandey
- School of Computer Science, Carnegie Mellon University, Pittsburgh, Pennsylvania
| | - Michael Ferdman
- Department of Computer Science, Stony Brook University, Stony Brook, New York
| | - Rob Johnson
- Department of Computer Science, Stony Brook University, Stony Brook, New York
- VMware Research, Palo Alto, California
| | - Rob Patro
- Department of Computer Science, University of Maryland, College Park, Maryland
| |
Collapse
|
11
|
Rowe WPM. When the levee breaks: a practical guide to sketching algorithms for processing the flood of genomic data. Genome Biol 2019; 20:199. [PMID: 31519212 PMCID: PMC6744645 DOI: 10.1186/s13059-019-1809-x] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/12/2019] [Accepted: 09/02/2019] [Indexed: 01/21/2023] Open
Abstract
Considerable advances in genomics over the past decade have resulted in vast amounts of data being generated and deposited in global archives. The growth of these archives exceeds our ability to process their content, leading to significant analysis bottlenecks. Sketching algorithms produce small, approximate summaries of data and have shown great utility in tackling this flood of genomic data, while using minimal compute resources. This article reviews the current state of the field, focusing on how the algorithms work and how genomicists can utilize them effectively. References to interactive workbooks for explaining concepts and demonstrating workflows are included at https://github.com/will-rowe/genome-sketching .
Collapse
Affiliation(s)
- Will P M Rowe
- Institute of Microbiology and Infection, School of Biosciences, University of Birmingham, Birmingham, B15 2TT, UK.
- Scientific Computing Department, The Hartree Centre, STFC Daresbury Laboratory, Warrington, WA4 4AD, UK.
| |
Collapse
|
12
|
Abstract
Large-scale genomics demands computational methods that scale sublinearly with the growth of data. We review several data structures and sketching techniques that have been used in genomic analysis methods. Specifically, we focus on four key ideas that take different approaches to achieve sublinear space usage and processing time: compressed full-text indices, approximate membership query data structures, locality-sensitive hashing, and minimizers schemes. We describe these techniques at a high level and give several representative applications of each.
Collapse
Affiliation(s)
- Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;,
| | - Brad Solomon
- Department of Computer Science, Whiting School of Engineering, Johns Hopkins University, Baltimore, Maryland 21218, USA
| | - Rob Patro
- Department of Computer Science, Stony Brook University, Stony Brook, New York 11794, USA
| | - Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania 15213, USA;,
| |
Collapse
|
13
|
Mustafa H, Schilken I, Karasikov M, Eickhoff C, Rätsch G, Kahles A. Dynamic compression schemes for graph coloring. Bioinformatics 2019; 35:407-414. [PMID: 30020403 PMCID: PMC6530811 DOI: 10.1093/bioinformatics/bty632] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2018] [Revised: 06/22/2018] [Accepted: 07/16/2018] [Indexed: 11/30/2022] Open
Abstract
Motivation Technological advancements in high-throughput DNA sequencing have led to an exponential growth of sequencing data being produced and stored as a byproduct of biomedical research. Despite its public availability, a majority of this data remains hard to query for the research community due to a lack of efficient data representation and indexing solutions. One of the available techniques to represent read data is a condensed form as an assembly graph. Such a representation contains all sequence information but does not store contextual information and metadata. Results We present two new approaches for a compressed representation of a graph coloring: a lossless compression scheme based on a novel application of wavelet tries as well as a highly accurate lossy compression based on a set of Bloom filters. Both strategies retain a coloring even when adding to the underlying graph topology. We present construction and merge procedures for both methods and evaluate their performance on a wide range of different datasets. By dropping the requirement of a fully lossless compression and using the topological information of the underlying graph, we can reduce memory requirements by up to three orders of magnitude. Representing individual colors as independently stored modules, our approaches can be efficiently parallelized and provide strategies for dynamic use. These properties allow for an easy upscaling to the problem sizes common to the biomedical domain. Availability and implementation We provide prototype implementations in C++, summaries of our experiments as well as links to all datasets publicly at https://github.com/ratschlab/graph_annotation. Supplementary information Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Harun Mustafa
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Ingo Schilken
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
| | - Mikhail Karasikov
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - Carsten Eickhoff
- Brown Center for Biomedical Informatics, Brown University, Providence, RI, USA
| | - Gunnar Rätsch
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| | - André Kahles
- Department of Computer Science, ETH Zurich, Zurich, Switzerland
- Biomedical Informatics Research, University Hospital Zurich, Zurich, Switzerland
- SIB Swiss Institute of Bioinformatics, 1015 Lausanne, Switzerland
| |
Collapse
|
14
|
Pandey P, Almodaresi F, Bender MA, Ferdman M, Johnson R, Patro R. Mantis: A Fast, Small, and Exact Large-Scale Sequence-Search Index. Cell Syst 2018; 7:201-207.e4. [PMID: 29936185 PMCID: PMC10964368 DOI: 10.1016/j.cels.2018.05.021] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2018] [Revised: 05/08/2018] [Accepted: 05/25/2018] [Indexed: 01/08/2023]
Abstract
Sequence-level searches on large collections of RNA sequencing experiments, such as the NCBI Sequence Read Archive (SRA), would enable one to ask many questions about the expression or variation of a given transcript in a population. Existing approaches, such as the sequence Bloom tree, suffer from fundamental limitations of the Bloom filter, resulting in slow build and query times, less-than-optimal space usage, and potentially large numbers of false-positives. This paper introduces Mantis, a space-efficient system that uses new data structures to index thousands of raw-read experiments and facilitates large-scale sequence searches. In our evaluation, index construction with Mantis is 6× faster and yields a 20% smaller index than the state-of-the-art split sequence Bloom tree (SSBT). For queries, Mantis is 6-108× faster than SSBT and has no false-positives or -negatives. For example, Mantis was able to search for all 200,400 known human transcripts in an index of 2,652 RNA sequencing experiments in 82 min; SSBT took close to 4 days.
Collapse
Affiliation(s)
- Prashant Pandey
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Fatemeh Almodaresi
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Michael A Bender
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Michael Ferdman
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA
| | - Rob Johnson
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA; VMware Research, 3425 Hillview Ave, Palo Alto, CA 94304, USA
| | - Rob Patro
- Computer Science Department, Stony Brook University, 100 Nicolls Rd, Stony Brook, NY 11794, USA.
| |
Collapse
|