1
Beeloo R, Zomer A, Deorowicz S, Dutilh B. Graphite: painting genomes using a colored de Bruijn graph. NAR Genom Bioinform 2024; 6:lqae142. [PMID: 39445080 PMCID: PMC11497850 DOI: 10.1093/nargab/lqae142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/08/2024] [Revised: 08/02/2024] [Accepted: 10/05/2024] [Indexed: 10/25/2024]
Abstract
The recent growth of microbial sequence data allows comparisons at unprecedented scales, enabling the tracking of strains, mobile genetic elements, or genes. Querying a genome against a large reference database can easily yield thousands of matches that are tedious to interpret and pose computational challenges. We developed Graphite, which uses a colored de Bruijn graph (cDBG) to paint query genomes, selecting the local best matches along the full query length. By focusing on the best genomic match of each query region, Graphite reduces the number of matches while providing the most promising leads for sequence tracking or genomic forensics. When we applied Graphite to hundreds of Campylobacter genomes, we found extensive gene sharing, including a previously undetected C. coli plasmid that matched a C. jejuni chromosome. Together, genome painting using cDBGs, as enabled by Graphite, can reveal new biological phenomena by mitigating computational hurdles.
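The painting idea in the abstract above can be illustrated with a toy sketch. It indexes every k-mer by the reference genomes ("colors") that contain it, then labels each query k-mer position with the reference whose run of consecutive shared k-mers covering that position is longest. This is only an assumption-laden illustration of "local best match", not Graphite's actual algorithm or data structure; the names `build_color_index` and `paint` are hypothetical.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return all overlapping k-mers of seq, in order."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_color_index(references, k):
    """Map each k-mer to the set of reference names (colors) containing it."""
    index = defaultdict(set)
    for name, seq in references.items():
        for km in kmers(seq, k):
            index[km].add(name)
    return index


def paint(query, index, k):
    """Label each query k-mer position with one color, or None if unmatched.

    Hypothetical rule: a position gets the color whose run of consecutive
    shared k-mers covering it is longest; ties go to the alphabetically
    first color.
    """
    hits = [index.get(km, set()) for km in kmers(query, k)]
    n = len(hits)
    best = [(0, None)] * n
    for color in sorted(set().union(*hits)):
        i = 0
        while i < n:
            if color not in hits[i]:
                i += 1
                continue
            j = i
            while j < n and color in hits[j]:
                j += 1
            run = j - i  # number of consecutive k-mers shared with this color
            for p in range(i, j):
                if run > best[p][0]:
                    best[p] = (run, color)
            i = j
    return [color for _, color in best]
```

For example, with references `{"A": "ACGTAC", "B": "GGTTCA"}` and `k = 3`, painting the concatenated query `"ACGTACGGTTCA"` labels the first k-mer positions with `"A"`, the junction k-mer with `None`, and the rest with `"B"`.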
Affiliation(s)
- Rick Beeloo
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Aldert L Zomer
- Department of Infectious Diseases and Immunology, Faculty of Veterinary Medicine, Utrecht University, 3584 Utrecht, The Netherlands
- Sebastian Deorowicz
- Department of Algorithmics and Software, Silesian University of Technology, Akademicka 16, Gliwice PL-44100, Poland
- Bas E Dutilh
- Theoretical Biology and Bioinformatics, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
- Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich Schiller University Jena, 07743 Jena, Germany
2
Yi W, Wang C, Xie Q, Zhao Y, Jia J. PSBF: p-adic Integer Scalable Bloom Filter. Sensors (Basel) 2023; 23:7775. [PMID: 37765833 PMCID: PMC10537130 DOI: 10.3390/s23187775] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/26/2023] [Revised: 09/07/2023] [Accepted: 09/08/2023] [Indexed: 09/29/2023]
Abstract
Given the challenges associated with the dynamic expansion of the conventional Bloom filter's capacity, the prevalence of false positives, and the subpar access performance, this study employs the algebraic and topological characteristics of p-adic integers to introduce the p-adic Integer Scalable Bloom Filter (PSBF), an approach to dynamic expansion. The proposed method converts the target element into an integer using a string hash function and then converts that integer into a p-adic integer through algebraic properties. This process automatically establishes the topological tree access structure of the PSBF. The experiment compared access performance among the standard Bloom filter, dynamic Bloom filter, and scalable Bloom filter. The findings indicate that the PSBF offers advantages such as avoidance of a linear storage structure, enhanced efficiency in element insertion and query, improved storage space utilization, and reduced likelihood of false positives. Consequently, the PSBF presents a novel approach to the dynamic extensibility of Bloom filters.
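The two conversion steps named in the abstract above (string to integer, integer to p-adic integer) can be sketched as follows. For a non-negative integer, the p-adic expansion is simply its base-p digit sequence, least significant digit first, and a fixed-length prefix of those digits can serve as a path in a p-ary tree. The use of SHA-256 as the string hash, the 32-bit truncation, and the function names are assumptions made for illustration; the abstract does not specify them, and this is not the PSBF data structure itself.

```python
import hashlib


def string_to_int(item, bits=32):
    """Hash a string to a non-negative integer below 2**bits.

    SHA-256 is an assumed stand-in for the unspecified string hash function.
    """
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (1 << bits)


def padic_digits(n, p, depth):
    """Return the first `depth` p-adic digits of n, least significant first."""
    if p < 2 or n < 0 or depth < 0:
        raise ValueError("need p >= 2, n >= 0 and depth >= 0")
    digits = []
    for _ in range(depth):
        n, d = divmod(n, p)
        digits.append(d)
    return digits
```

Under this sketch, `padic_digits(string_to_int("some-key"), p=3, depth=4)` would give a four-step path through a ternary tree; two keys that agree on more leading digits are closer in the p-adic metric and share a longer path prefix.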
Affiliation(s)
- Jing Jia
- School of Software, Jiangxi Agricultural University, Nanchang 330045, China; (W.Y.); (C.W.); (Q.X.); (Y.Z.)
3
Quan L, Chu X, Sun X, Wu T, Lyu Q. How Deepbics Quantifies Intensities of Transcription Factor-DNA Binding and Facilitates Prediction of Single Nucleotide Variant Pathogenicity With a Deep Learning Model Trained On ChIP-Seq Data Sets. IEEE/ACM Trans Comput Biol Bioinform 2023; 20:1594-1599. [PMID: 35471887 DOI: 10.1109/tcbb.2022.3170343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The binding of DNA sequences to cell type-specific transcription factors is essential for regulating gene expression in all organisms. Many variants occurring in these binding regions play crucial roles in human disease by disrupting the cis-regulation of gene expression. We first implemented a sequence-based deep learning model called deepBICS to quantify the intensity of transcription factor-DNA binding. The experimental results not only showed the superiority of deepBICS on ChIP-seq data sets but also suggested that deepBICS, as a language model, could help classify disease-related and neutral variants. We then built a language model-based method called deepBICS4SNV to predict the pathogenicity of single nucleotide variants. The good performance of deepBICS4SNV on two tests related to Mendelian disorders and viral diseases shows that the sequence contextual information derived from language models can improve prediction accuracy and generalization capability.
4
Zhao C, Shi ZJ, Pollard KS. Pitfalls of genotyping microbial communities with rapidly growing genome collections. Cell Syst 2023; 14:160-176.e3. [PMID: 36657438 PMCID: PMC9957970 DOI: 10.1016/j.cels.2022.12.007] [Citation(s) in RCA: 9] [Impact Index Per Article: 4.5] [Received: 06/30/2022] [Revised: 10/15/2022] [Accepted: 12/19/2022] [Indexed: 01/20/2023]
Abstract
Detecting genetic variants in metagenomic data is a priority for understanding the evolution, ecology, and functional characteristics of microbial communities. Many tools that perform this metagenotyping rely on aligning reads of unknown origin to a database of sequences from many species before calling variants. In this synthesis, we investigate how databases of increasingly diverse and closely related species have pushed the limits of current alignment algorithms, thereby degrading the performance of metagenotyping tools. We identify multi-mapping reads as a prevalent source of errors and illustrate a trade-off between retaining correct alignments versus limiting incorrect alignments, many of which map reads to the wrong species. Then we evaluate several actionable mitigation strategies and review emerging methods showing promise to further improve metagenotyping in response to the rapid growth in genome collections. Our results have implications beyond metagenotyping to the many tools in microbial genomics that depend upon accurate read mapping.
Affiliation(s)
- Chunyu Zhao
- Chan Zuckerberg Biohub, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
- Zhou Jason Shi
- Chan Zuckerberg Biohub, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA
- Katherine S Pollard
- Chan Zuckerberg Biohub, San Francisco, CA, USA; Gladstone Institute of Data Science and Biotechnology, San Francisco, CA, USA; Department of Epidemiology & Biostatistics, University of California, San Francisco, San Francisco, CA, USA.
5
Yao H, Hu G, Liu S, Fang H, Ji Y. SparkGC: Spark based genome compression for large collections of genomes. BMC Bioinformatics 2022; 23:297. [PMID: 35879669 PMCID: PMC9310413 DOI: 10.1186/s12859-022-04825-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/28/2022] [Accepted: 07/06/2022] [Indexed: 11/23/2022]
Abstract
Since the completion of the Human Genome Project at the turn of the century, there has been an unprecedented proliferation of sequencing data. One consequence is that it has become extremely difficult to store, back up, and migrate the enormous amount of genomic data, which continues to expand as the cost of sequencing decreases. Hence, a much more efficient and scalable genome compression program is urgently required. In this manuscript, we propose a new Apache Spark based Genome Compression method called SparkGC that can run efficiently and cost-effectively on a scalable computational cluster to compress large collections of genomes. SparkGC uses Spark's in-memory computation capabilities to reduce compression time by keeping data active in memory between the first-order and second-order compression. The evaluation shows that the compression ratio of SparkGC is at least 30% better than that of the best state-of-the-art methods. The compression speed is also at least 3.8 times that of the best state-of-the-art methods on only one worker node and scales well with the number of nodes. SparkGC is of significant benefit to genomic data storage and transmission. The source code of SparkGC is publicly available at https://github.com/haichangyao/SparkGC.
Affiliation(s)
- Haichang Yao
- School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China
- Guangyong Hu
- School of Computer and Software, Nanjing Vocational University of Industry Technology, Nanjing, 210023, China
- Shangdong Liu
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- Houzhi Fang
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China
- Yimu Ji
- School of Computer Science, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China; Jiangsu HPC and Intelligent Processing Engineer Research Center, Nanjing, 210003, China; Institute of High Performance Computing and Bigdata, Nanjing University of Posts and Telecommunications, Nanjing, 210023, China.
6
Shi ZJ, Dimitrov B, Zhao C, Nayfach S, Pollard KS. Fast and accurate metagenotyping of the human gut microbiome with GT-Pro. Nat Biotechnol 2022; 40:507-516. [PMID: 34949778 DOI: 10.1038/s41587-021-01102-3] [Citation(s) in RCA: 18] [Impact Index Per Article: 6.0] [Received: 06/29/2020] [Accepted: 09/20/2021] [Indexed: 02/07/2023]
Abstract
Single nucleotide polymorphisms (SNPs) in metagenomics are used to quantify population structure, track strains and identify genetic determinants of microbial phenotypes. However, existing alignment-based approaches for metagenomic SNP detection require high-performance computing and enough read coverage to distinguish SNPs from sequencing errors. To address these issues, we developed the GenoTyper for Prokaryotes (GT-Pro), a suite of methods to catalog SNPs from genomes and use unique k-mers to rapidly genotype these SNPs from metagenomes. Compared to methods that use read alignment, GT-Pro is more accurate and two orders of magnitude faster. Using high-quality genomes, we constructed a catalog of 104 million SNPs in 909 human gut species and used unique k-mers targeting this catalog to characterize the global population structure of gut microbes from 7,459 samples. GT-Pro enables fast and memory-efficient metagenotyping of millions of SNPs on a personal computer.
Affiliation(s)
- Zhou Jason Shi
- Data Science, Chan Zuckerberg Biohub, San Francisco, CA, USA
- Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA
- Chunyu Zhao
- Data Science, Chan Zuckerberg Biohub, San Francisco, CA, USA
- Stephen Nayfach
- Department of Energy, Joint Genome Institute, Walnut Creek, CA, USA.
- Environmental Genomics and Systems Biology Division, Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
- Katherine S Pollard
- Data Science, Chan Zuckerberg Biohub, San Francisco, CA, USA.
- Data Science and Biotechnology, Gladstone Institutes, San Francisco, CA, USA.
- Epidemiology and Biostatistics, University of California, San Francisco, CA, USA.
7
Quan L, Sun X, Wu J, Mei J, Huang L, He R, Nie L, Chen Y, Lyu Q. Learning Useful Representations of DNA Sequences From ChIP-Seq Datasets for Exploring Transcription Factor Binding Specificities. IEEE/ACM Trans Comput Biol Bioinform 2022; 19:998-1008. [PMID: 32976105 DOI: 10.1109/tcbb.2020.3026787] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/11/2023]
Abstract
Deep learning has been successfully applied to surprisingly different domains. Researchers and practitioners are employing trained deep learning models to enrich our knowledge. Transcription factors (TFs) are essential for regulating gene expression in all organisms by binding to specific DNA sequences. Here, we designed a deep learning model named SemanticCS (Semantic ChIP-seq) to predict TF binding specificities. We trained our learning model on an ensemble of ChIP-seq datasets (Multi-TF-cell) to learn useful intermediate features across multiple TFs and cells. To interpret these feature vectors, visualization analysis was used. Our results indicate that these learned representations can be used to train shallow machines for other tasks. Using diverse experimental data and evaluation metrics, we show that SemanticCS outperforms other popular methods. In addition, from experimental data, SemanticCS can help to identify the substitutions that cause regulatory abnormalities and to evaluate the effect of substitutions on the binding affinity for the RXR transcription factor. The online server for SemanticCS is freely available at http://qianglab.scst.suda.edu.cn/semanticCS/.
8
Abstract
MOTIVATION Minimizers are efficient methods to sample k-mers from genomic sequences that unconditionally preserve sufficiently long matches between sequences. Well-established methods to construct efficient minimizers focus on sampling fewer k-mers on a random sequence and use universal hitting sets (sets of k-mers that appear frequently enough) to upper bound the sketch size. In contrast, the problem of sequence-specific minimizers, which is to construct efficient minimizers to sample fewer k-mers on a specific sequence such as the reference genome, is less studied. Currently, the theoretical understanding of this problem is lacking, and existing methods do not specialize well to sketch specific sequences. RESULTS We propose the concept of polar sets, complementary to the existing idea of universal hitting sets. Polar sets are k-mer sets that are spread out enough on the reference, and provably specialize well to specific sequences. Link energy measures how well spread out a polar set is, and with it, the sketch size can be bounded from above and below in a theoretically sound way. This allows for direct optimization of sketch size. We propose efficient heuristics to construct polar sets, and via experiments on the human reference genome, show their practical superiority in designing efficient sequence-specific minimizers. AVAILABILITY AND IMPLEMENTATION A reference implementation and code for analyses under an open-source license are at https://github.com/kingsford-group/polarset. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
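For readers unfamiliar with the baseline that the abstract above builds on, here is a minimal sketch of a classic (w, k)-minimizer scheme: in every window of w consecutive k-mers, the smallest k-mer is sampled. It uses plain lexicographic order with leftmost tie-breaking, which is an assumption chosen for simplicity; it does not implement polar sets or universal hitting sets, and the function name `minimizer_positions` is hypothetical.

```python
def minimizer_positions(seq, k, w):
    """Return start positions of k-mers sampled by a (w, k)-minimizer scheme.

    Each window of w consecutive k-mers contributes its lexicographically
    smallest k-mer (leftmost on ties). Consecutive windows often pick the
    same position, so repeats are collapsed.
    """
    n = len(seq) - k + 1  # number of k-mers in seq
    if k <= 0 or w <= 0 or n < w:
        return []
    selected = []
    for start in range(n - w + 1):
        pos = min(range(start, start + w), key=lambda i: (seq[i:i + k], i))
        # With a fixed order and leftmost tie-breaking, picked positions never
        # decrease, so comparing against the last pick removes all duplicates.
        if not selected or selected[-1] != pos:
            selected.append(pos)
    return selected
```

The sketch size discussed in the abstract corresponds to `len(minimizer_positions(seq, k, w))`; a sequence-specific scheme aims to make that count smaller on one given reference while still sampling at least one k-mer per window.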
Affiliation(s)
- Hongyu Zheng
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Carl Kingsford
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
- Guillaume Marçais
- Computational Biology Department, School of Computer Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA
9
Kaye AM, Wasserman WW. The genome atlas: navigating a new era of reference genomes. Trends Genet 2021; 37:807-818. [PMID: 33419587 DOI: 10.1016/j.tig.2020.12.002] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 10/09/2020] [Revised: 12/03/2020] [Accepted: 12/07/2020] [Indexed: 10/22/2022]
Abstract
The reference genome serves two distinct purposes within the field of genomics. First, it provides a persistent structure against which findings can be reported, allowing for universal knowledge exchange between users. Second, it reduces the computational costs and time required to process genomic data by creating a scaffold that can be relied upon by analysis software. Here, we posit that current efforts to extend the linear reference to a graph-based structure while trying to fulfil both of these purposes concurrently will face a trade-off between comprehensiveness and computational efficiency. In this article, we explore how the reference genome is used and suggest an alternative structure, The Genome Atlas (TGA), to fulfil the bipartite role of the reference genome.
Affiliation(s)
- Alice M Kaye
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada
- Wyeth W Wasserman
- Centre for Molecular Medicine and Therapeutics, BC Children's Hospital Research Institute, University of British Columbia, Vancouver, BC, Canada.
10
Liu Y, Wong L, Li J. Allowing mutations in maximal matches boosts genome compression performance. Bioinformatics 2020; 36:4675-4681. [PMID: 33118018 DOI: 10.1093/bioinformatics/btaa572] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Received: 01/25/2020] [Revised: 05/05/2020] [Accepted: 06/10/2020] [Indexed: 01/23/2023]
Abstract
MOTIVATION A maximal match between two genomes is a contiguous non-extendable sub-sequence common in the two genomes. DNA bases mutate very often from the genome of one individual to another. When a mutation occurs in a maximal match, it breaks the maximal match into shorter match segments. The coding cost using these broken segments for reference-based genome compression is much higher than that of using the maximal match which is allowed to contain mutations. RESULTS We present memRGC, a novel reference-based genome compression algorithm that leverages mutation-containing matches (MCMs) for genome encoding. MemRGC detects maximal matches between two genomes using a coprime double-window k-mer sampling search scheme; the method then extends these matches to cover mismatches (mutations) and their neighbouring maximal matches to form long MCMs. Experiments reveal that memRGC boosts the compression performance by an average of 27% in reference-based genome compression. MemRGC is also better than the best state-of-the-art methods on all of the benchmark datasets, sometimes better by 50%. Moreover, memRGC uses much less memory and de-compression resources, while providing comparable compression speed. These advantages are of significant benefit to genome data storage and transmission. AVAILABILITY AND IMPLEMENTATION https://github.com/yuansliu/memRGC. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yuansheng Liu
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
- Limsoon Wong
- School of Computing, National University of Singapore, Singapore 117417, Singapore
- Jinyan Li
- Advanced Analytics Institute, Faculty of Engineering and IT, University of Technology Sydney, Ultimo, NSW 2007, Australia
11
Kutzner A, Kim PS, Schmidt M. A performant bridge between fixed-size and variable-size seeding. BMC Bioinformatics 2020; 21:328. [PMID: 32703211 PMCID: PMC7376731 DOI: 10.1186/s12859-020-03642-y] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 04/06/2020] [Accepted: 07/02/2020] [Indexed: 11/16/2022]
Abstract
Background Seeding is usually the initial step of high-throughput sequence aligners. Two popular seeding strategies are fixed-size seeding (k-mers, minimizers) and variable-size seeding (MEMs, SMEMs, maximal spanning seeds). The former strategy supports fast seed computation, while the latter one benefits from a high seed uniqueness. Algorithmic bridges between instances of both seeding strategies are of interest for combining their respective advantages. Results We introduce an efficient strategy for computing MEMs out of fixed-size seeds (k-mers or minimizers). In contrast to previously proposed extend-purge strategies, our merge-extend strategy prevents the creation and filtering of duplicate MEMs. Further, we describe techniques for extracting SMEMs or maximal spanning seeds out of MEMs. A comprehensive benchmarking shows the applicability, strengths, shortcomings and computational requirements of all discussed seeding techniques. Additionally, we report the effects of seed occurrence filters in the context of these techniques. Aside from our novel algorithmic approaches, we analyze hierarchies within fixed-size and variable-size seeding along with a mapping between instances of both seeding strategies. Conclusion Benchmarking shows that our proposed merge-extend strategy for MEM computation outperforms previous extend-purge strategies in the context of PacBio reads. The observed superiority grows with increasing read size and read quality. Further, the presented filters for extracting SMEMs or maximal spanning seeds out of MEMs outperform FMD-index based extension techniques. All code used for benchmarking is available via GitHub at https://github.com/ITBE-Lab/seed-evaluation.
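The relationship between fixed-size seeds and maximal exact matches (MEMs) described in the abstract above can be sketched in a simplified setting. When every k-mer of the reference is indexed (no minimizer subsampling), consecutive k-mer hits on the same diagonal (reference position minus query position) can be merged into one longer seed, and each merged run is exactly a MEM of length at least k, so no separate extension step is needed. This is an illustrative simplification under that assumption, not the authors' merge-extend implementation; the function names are hypothetical.

```python
from collections import defaultdict


def kmer_seeds(query, ref, k):
    """Return all (query_pos, ref_pos) pairs where the k-mers are identical."""
    index = defaultdict(list)
    for r in range(len(ref) - k + 1):
        index[ref[r:r + k]].append(r)
    return [(q, r)
            for q in range(len(query) - k + 1)
            for r in index.get(query[q:q + k], [])]


def merge_seeds(seeds, k):
    """Merge consecutive k-mer seeds on a diagonal into (q, r, length) triples."""
    by_diagonal = defaultdict(list)
    for q, r in seeds:
        by_diagonal[r - q].append(q)
    mems = []
    for diag, starts in by_diagonal.items():
        starts.sort()
        run_start = prev = starts[0]
        for q in starts[1:]:
            if q != prev + 1:  # gap on the diagonal: close the current run
                mems.append((run_start, run_start + diag, prev - run_start + k))
                run_start = q
            prev = q
        mems.append((run_start, run_start + diag, prev - run_start + k))
    return sorted(mems)
```

For instance, `merge_seeds(kmer_seeds("ACGTTAA", "GGACGTTCC", 3), 3)` collapses three overlapping 3-mer hits into the single match `(0, 2, 5)`, that is, `ACGTT` at query position 0 and reference position 2. With subsampled seeds such as minimizers, merged runs would still need extending at both ends to become maximal.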
Affiliation(s)
- Arne Kutzner
- Department of Information Systems, College of Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul, 04763, Republic of Korea
- Pok-Son Kim
- Department of Information Security, Cryptology, and Mathematics, Kookmin University, 77, Jeongneung-ro, Seongbuk-gu, Seoul, 02707, Republic of Korea
- Markus Schmidt
- Department of Information Systems, College of Engineering, Hanyang University, 222 Wangsimni-ro, Seongdong-gu, Seoul, 04763, Republic of Korea.