1
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Collapse
Affiliation(s)
- Camille Moeckel
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Manvita Mareboina
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Maxwell A. Konnaris
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Candace S.Y. Chan
- Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
| | - Ioannis Mouratidis
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| | - Austin Montgomery
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | - Nikol Chantzi
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
| | | | - Ilias Georgakopoulos-Soares
- Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
| |
Collapse
|
2
|
Rossignolo E, Comin M. Enhanced Compression of k-Mer Sets with Counters via de Bruijn Graphs. J Comput Biol 2024; 31:524-538. [PMID: 38820168 DOI: 10.1089/cmb.2024.0530] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/02/2024] Open
Abstract
An essential task in computational genomics involves transforming input sequences into their constituent k-mers. The quest for an efficient representation of k-mer sets is crucial for enhancing the scalability of bioinformatic analyses. One widely used method involves converting the k-mer set into a de Bruijn graph (dBG), followed by seeking a compact graph representation via the smallest path cover. This study introduces USTAR* (Unitig STitch Advanced constRuction), a tool designed to compress both a set of k-mers and their associated counts. USTAR leverages the connectivity and density of dBGs, enabling a more efficient path selection for constructing the path cover. The efficacy of USTAR is demonstrated through its application in compressing real read data sets. USTAR improves the compression achieved by UST (Unitig STitch), the best algorithm, by percentages ranging from 2.3% to 26.4%, depending on the k-mer size, and it is up to 7 × times faster.
Collapse
Affiliation(s)
- Enrico Rossignolo
- Department of Information Engineering, University of Padua, Padua, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padua, Padua, Italy
| |
Collapse
|
3
|
Schmidt S, Khan S, Alanko JN, Pibiri GE, Tomescu AI. Matchtigs: minimum plain text representation of k-mer sets. Genome Biol 2023; 24:136. [PMID: 37296461 PMCID: PMC10251615 DOI: 10.1186/s13059-023-02968-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Accepted: 05/10/2023] [Indexed: 06/12/2023] Open
Abstract
We propose a polynomial algorithm computing a minimum plain-text representation of k-mer sets, as well as an efficient near-minimum greedy heuristic. When compressing read sets of large model organisms or bacterial pangenomes, with only a minor runtime increase, we shrink the representation by up to 59% over unitigs and 26% over previous work. Additionally, the number of strings is decreased by up to 97% over unitigs and 90% over previous work. Finally, a small representation has advantages in downstream applications, as it speeds up SSHash-Lite queries by up to 4.26× over unitigs and 2.10× over previous work.
Collapse
Affiliation(s)
- Sebastian Schmidt
- Department of Computer Science, University of Helsinki, Helsinki, Finland
| | - Shahbaz Khan
- Department of Computer Science and Engineering, Indian Institute of Technology Roorkee, Roorkee, India
| | - Jarno N. Alanko
- Department of Computer Science, University of Helsinki, Helsinki, Finland
- Faculty of Computer Science, Dalhousie University, Halifax, Canada
| | - Giulio E. Pibiri
- Department of Environmental Sciences, Informatics and Statistics, Ca’ Foscari University of Venice, Venice, Italy
- ISTI-CNR, Pisa, Italy
| | | |
Collapse
|
4
|
Grytten I, Dagestad Rand K, Sandve GK. KAGE: fast alignment-free graph-based genotyping of SNPs and short indels. Genome Biol 2022; 23:209. [PMID: 36195962 PMCID: PMC9531401 DOI: 10.1186/s13059-022-02771-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2021] [Accepted: 09/09/2022] [Indexed: 11/10/2022] Open
Abstract
Genotyping is a core application of high-throughput sequencing. We present KAGE, a genotyper for SNPs and short indels that is inspired by recent developments within graph-based genome representations and alignment-free methods. KAGE uses a pan-genome representation of the population to efficiently and accurately predict genotypes. Two novel ideas improve both the speed and accuracy: a Bayesian model incorporates genotypes from thousands of individuals to improve prediction accuracy, and a computationally efficient method leverages correlation between variants. We show that the accuracy of KAGE is at par with the best existing alignment-free genotypers, while being an order of magnitude faster.
Collapse
Affiliation(s)
- Ivar Grytten
- Department of Informatics, University of Oslo, Gaustadalleen 23 B, 0371, Oslo, Norway. .,Centre for Bioinformatics, University of Oslo, Gaustadalleen 30, 0373, Oslo, Norway.
| | - Knut Dagestad Rand
- Department of Informatics, University of Oslo, Gaustadalleen 23 B, 0371, Oslo, Norway.,Centre for Bioinformatics, University of Oslo, Gaustadalleen 30, 0373, Oslo, Norway
| | - Geir Kjetil Sandve
- Department of Informatics, University of Oslo, Gaustadalleen 23 B, 0371, Oslo, Norway.,Centre for Bioinformatics, University of Oslo, Gaustadalleen 30, 0373, Oslo, Norway
| |
Collapse
|
5
|
Santoro D, Pellegrina L, Comin M, Vandin F. SPRISS: approximating frequent k-mers by sampling reads, and applications. Bioinformatics 2022; 38:3343-3350. [PMID: 35583271 PMCID: PMC9237683 DOI: 10.1093/bioinformatics/btac180] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/16/2021] [Revised: 02/25/2022] [Accepted: 05/16/2022] [Indexed: 11/29/2022] Open
Abstract
MOTIVATION The extraction of k-mers is a fundamental component in many complex analyses of large next-generation sequencing datasets, including reads classification in genomics and the characterization of RNA-seq datasets. The extraction of all k-mers and their frequencies is extremely demanding in terms of running time and memory, owing to the size of the data and to the exponential number of k-mers to be considered. However, in several applications, only frequent k-mers, which are k-mers appearing in a relatively high proportion of the data, are required by the analysis. RESULTS In this work, we present SPRISS, a new efficient algorithm to approximate frequent k-mers and their frequencies in next-generation sequencing data. SPRISS uses a simple yet powerful reads sampling scheme, which allows to extract a representative subset of the dataset that can be used, in combination with any k-mer counting algorithm, to perform downstream analyses in a fraction of the time required by the analysis of the whole data, while obtaining comparable answers. Our extensive experimental evaluation demonstrates the efficiency and accuracy of SPRISS in approximating frequent k-mers, and shows that it can be used in various scenarios, such as the comparison of metagenomic datasets, the identification of discriminative k-mers, and SNP (single nucleotide polymorphism) genotyping, to extract insights in a fraction of the time required by the analysis of the whole dataset. AVAILABILITY AND IMPLEMENTATION SPRISS [a preliminary version (Santoro et al., 2021) of this work was presented at RECOMB 2021] is available at https://github.com/VandinLab/SPRISS. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Diego Santoro
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Leonardo Pellegrina
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Matteo Comin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| | - Fabio Vandin
- Department of Information Engineering, University of Padova, 35131 Padova, Italy
| |
Collapse
|
6
|
Ebler J, Ebert P, Clarke WE, Rausch T, Audano PA, Houwaart T, Mao Y, Korbel JO, Eichler EE, Zody MC, Dilthey AT, Marschall T. Pangenome-based genome inference allows efficient and accurate genotyping across a wide spectrum of variant classes. Nat Genet 2022; 54:518-525. [PMID: 35410384 PMCID: PMC9005351 DOI: 10.1038/s41588-022-01043-w] [Citation(s) in RCA: 121] [Impact Index Per Article: 40.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/15/2021] [Accepted: 03/03/2022] [Indexed: 12/30/2022]
Abstract
Typical genotyping workflows map reads to a reference genome before identifying genetic variants. Generating such alignments introduces reference biases and comes with substantial computational burden. Furthermore, short-read lengths limit the ability to characterize repetitive genomic regions, which are particularly challenging for fast k-mer-based genotypers. In the present study, we propose a new algorithm, PanGenie, that leverages a haplotype-resolved pangenome reference together with k-mer counts from short-read sequencing data to genotype a wide spectrum of genetic variation-a process we refer to as genome inference. Compared with mapping-based approaches, PanGenie is more than 4 times faster at 30-fold coverage and achieves better genotype concordances for almost all variant types and coverages tested. Improvements are especially pronounced for large insertions (≥50 bp) and variants in repetitive regions, enabling the inclusion of these classes of variants in genome-wide association studies. PanGenie efficiently leverages the increasing amount of haplotype-resolved assemblies to unravel the functional impact of previously inaccessible variants while being faster compared with alignment-based workflows.
Collapse
Affiliation(s)
- Jana Ebler
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Peter Ebert
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | | | - Tobias Rausch
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
- European Molecular Biology Laboratory, GeneCore, Heidelberg, Germany
| | - Peter A Audano
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Torsten Houwaart
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
| | - Yafei Mao
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
| | - Jan O Korbel
- European Molecular Biology Laboratory, Genome Biology Unit, Heidelberg, Germany
| | - Evan E Eichler
- Department of Genome Sciences, University of Washington School of Medicine, Seattle, WA, USA
- Howard Hughes Medical Institute, University of Washington, Seattle, WA, USA
| | | | - Alexander T Dilthey
- Institute of Medical Microbiology and Hospital Hygiene, Heinrich Heine University Düsseldorf, Düsseldorf, Germany
- Institute of Medical Statistics and Computational Biology, University of Cologne, Cologne, Germany
- Cologne Excellence Cluster on Cellular Stress Responses in Aging-Associated Diseases, University of Cologne, Cologne, Germany
| | - Tobias Marschall
- Institute for Medical Biometry and Bioinformatics, Medical Faculty, Heinrich Heine University Düsseldorf, Düsseldorf, Germany.
| |
Collapse
|
7
|
Blanca A, Harris RS, Koslicki D, Medvedev P. The Statistics of k-mers from a Sequence Undergoing a Simple Mutation Process Without Spurious Matches. J Comput Biol 2022; 29:155-168. [PMID: 35108101 PMCID: PMC11978275 DOI: 10.1089/cmb.2021.0431] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/22/2023] Open
Abstract
k-mer-based methods are widely used in bioinformatics, but there are many gaps in our understanding of their statistical properties. Here, we consider the simple model where a sequence S (e.g., a genome or a read) undergoes a simple mutation process through which each nucleotide is mutated independently with some probability r, under the assumption that there are no spurious k-mer matches. How does this process affect the k-mers of S? We derive the expectation and variance of the number of mutated k-mers and of the number of islands (a maximal interval of mutated k-mers) and oceans (a maximal interval of nonmutated k-mers). We then derive hypothesis tests and confidence intervals (CIs) for r given an observed number of mutated k-mers, or, alternatively, given the Jaccard similarity (with or without MinHash). We demonstrate the usefulness of our results using a few select applications: obtaining a CI to supplement the Mash distance point estimate, filtering out reads during alignment by Minimap2, and rating long-read alignments to a de Bruijn graph by Jabba.
Collapse
Affiliation(s)
- Antonio Blanca
- Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Robert S. Harris
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - David Koslicki
- Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
| | - Paul Medvedev
- Department of Computer Science and Engineering, and The Pennsylvania State University, University Park, Pennsylvania, USA
- Huck Institutes of the Life Sciences, The Pennsylvania State University, University Park, Pennsylvania, USA
- Department of Biochemistry and Molecular Biology, The Pennsylvania State University, University Park, Pennsylvania, USA
| |
Collapse
|
8
|
Bernardini G, Denti L, Previtali M. Alignment-Free Genotyping of Known Variations with MALVA. Methods Mol Biol 2022; 2493:247-256. [PMID: 35751819 DOI: 10.1007/978-1-0716-2293-3_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The discovery and characterization of sequence variations in human populations are crucial in genetic studies. Standard methods for addressing this problem are computationally expensive and highly time consuming, thus impractical for clinical applications, where time is often an issue. When the task is to genotype variations that have been previously annotated, alignment-free methods come to the aid. Here, we describe MALVA, an alignment-free approach for genotyping a set of known variations. MALVA is the first mapping-free tool which is able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variations.
Collapse
Affiliation(s)
| | - Luca Denti
- Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France.
| | - Marco Previtali
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milan, Italy
| |
Collapse
|
9
|
Jha RM, Zusman BE, Puccio AM, Okonkwo DO, Pease M, Desai SM, Leach M, Conley YP, Kochanek PM. Genetic Variants Associated With Intraparenchymal Hemorrhage Progression After Traumatic Brain Injury. JAMA Netw Open 2021; 4:e2116839. [PMID: 34309670 PMCID: PMC8314141 DOI: 10.1001/jamanetworkopen.2021.16839] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Indexed: 12/17/2022] Open
Abstract
IMPORTANCE Intracerebral hemorrhage progression is associated with unfavorable outcome after traumatic brain injury (TBI). No effective treatments are currently available. This secondary injury process reflects an extreme form of vasogenic edema and blood-brain barrier breakdown. The sulfonylurea receptor 1-transient receptor potential melastatin 4 (SUR1-TRPM4) cation channel is a key underlying mechanism. A phase 2 trial of SUR1-TRPM4 inhibition in contusional TBI is ongoing, and a phase 3 trial is being designed. Targeted identification of patients at increased risk for hemorrhage progression may inform prognostication, trial design (including patient selection), and ultimately treatment response. OBJECTIVE To determine whether ABCC8 (SUR1) and TRPM4 genetic variability are associated with intraparenchymal hemorrhage (IPH) progression after severe TBI, based on the putative involvement of the SUR1-TRPM4 channel in this pathophysiology. DESIGN, SETTING, AND PARTICIPANTS In this genetic association study, DNA was extracted from 416 patients with severe TBI prospectively enrolled from a level I trauma academic medical center from May 9, 2002, to August 8, 2014. Forty ABCC8 and TRPM4 single-nucleotide variants (SNVs) were genotyped (multiplex, unbiased). Data were analyzed from January 7, 2020, to May 3, 2021. MAIN OUTCOMES AND MEASURES Primary analyses addressed IPH progression at 6, 24, and 120 hours in patients without acute craniectomy (n = 321). Multivariable regressions and receiver operating characteristic curves assessed SNV and haplotype associations with progression. Spatial modeling and functional predictions were determined using standard software. RESULTS Of the 321 patients included in the analysis (mean [SD] age, 37.0 [16.3] years; 247 [76.9%] male), IPH progression occurred in 102. Four ABCC8 SNVs were associated with markedly increased odds of progression (rs2237982 [odds ratio (OR), 2.60-3.80; 95% CI, 1.14-5.90 to 1.80-8.02; P = .02 to P < .001], rs2283261 [OR, 3.37-4.77; 95% CI, 1.07-10.77 to 1.89-12.07; P = .04 to P = .001], rs3819521 [OR, 2.96-3.92; 95% CI, 1.13-7.75 to 1.42-10.87; P = .03 to P = .009], and rs8192695 [OR, 3.06-4.95; 95% CI, 1.02-9.12 to 1.67-14.68]; P = .03-.004). These are brain-specific expression quantitative trait loci (eQTL) associated with increased ABCC8 messenger RNA levels. Regulatory annotations revealed promoter and enhancer marks and strong and/or active brain-tissue transcription, directionally consistent with increased progression. Three SNVs (rs2283261, rs2237982, and rs3819521) in this cohort have been associated with intracranial hypertension. Four TRPM4 SNVs were associated with decreased IPH progression (rs3760666 [OR, 0.40-0.49; 95% CI, 0.19-0.86 to 0.27-0.89; P = .02 to P = .009], rs1477363 [OR, 0.40-0.43; 95% CI, 0.18-0.88 to 0.23-0.81; P = .02 to P = .006], rs10410857 [OR, 0.36-0.41; 95% CI, 0.20-0.67 to 0.20-0.85; P = .02 to P = .001], and rs909010 [OR, 0.27-0.40; 95% CI, 0.12-0.62 to 0.16-0.58; P = .002 to P < .001]). Significant SNVs in both genes cluster downstream, flanking exons encoding the receptor site and SUR1-TRPM4 binding interface. Adding genetic variation to clinical models improved receiver operating characteristic curve performance from 0.6959 to 0.8030 (P = .003). CONCLUSIONS AND RELEVANCE In this genetic association study, 8 ABCC8 and TRPM4 SNVs were associated with IPH progression. Spatial clustering, brain-specific eQTL, and regulatory annotations suggest biological plausibility. These findings may have important implications for neurocritical care risk stratification, patient selection, and precision medicine, including an upcoming phase 3 trial design for SUR1-TRPM4 inhibition in severe TBI.
Collapse
Affiliation(s)
- Ruchira M. Jha
- Department of Neurology, Barrow Neurological Institute, Phoenix, Arizona
- Department of Neurological Surgery, Barrow Neurological Institute, Phoenix, Arizona
- Department of Neurobiology, Barrow Neurological Institute, Phoenix, Arizona
| | - Benjamin E. Zusman
- medical student at School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
- now affiliated with Department of Medicine, Massachusetts General Hospital, Harvard Medical School, Boston
| | - Ava M. Puccio
- Department of Neurological Surgery, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - David O. Okonkwo
- Department of Neurological Surgery, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Matthew Pease
- Department of Neurological Surgery, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Shashvat M. Desai
- Department of Neurology, Barrow Neurological Institute, Phoenix, Arizona
| | - Matthew Leach
- Department of Critical Care Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Yvette P. Conley
- School of Nursing, University of Pittsburgh, Pittsburgh, Pennsylvania
- Clinical and Translational Science Institute, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
| | - Patrick M. Kochanek
- Department of Critical Care Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
- Clinical and Translational Science Institute, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
- Department of Pediatrics, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
- Safar Center for Resuscitation Research, School of Medicine, University of Pittsburgh, Pittsburgh, Pennsylvania
| |
Collapse
|
10
|
Rahman A, Chikhi R, Medvedev P. Disk compression of k-mer sets. Algorithms Mol Biol 2021; 16:10. [PMID: 34154632 PMCID: PMC8218509 DOI: 10.1186/s13015-021-00192-7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2021] [Accepted: 06/08/2021] [Indexed: 12/23/2022] Open
Abstract
K-mer based methods have become prevalent in many areas of bioinformatics. In applications such as database search, they often work with large multi-terabyte-sized datasets. Storing such large datasets is a detriment to tool developers, tool users, and reproducibility efforts. General purpose compressors like gzip, or those designed for read data, are sub-optimal because they do not take into account the specific redundancy pattern in k-mer sets. In our earlier work (Rahman and Medvedev, RECOMB 2020), we presented an algorithm UST-Compress that uses a spectrum-preserving string set representation to compress a set of k-mers to disk. In this paper, we present two improved methods for disk compression of k-mer sets, called ESS-Compress and ESS-Tip-Compress. They use a more relaxed notion of string set representation to further remove redundancy from the representation of UST-Compress. We explore their behavior both theoretically and on real data. We show that they improve the compression sizes achieved by UST-Compress by up to 27 percent, across a breadth of datasets. We also derive lower bounds on how well this type of compression strategy can hope to do.
Collapse
Affiliation(s)
| | - Rayan Chikhi
- Department of Computational Biology, C3BI USR 3756 CNRS, Institut Pasteur, Paris, France
| | | |
Collapse
|
11
|
Khorsand P, Hormozdiari F. Nebula: ultra-efficient mapping-free structural variant genotyper. Nucleic Acids Res 2021; 49:e47. [PMID: 33503255 PMCID: PMC8096284 DOI: 10.1093/nar/gkab025] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2020] [Revised: 01/03/2021] [Accepted: 01/11/2021] [Indexed: 11/24/2022] Open
Abstract
Large scale catalogs of common genetic variants (including indels and structural variants) are being created using data from second and third generation whole-genome sequencing technologies. However, the genotyping of these variants in newly sequenced samples is a nontrivial task that requires extensive computational resources. Furthermore, current approaches are mostly limited to only specific types of variants and are generally prone to various errors and ambiguities when genotyping complex events. We are proposing an ultra-efficient approach for genotyping any type of structural variation that is not limited by the shortcomings and complexities of current mapping-based approaches. Our method Nebula utilizes the changes in the count of k-mers to predict the genotype of structural variants. We have shown that not only Nebula is an order of magnitude faster than mapping based approaches for genotyping structural variants, but also has comparable accuracy to state-of-the-art approaches. Furthermore, Nebula is a generic framework not limited to any specific type of event. Nebula is publicly available at https://github.com/Parsoa/Nebula.
Collapse
Affiliation(s)
| | - Fereydoun Hormozdiari
- Genome Center, UC Davis, Davis, California, 95616, USA.,UC Davis MIND Institute, Sacramento, California, 95817, USA.,Department of Biochemistry and Molecular Medicine, UC Davis, Sacramento, California, 95817, USA
| |
Collapse
|
12
|
Khorsand P, Denti L, Human Genome Structural Variant Consortium, Bonizzoni P, Chikhi R, Hormozdiari F. Comparative genome analysis using sample-specific string detection in accurate long reads. BIOINFORMATICS ADVANCES 2021; 1:vbab005. [PMID: 36700094 PMCID: PMC9710709 DOI: 10.1093/bioadv/vbab005] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Indexed: 01/28/2023]
Abstract
Motivation Comparative genome analysis of two or more whole-genome sequenced (WGS) samples is at the core of most applications in genomics. These include the discovery of genomic differences segregating in populations, case-control analysis in common diseases and diagnosing rare disorders. With the current progress of accurate long-read sequencing technologies (e.g. circular consensus sequencing from PacBio sequencers), we can dive into studying repeat regions of the genome (e.g. segmental duplications) and hard-to-detect variants (e.g. complex structural variants). Results We propose a novel framework for comparative genome analysis through the discovery of strings that are specific to one genome ('samples-specific' strings). We have developed a novel, accurate and efficient computational method for the discovery of sample-specific strings between two groups of WGS samples. The proposed approach will give us the ability to perform comparative genome analysis without the need to map the reads and is not hindered by shortcomings of the reference genome and mapping algorithms. We show that the proposed approach is capable of accurately finding sample-specific strings representing nearly all variation (>98%) reported across pairs or trios of WGS samples using accurate long reads (e.g. PacBio HiFi data). Availability and implementation Data, code and instructions for reproducing the results presented in this manuscript are publicly available at https://github.com/Parsoa/PingPong. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Collapse
Affiliation(s)
| | - Luca Denti
- Department of Computational Biology, Institut Pasteur, Paris 75015, France
| | | | - Paola Bonizzoni
- Department of Informatics, Systems and Communication, University of Milano-Bicocca, Milano, 20126, Italy,To whom correspondence should be addressed. or or
| | - Rayan Chikhi
- Department of Computational Biology, Institut Pasteur, Paris 75015, France,To whom correspondence should be addressed. or or
| | - Fereydoun Hormozdiari
- Genome Center, UC Davis, Davis, CA 95616, USA,UC Davis MIND Institute, Sacramento, CA 95817, USA,Department of Biochemistry and Molecular Medicine, Sacramento, UC Davis, Sacramento, CA 95817, USA,To whom correspondence should be addressed. or or
| |
Collapse
|
13
|
Břinda K, Baym M, Kucherov G. Simplitigs as an efficient and scalable representation of de Bruijn graphs. Genome Biol 2021; 22:96. [PMID: 33823902 PMCID: PMC8025321 DOI: 10.1186/s13059-021-02297-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2020] [Accepted: 02/10/2021] [Indexed: 12/30/2022] Open
Abstract
de Bruijn graphs play an essential role in bioinformatics, yet they lack a universal scalable representation. Here, we introduce simplitigs as a compact, efficient, and scalable representation, and ProphAsm, a fast algorithm for their computation. For the example of assemblies of model organisms and two bacterial pan-genomes, we compare simplitigs to unitigs, the best existing representation, and demonstrate that simplitigs provide a substantial improvement in the cumulative sequence length and their number. When combined with the commonly used Burrows-Wheeler Transform index, simplitigs reduce memory, and index loading and query times, as demonstrated with large-scale examples of GenBank bacterial pan-genomes.
Collapse
Affiliation(s)
- Karel Břinda
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA.
- Center for Communicable Disease Dynamics, Department of Epidemiology, Harvard T.H. Chan School of Public Health, Boston, USA.
| | - Michael Baym
- Department of Biomedical Informatics and Laboratory of Systems Pharmacology, Harvard Medical School, Boston, USA and Broad Institute of MIT and Harvard, Cambridge, USA
| | - Gregory Kucherov
- CNRS/LIGM Univ Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
| |
Collapse
|
14
|
Richmond PA, Kaye AM, Kounkou GJ, Av-Shalom TV, Wasserman WW. Demonstrating the utility of flexible sequence queries against indexed short reads with FlexTyper. PLoS Comput Biol 2021; 17:e1008815. [PMID: 33750951 PMCID: PMC8016220 DOI: 10.1371/journal.pcbi.1008815] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/09/2020] [Revised: 04/01/2021] [Accepted: 02/17/2021] [Indexed: 11/26/2022] Open
Abstract
Across the life sciences, processing next generation sequencing data commonly relies upon a computationally expensive process where reads are mapped onto a reference sequence. Prior to such processing, however, there is a vast amount of information that can be ascertained from the reads, potentially obviating the need for processing, or allowing optimized mapping approaches to be deployed. Here, we present a method termed FlexTyper which facilitates a “reverse mapping” approach in which high throughput sequence queries, in the form of k-mer searches, are run against indexed short-read datasets in order to extract useful information. This reverse mapping approach enables the rapid counting of target sequences of interest. We demonstrate FlexTyper’s utility for recovering depth of coverage, and accurate genotyping of SNP sites across the human genome. We show that genotyping unmapped reads can correctly inform a sample’s population, sex, and relatedness in a family setting. Detection of pathogen sequences within RNA-seq data was sensitive and accurate, performing comparably to existing methods, but with increased flexibility. We present two examples of ways in which this flexibility allows the analysis of genome features not well-represented in a linear reference. First, we analyze contigs from African genome sequencing studies, showing how they distribute across families from three distinct populations. Second, we show how gene-marking k-mers for the killer immune receptor locus allow allele detection in a region that is challenging for standard read mapping pipelines. The future adoption of the reverse mapping approach represented by FlexTyper will be enabled by more efficient methods for FM-index generation and biology-informed collections of reference queries. In the long-term, selection of population-specific references or weighting of edges in pan-population reference genome graphs will be possible using the FlexTyper approach. FlexTyper is available at https://github.com/wassermanlab/OpenFlexTyper. In the past 15 years, next generation sequencing technology has revolutionized our capacity to process and analyze DNA sequencing data. From agriculture to medicine, this technology is enabling a deeper understanding of the blueprint of life. Next generation sequencing data is composed of short sequences of DNA, referred to as “reads”, which are often shorter than 200 base pairs making them many orders of magnitude smaller than the entirety of a human genome. Gaining insights from this data has typically leveraged a reference-guided mapping approach, where the reads are aligned to a reference genome and then post-processed to gain actionable information such as presence or absence of genomic sequence, or variation between the reference genome and the sequenced sample. Many experts in the field of genomics have concluded that selecting a single, linear reference genome for mapping reads against is limiting, and several current research endeavors are focused on exploring options for improved analysis methods to unlock the full utility of sequencing data. Among these improvements are the usage of sex-matched genomes, population-specific reference genomes, and emergent graph-based reference pan-genomes. However, advanced methods that use raw DNA sequencing data to inform the choice of reference genome and guide the alignment of reads to enriched reference genomes are needed. Here we develop a method termed FlexTyper, which creates a searchable index of the short read data and enables flexible, user-guided queries to provide valuable insights without the need for reference-guided mapping. We demonstrate the utility of our method by identifying sample ancestry and sex in human whole genome sequencing data, detecting viral pathogen reads in RNA-seq data, African-enriched genome regions absent from the global reference, and killer-cell immune receptor alleles that are complex to discern using standard read mapping. We anticipate early adoption of FlexTyper within analysis pipelines as a pre-mapping component, and further envision the bioinformatics and genomics community will leverage the tool for creative uses of sequence queries from unmapped data.
Collapse
Affiliation(s)
- Phillip Andrew Richmond
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, Canada
| | - Alice Mary Kaye
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, Canada
| | - Godfrain Jacques Kounkou
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, Canada
| | - Tamar Vered Av-Shalom
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, Canada
| | - Wyeth W. Wasserman
- Centre for Molecular Medicine and Therapeutics, Department of Medical Genetics, BC Children’s Hospital Research Institute, University of British Columbia, Vancouver, Canada
- * E-mail:
| |
Collapse
|
15
|
Standage DS, Brown CT, Hormozdiari F. Kevlar: A Mapping-Free Framework for Accurate Discovery of De Novo Variants. iScience 2019; 18:28-36. [PMID: 31377530 PMCID: PMC6682328 DOI: 10.1016/j.isci.2019.07.032] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2019] [Revised: 06/24/2019] [Accepted: 07/19/2019] [Indexed: 01/05/2023] Open
Abstract
De novo genetic variants are an important source of causative variation in complex genetic disorders. Many methods for variant discovery rely on mapping reads to a reference genome, detecting numerous inherited variants irrelevant to the phenotype of interest. To distinguish between inherited and de novo variation, sequencing of families (parents and siblings) is commonly pursued. However, standard mapping-based approaches tend to have a high false-discovery rate for de novo variant prediction. Kevlar is a mapping-free method for de novo variant discovery, based on direct comparison of sequences between related individuals. Kevlar identifies high-abundance k-mers unique to the individual of interest. Reads containing these k-mers are partitioned into disjoint sets by shared k-mer content for variant calling, and preliminary variant predictions are sorted using a probabilistic score. We evaluated Kevlar on simulated and real datasets, demonstrating its ability to detect both de novo single-nucleotide variants and indels with high accuracy.
Collapse
Affiliation(s)
- Daniel S Standage
- Population Health and Reproduction, University of California, Davis, USA.
| | - C Titus Brown
- Population Health and Reproduction, University of California, Davis, USA; Genome Center, University of California, Davis, USA.
| | - Fereydoun Hormozdiari
- Genome Center, University of California, Davis, USA; MIND Institute, University of California, Davis, USA; Biochemistry and Molecular Medicine, University of California, Davis, 1 Shields Avenue, Davis, CA 95616, USA.
| |
Collapse
|
16
|
Denti L, Previtali M, Bernardini G, Schönhuth A, Bonizzoni P. MALVA: Genotyping by Mapping-free ALlele Detection of Known VAriants. iScience 2019; 18:20-27. [PMID: 31352182 PMCID: PMC6664100 DOI: 10.1016/j.isci.2019.07.011] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/11/2019] [Revised: 06/05/2019] [Accepted: 07/08/2019] [Indexed: 12/30/2022] Open
Abstract
The amount of genetic variation discovered in human populations is growing rapidly leading to challenging computational tasks, such as variant calling. Standard methods for addressing this problem include read mapping, a computationally expensive procedure; thus, mapping-free tools have been proposed in recent years. These tools focus on isolated, biallelic SNPs, providing limited support for multi-allelic SNPs and short insertions and deletions of nucleotides (indels). Here we introduce MALVA, a mapping-free method to genotype an individual from a sample of reads. MALVA is the first mapping-free tool able to genotype multi-allelic SNPs and indels, even in high-density genomic regions, and to effectively handle a huge number of variants. MALVA requires one order of magnitude less time to genotype a donor than alignment-based pipelines, providing similar accuracy. Remarkably, on indels, MALVA provides even better results than the most widely adopted variant discovery tools. MALVA is able to genotype multi-allelic SNPs and indels without mapping reads MALVA calls correctly more indels than the most widely adopted genotyping pipelines Mapping-free approaches are as accurate as alignment-based ones, while being faster
Collapse
Affiliation(s)
- Luca Denti
- Department of Computer Sciences, Systems and Communications, University of Milan-Bicocca, Piazza dell'Ateneo Nuovo, 1, 20126 Milan, Italy
| | - Marco Previtali
- Department of Computer Sciences, Systems and Communications, University of Milan-Bicocca, Piazza dell'Ateneo Nuovo, 1, 20126 Milan, Italy.
| | - Giulia Bernardini
- Department of Computer Sciences, Systems and Communications, University of Milan-Bicocca, Piazza dell'Ateneo Nuovo, 1, 20126 Milan, Italy
| | - Alexander Schönhuth
- Centrum Wiskunde & Informatica, Science Park 123, 1098 XG Amsterdam, The Netherlands
| | - Paola Bonizzoni
- Department of Computer Sciences, Systems and Communications, University of Milan-Bicocca, Piazza dell'Ateneo Nuovo, 1, 20126 Milan, Italy
| |
Collapse
|