1
|
Bokulich NA. Integrating sequence composition information into microbial diversity analyses with k-mer frequency counting. mSystems 2025; 10:e0155024. [PMID: 39976436 PMCID: PMC11915819 DOI: 10.1128/msystems.01550-24] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/18/2024] [Accepted: 01/23/2025] [Indexed: 02/21/2025] Open
Abstract
k-mer frequency information in biological sequences is used for a wide range of applications, including taxonomy classification, sequence similarity estimation, and supervised learning. However, in spite of its widespread utility, k-mer counting has been largely neglected for diversity estimation. This work examines the application of k-mer counting for alpha and beta diversity as well as supervised classification from microbiome marker-gene sequencing data sets (16S rRNA gene and full-length fungal internal transcribed spacer [ITS] sequences). Results demonstrate a close correspondence with phylogenetically aware diversity metrics, and advantages for using k-mer-based metrics for measuring microbial biodiversity in microbiome sequencing surveys. k-mer counting appears to be a suitable and efficient strategy for feature processing prior to diversity estimation as well as supervised learning in microbiome surveys. This allows the incorporation of subsequence-level information into diversity estimation without the computational cost of pairwise sequence alignment. k-mer counting is proposed as a complementary approach for feature processing prior to diversity estimation and supervised learning analyses, enabling large-scale reference-free profiling of microbiomes in biogeography, ecology, and biomedical data. A method for k-mer counting from marker-gene sequence data is implemented in the QIIME 2 plugin q2-kmerizer (https://github.com/bokulich-lab/q2-kmerizer).
IMPORTANCE k-mers are all of the subsequences of length k that comprise a sequence. Comparing the frequency of k-mers in DNA sequences yields valuable information about the composition of these sequences and their similarity. This work demonstrates that k-mer frequencies from marker-gene sequence surveys can be used to inform diversity estimates and machine learning predictions that incorporate sequence composition information. Alpha and beta diversity estimates based on k-mer frequencies closely correspond to phylogenetically aware diversity metrics, suggesting that k-mer-based diversity estimates are useful proxy measurements, especially when reliable phylogenies are not available, as is often the case for some DNA sequence targets such as internal transcribed spacer sequences.
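As an illustration of the k-mer counting idea described in this abstract, the following minimal Python sketch counts overlapping k-mers in a sequence and compares two count profiles with the Bray-Curtis dissimilarity. It is not the q2-kmerizer implementation: the function names `kmer_counts` and `bray_curtis` are hypothetical, and comparing two individual sequences (rather than per-sample feature tables) is a simplifying assumption.

```python
from collections import Counter


def kmer_counts(seq: str, k: int) -> Counter:
    """Count every overlapping subsequence of length k in seq (hypothetical helper)."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def bray_curtis(a: Counter, b: Counter) -> float:
    """Bray-Curtis dissimilarity between two k-mer count profiles (0 = identical)."""
    keys = set(a) | set(b)
    # Counter returns 0 for k-mers that are absent from a profile.
    shared = sum(min(a[x], b[x]) for x in keys)
    total = sum(a.values()) + sum(b.values())
    return 1.0 - 2.0 * shared / total if total else 0.0


if __name__ == "__main__":
    p = kmer_counts("ACGTACGTGG", 4)
    q = kmer_counts("ACGTTCGTGG", 4)
    print(bray_curtis(p, q))
```

Because the profiles are plain count vectors, any count-based diversity metric could be swapped in for Bray-Curtis without aligning the sequences.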
|
2
|
Majidian S, Hwang S, Zakeri M, Langmead B. EvANI benchmarking workflow for evolutionary distance estimation. bioRxiv: The Preprint Server for Biology 2025:2025.02.23.639716. [PMID: 40027788 PMCID: PMC11870633 DOI: 10.1101/2025.02.23.639716] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/05/2025]
Abstract
Advances in long-read sequencing technology have led to a rapid increase in high-quality genome assemblies. These make it possible to compare genome sequences across the Tree of Life, deepening our understanding of evolutionary relationships. Average nucleotide identity (ANI) is a distance measure that has been applied to species delineation, building of guide trees, and searching large sequence databases. Since computing ANI is computationally expensive, the field has increasingly turned to sketch-based approaches that use assumptions and heuristics to speed this up. We propose a suite of simulated and real benchmark datasets, together with a rank-correlation-based metric, to study how these assumptions and heuristics impact distance estimates. We call this evaluation framework EvANI. With EvANI, we show that ANIb is the ANI estimation algorithm that best captures tree distance, though it is also the least efficient. We show that k-mer based approaches are extremely efficient and have consistently strong accuracy. We also show that some clades have inter-sequence distances that are best computed using multiple values of k, e.g. k = 10 and k = 19 for Chlamydiales. Finally, we highlight that approaches based on maximal exact matches may represent an advantageous compromise, achieving an intermediate level of computational efficiency while avoiding over-reliance on a single fixed k-mer length.
Affiliation(s)
- Sina Majidian
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Stephen Hwang
  - XDBio Program, Johns Hopkins University, Baltimore, USA
- Mohsen Zakeri
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
- Ben Langmead
  - Department of Computer Science, Johns Hopkins University, Baltimore, USA
|
3
|
Moeckel C, Mareboina M, Konnaris MA, Chan CS, Mouratidis I, Montgomery A, Chantzi N, Pavlopoulos GA, Georgakopoulos-Soares I. A survey of k-mer methods and applications in bioinformatics. Comput Struct Biotechnol J 2024; 23:2289-2303. [PMID: 38840832 PMCID: PMC11152613 DOI: 10.1016/j.csbj.2024.05.025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2024] [Revised: 05/14/2024] [Accepted: 05/15/2024] [Indexed: 06/07/2024] Open
Abstract
The rapid progression of genomics and proteomics has been driven by the advent of advanced sequencing technologies, large, diverse, and readily available omics datasets, and the evolution of computational data processing capabilities. The vast amount of data generated by these advancements necessitates efficient algorithms to extract meaningful information. K-mers serve as a valuable tool when working with large sequencing datasets, offering several advantages in computational speed and memory efficiency and carrying the potential for intrinsic biological functionality. This review provides an overview of the methods, applications, and significance of k-mers in genomic and proteomic data analyses, as well as the utility of absent sequences, including nullomers and nullpeptides, in disease detection, vaccine development, therapeutics, and forensic science. Therefore, the review highlights the pivotal role of k-mers in addressing current genomic and proteomic problems and underscores their potential for future breakthroughs in research.
Affiliation(s)
- Camille Moeckel
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Manvita Mareboina
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Maxwell A. Konnaris
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Candace S.Y. Chan
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, San Francisco, CA, USA
- Ioannis Mouratidis
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
- Austin Montgomery
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Nikol Chantzi
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
- Ilias Georgakopoulos-Soares
  - Institute for Personalized Medicine, Department of Biochemistry and Molecular Biology, The Pennsylvania State University College of Medicine, Hershey, PA, USA
  - Huck Institute of the Life Sciences, Penn State University, University Park, Pennsylvania, USA
|
4
|
Spouge JL, Das P, Chen Y, Frith M. The Statistics of Parametrized Syncmers in a Simple Mutation Process Without Spurious Matches. J Comput Biol 2024; 31:1195-1210. [PMID: 39530391 PMCID: PMC11698668 DOI: 10.1089/cmb.2024.0508] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/16/2024] Open
Abstract
Introduction: Often, bioinformatics uses summary sketches to analyze next-generation sequencing data, but most sketches are not well understood statistically. Under a simple mutation model, Blanca et al. analyzed complete sketches, that is, the complete set of unassembled k-mers, from two closely related sequences. The analysis extracted a point mutation parameter θ quantifying the evolutionary distance between the two sequences. Methods: We extend the results of Blanca et al. for complete sketches to parametrized syncmer sketches with downsampling. A syncmer sketch can sample k-mers much more sparsely than a complete sketch. Consider the following simple mutation model disallowing insertions or deletions. Consider a reference sequence A (e.g., a subsequence from a reference genome), and mutate each nucleotide in it independently with probability θ to produce a mutated sequence B (corresponding to, e.g., a set of reads or draft assembly of a related genome). Then, syncmer counts alone yield an approximate Gaussian distribution for estimating θ. The assumption disallowing insertions and deletions motivates a check on the lengths of A and B. The syncmer count from B yields an approximate Gaussian distribution for its length, and a p-value can test the length of B against the length of A using syncmer counts alone. Results: The Gaussian distributions permit syncmer counts alone to estimate θ and mutated sequence length with a known sampling error. Under some circumstances, the results provide the sampling error for the Mash containment index when applied to syncmer counts. Conclusions: The approximate Gaussian distributions provide hypothesis tests and confidence intervals for phylogenetic distance and sequence length. Our methods are likely to generalize to sketches other than syncmers and may be useful in assembling reads and related applications.
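The simple mutation model in this abstract can be sketched in Python as follows. This covers only the complete-sketch setting that the abstract attributes to Blanca et al., not the syncmer extension, and it yields a point estimate without the Gaussian sampling error derived in the paper. The function names `mutate` and `estimate_theta` are hypothetical. The estimator relies on one fact that follows directly from the model: with independent substitutions at rate θ and no indels, a k-mer survives unmutated with probability (1 − θ)^k.

```python
import random


def mutate(seq: str, theta: float, rng: random.Random) -> str:
    """Substitute each nucleotide independently with probability theta (no indels)."""
    out = []
    for base in seq:
        if rng.random() < theta:
            # Pick one of the three other bases so a mutation always changes the site.
            out.append(rng.choice([b for b in "ACGT" if b != base]))
        else:
            out.append(base)
    return "".join(out)


def estimate_theta(a: str, b: str, k: int) -> float:
    """Point estimate of theta from the fraction of unmutated k-mers.

    Assumes a and b have equal length (the model disallows indels), so k-mers
    are compared position by position and spurious matches are ignored.
    """
    if len(a) != len(b):
        raise ValueError("sequences must have equal length under this model")
    n = len(a) - k + 1
    if n <= 0:
        raise ValueError("k is longer than the sequences")
    unmutated = sum(a[i:i + k] == b[i:i + k] for i in range(n))
    # Invert E[unmutated / n] = (1 - theta) ** k.
    return 1.0 - (unmutated / n) ** (1.0 / k)


if __name__ == "__main__":
    rng = random.Random(42)
    reference = "".join(rng.choice("ACGT") for _ in range(100_000))
    mutated = mutate(reference, 0.02, rng)
    print(estimate_theta(reference, mutated, k=21))
```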
Affiliation(s)
- John L. Spouge
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Pijush Das
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Ye Chen
  - Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
- Martin Frith
  - Graduate School of Frontier Sciences, University of Tokyo, Kashiwa, Japan
  - Artificial Intelligence Research Center, National Institute of Advanced Industrial Science and Technology (AIST), Tokyo, Japan
  - Computational Bio Big-Data Open Innovation Laboratory (CBBD-OIL), AIST, Tokyo, Japan
|
5
|
Frith MC, Shaw J, Spouge JL. How to optimally sample a sequence for rapid analysis. Bioinformatics 2023; 39:btad057. [PMID: 36702468 PMCID: PMC9907223 DOI: 10.1093/bioinformatics/btad057] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2022] [Accepted: 01/24/2023] [Indexed: 01/28/2023] Open
Abstract
MOTIVATION We face an increasing flood of genetic sequence data, from diverse sources, requiring rapid computational analysis. Rapid analysis can be achieved by sampling a subset of positions in each sequence. Previous sequence-sampling methods, such as minimizers, syncmers and minimally overlapping words, were developed by heuristic intuition, and are not optimal. RESULTS We present a sequence-sampling approach that provably optimizes sensitivity for a whole class of sequence comparison methods, for randomly evolving sequences. It is likely near-optimal for a wide range of alignment-based and alignment-free analyses. For real biological DNA, it increases specificity by avoiding simple repeats. Our approach generalizes universal hitting sets (which guarantee to sample a sequence at least once) and polar sets (which guarantee to sample a sequence at most once). This helps us understand how to do rapid sequence analysis as accurately as possible. AVAILABILITY AND IMPLEMENTATION Source code is freely available at https://gitlab.com/mcfrith/noverlap. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Martin C Frith
  - Artificial Intelligence Research Center, AIST, Tokyo 135-0064, Japan
  - Department of Computational Biology and Medical Sciences, Graduate School of Frontier Sciences, University of Tokyo, Chiba 277-8568, Japan
  - Computational Bio Big-Data Open Innovation Laboratory, AIST, Tokyo 169-8555, Japan
- Jim Shaw
  - Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
- John L Spouge
  - National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
|
6
|
Anjum N, Nabil RL, Rafi RI, Bayzid MS, Rahman MS. CD-MAWS: An Alignment-Free Phylogeny Estimation Method Using Cosine Distance on Minimal Absent Word Sets. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2023; 20:196-205. [PMID: 34928803 DOI: 10.1109/tcbb.2021.3136792] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Multiple sequence alignment has been the traditional and well-established approach to sequence analysis and comparison, though it is time and memory consuming. As the scale of sequencing data is increasing day by day, the importance of faster yet accurate alignment-free methods is on the rise. Several alignment-free sequence analysis methods have been established in the literature in recent years, which extract numerical features from genomic data to analyze sequences and also to estimate phylogenetic relationships among genes and species. Minimal Absent Word (MAW) is an effective concept for representing characteristics of a sequence in an alignment-free manner. In this study, we present CD-MAWS, a distance measure based on the cosine of the angle between composition vectors constructed using minimal absent words, for sequence analysis in a computationally inexpensive manner. We have benchmarked CD-MAWS using several AFProject datasets, such as the Fish mtDNA, E. coli, Plants, Shigella and Yersinia datasets, and found it to perform quite well. Applied to several other biological datasets such as mammal mtDNA, bacterial genomes and viral genomes, CD-MAWS resolved phylogenetic relationships similar to or better than state-of-the-art alignment-free methods such as Mash, Skmer, Co-phylog and kSNP3.
|
7
|
Birth N, Dencker T, Morgenstern B. Insertions and deletions as phylogenetic signal in an alignment-free context. PLoS Comput Biol 2022; 18:e1010303. [PMID: 35939516 PMCID: PMC9387925 DOI: 10.1371/journal.pcbi.1010303] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 10/18/2021] [Revised: 08/18/2022] [Accepted: 06/14/2022] [Indexed: 11/18/2022] Open
Abstract
Most methods for phylogenetic tree reconstruction are based on sequence alignments; they infer phylogenies from substitutions that may have occurred at the aligned sequence positions. Gaps in alignments are usually not employed as phylogenetic signal. In this paper, we explore an alignment-free approach that uses insertions and deletions (indels) as an additional source of information for phylogeny inference. For a set of four or more input sequences, we generate so-called quartet blocks of four putative homologous segments each. For pairs of such quartet blocks involving the same four sequences, we compare the distances between the two blocks in these sequences, to obtain hints about indels that may have happened between the blocks since the respective four sequences have evolved from their last common ancestor. A prototype implementation that we call Gap-SpaM is presented to infer phylogenetic trees from these data, using a quartet-tree approach or, alternatively, under the maximum-parsimony paradigm. This approach should not be regarded as an alternative to established methods, but rather as a complementary source of phylogenetic information. Interestingly, however, our software is able to produce phylogenetic trees from putative indels alone that are comparable to trees obtained with existing alignment-free methods.
Phylogenetic tree inference based on DNA or protein sequence comparison is a fundamental task in computational biology. Given a multiple alignment of a set of input sequences, most approaches compare aligned sequence positions to each other, to find a suitable tree, based on a model of molecular evolution. Insertions and deletions that may have happened since the input sequences evolved from their last common ancestor are ignored by most phylogeny methods. Herein, we show that insertions and deletions can provide an additional source of information for phylogeny inference, and that such information can be obtained with a simple alignment-free approach. We provide an implementation of this idea that we call Gap-SpaM. The proposed approach is complementary to existing phylogeny methods since it is based on a completely different source of information. It is, thus, not meant to be an alternative to those existing methods but rather a possible additional source of information for tree inference.
Affiliation(s)
- Niklas Birth
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
- Thomas Dencker
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Göttingen, Germany
  - Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
  - Campus-Institute Data Science (CIDAS), Göttingen, Germany
|
8
|
Blanke M, Morgenstern B. App-SpaM: phylogenetic placement of short reads without sequence alignment. Bioinformatics Advances 2021; 1:vbab027. [PMID: 36700102 PMCID: PMC9710606 DOI: 10.1093/bioadv/vbab027] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 08/31/2021] [Revised: 09/27/2021] [Accepted: 10/11/2021] [Indexed: 01/28/2023]
Abstract
Motivation Phylogenetic placement is the task of placing a query sequence of unknown taxonomic origin into a given phylogenetic tree of a set of reference sequences. A major field of application of such methods is, for example, the taxonomic identification of reads in metabarcoding or metagenomic studies. Several approaches to phylogenetic placement have been proposed in recent years. The most accurate of them requires a multiple sequence alignment of the references as input. However, calculating multiple alignments is not only time-consuming but also limits the applicability of these approaches. Results Herein, we propose Alignment-free phylogenetic placement algorithm based on Spaced-word Matches (App-SpaM), an efficient algorithm for the phylogenetic placement of short sequencing reads on a tree of a set of reference sequences. App-SpaM produces results of high quality that are on a par with the best available approaches to phylogenetic placement, while our software is two orders of magnitude faster than these existing methods. Our approach neither requires a multiple alignment of the reference sequences nor alignments of the queries to the references. This enables App-SpaM to perform phylogenetic placement on a broad variety of datasets. Availability and implementation The source code of App-SpaM is freely available on Github at https://github.com/matthiasblanke/App-SpaM together with detailed instructions for installation and settings. App-SpaM is furthermore available as a Conda-package on the Bioconda channel. Contact matthias.blanke@biologie.uni-goettingen.de. Supplementary information Supplementary data are available at Bioinformatics Advances online.
Affiliation(s)
- Matthias Blanke
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany
  - International Max Planck Research School for Genome Science, Göttingen 37077, Germany
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Georg-August-University Göttingen, Göttingen 37077, Germany
  - Campus-Institute Data Science (CIDAS), Göttingen 37077, Germany
|
9
|
Sequence Comparison Without Alignment: The SpaM Approaches. Methods in Molecular Biology (Clifton, N.J.) 2021; 2231:121-134. [PMID: 33289890 DOI: 10.1007/978-1-0716-1036-7_8] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 10/22/2022]
Abstract
Sequence alignment is at the heart of DNA and protein sequence analysis. For the data volumes that are nowadays produced by massively parallel sequencing technologies, however, pairwise and multiple alignment methods are often too slow. Therefore, fast alignment-free approaches to sequence comparison have become popular in recent years. Most of these approaches are based on word frequencies, for words of a fixed length, or on word-matching statistics. Other approaches use the length of maximal word matches. While these methods are very fast, most of them rely on ad hoc measures of sequence similarity or dissimilarity that are hard to interpret. In this chapter, I describe a number of alignment-free methods that we developed in recent years. Our approaches are based on spaced-word matches ("SpaM"), i.e. on inexact word matches that are allowed to contain mismatches at certain pre-defined positions. Unlike most previous alignment-free approaches, our approaches are able to accurately estimate phylogenetic distances between DNA or protein sequences using a stochastic model of molecular evolution.
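The notion of a spaced-word match described in this abstract can be illustrated with a short Python sketch. It only shows how a binary pattern selects match positions and ignores the rest; it is not the SpaM software and does not include the distance estimation or any filtering of background matches. The function names, and the convention that `1` marks a match position and `0` a don't-care position, are assumptions made for illustration.

```python
from collections import Counter


def spaced_words(seq: str, pattern: str) -> list[str]:
    """Extract the spaced word at every window of seq.

    pattern is a string of '1' (match position) and '0' (don't-care position);
    only the characters at match positions are kept.
    """
    care = [i for i, c in enumerate(pattern) if c == "1"]
    width = len(pattern)
    return [
        "".join(seq[start + i] for i in care)
        for start in range(len(seq) - width + 1)
    ]


def count_spaced_word_matches(a: str, b: str, pattern: str) -> int:
    """Count window pairs (one in a, one in b) that agree at all match positions."""
    words_a = Counter(spaced_words(a, pattern))
    words_b = Counter(spaced_words(b, pattern))
    return sum(n * words_b[w] for w, n in words_a.items())


if __name__ == "__main__":
    # The middle position is a don't-care, so the C/T mismatch is tolerated.
    print(count_spaced_word_matches("ACG", "ATG", "101"))
    # A contiguous pattern behaves like an exact 3-mer match and finds nothing.
    print(count_spaced_word_matches("ACG", "ATG", "111"))
```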
|
10
|
Chakraborty A, Morgenstern B, Bandyopadhyay S. S-conLSH: alignment-free gapped mapping of noisy long reads. BMC Bioinformatics 2021; 22:64. [PMID: 33573603 PMCID: PMC7879691 DOI: 10.1186/s12859-020-03918-3] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 07/22/2020] [Accepted: 12/02/2020] [Indexed: 11/16/2022] Open
Abstract
Background The advancement of SMRT technology has unfolded new opportunities of genome analysis with its longer read length and low GC bias. Alignment of the reads to their appropriate positions in the respective reference genome is the first but costliest step of any analysis pipeline based on SMRT sequencing. However, the state-of-the-art aligners often fail to identify distant homologies due to lack of conserved regions, caused by frequent genetic duplication and recombination. Therefore, we developed a novel alignment-free method of sequence mapping that is fast and accurate. Results We present a new mapper called S-conLSH that uses Spaced context based Locality Sensitive Hashing. With multiple spaced patterns, S-conLSH facilitates a gapped mapping of noisy long reads to the corresponding target locations of a reference genome. We have examined the performance of the proposed method on 5 different real and simulated datasets. S-conLSH is at least 2 times faster than the recently developed method lordFAST. It achieves a sensitivity of 99%, without using any traditional base-to-base alignment, on human simulated sequence data. By default, S-conLSH provides an alignment-free mapping in PAF format. However, it has an option of generating aligned output as SAM-file, if it is required for any downstream processing. Conclusions S-conLSH is one of the first alignment-free reference genome mapping tools achieving a high level of sensitivity. The spaced-context is especially suitable for extracting distant similarities. The variable-length spaced-seeds or patterns add flexibility to the proposed algorithm by introducing gapped mapping of the noisy long reads. Therefore, S-conLSH may be considered as a prominent direction towards alignment-free sequence analysis.
Affiliation(s)
- Angana Chakraborty
  - Department of Computer Science, West Bengal Education Service, Kolkata, India
- Burkhard Morgenstern
  - Department of Bioinformatics (IMG), University of Göttingen, 37077, Göttingen, Germany.
|
11
|
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. Entropy (Basel, Switzerland) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and have been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information-theory-based concepts and describe their key applications in multiple major areas of research in computational biology: gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Affiliation(s)
- Pritam Chanda
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
  - Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
- Eduardo Costa
  - Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
- Jie Hu
  - Corteva Agriscience™, Indianapolis, IN 46268, USA
- Rasna Walia
  - Corteva Agriscience™, Johnston, IA 50131, USA
|
12
|
Dencker T, Leimeister CA, Gerth M, Bleidorn C, Snir S, Morgenstern B. 'Multi-SpaM': a maximum-likelihood approach to phylogeny reconstruction using multiple spaced-word matches and quartet trees. NAR Genom Bioinform 2020; 2:lqz013. [PMID: 33575565 PMCID: PMC7671388 DOI: 10.1093/nargab/lqz013] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Received: 04/30/2019] [Revised: 07/31/2019] [Accepted: 10/13/2019] [Indexed: 02/03/2023] Open
Abstract
Word-based or 'alignment-free' methods for phylogeny inference have become popular in recent years. These methods are much faster than traditional, alignment-based approaches, but they are generally less accurate. Most alignment-free methods calculate 'pairwise' distances between nucleic-acid or protein sequences; these distance values can then be used as input for tree-reconstruction programs such as neighbor-joining. In this paper, we propose the first word-based phylogeny approach that is based on 'multiple' sequence comparison and 'maximum likelihood'. Our algorithm first samples small, gap-free alignments involving four taxa each. For each of these alignments, it then calculates a quartet tree and, finally, the program 'Quartet MaxCut' is used to infer a super tree for the full set of input taxa from the calculated quartet trees. Experimental results show that trees produced with our approach are of high quality.
Affiliation(s)
- Thomas Dencker
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Chris-André Leimeister
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Michael Gerth
  - Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
- Christoph Bleidorn
  - Department of Animal Evolution and Biodiversity, Universität Göttingen, Untere Karspüle 2, 37073 Göttingen, Germany
  - Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
- Sagi Snir
  - Institute of Evolution, Department of Evolutionary and Environmental Biology, University of Haifa, 199 Aba Khoushy Ave. Mount Carmel, Haifa, Israel
- Burkhard Morgenstern
  - Department of Bioinformatics, Institute of Microbiology and Genetics, Universität Göttingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
  - Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen, Germany
|
13
|
Röhling S, Linne A, Schellhorn J, Hosseini M, Dencker T, Morgenstern B. The number of k-mer matches between two DNA sequences as a function of k and applications to estimate phylogenetic distances. PLoS One 2020; 15:e0228070. [PMID: 32040534 PMCID: PMC7010260 DOI: 10.1371/journal.pone.0228070] [Citation(s) in RCA: 20] [Impact Index Per Article: 4.0] [Received: 01/07/2020] [Accepted: 01/08/2020] [Indexed: 12/14/2022] Open
Abstract
We study the number Nk of length-k word matches between pairs of evolutionarily related DNA sequences, as a function of k. We show that the Jukes-Cantor distance between two genome sequences (i.e. the number of substitutions per site that occurred since they evolved from their last common ancestor) can be estimated from the slope of a function F that depends on Nk and that is affine-linear within a certain range of k. Integers kmin and kmax can be calculated depending on the length of the input sequences, such that the slope of F in the relevant range can be estimated from the values F(kmin) and F(kmax). This approach can be generalized to so-called Spaced-word Matches (SpaM), where mismatches are allowed at positions specified by a user-defined binary pattern. Based on these theoretical results, we implemented a prototype software program for alignment-free sequence comparison called Slope-SpaM. Test runs on real and simulated sequence data show that Slope-SpaM can accurately estimate phylogenetic distances for distances up to around 0.5 substitutions per position. The statistical stability of our results is improved if spaced words are used instead of contiguous words. Unlike previous alignment-free methods that are based on the number of (spaced) word matches, Slope-SpaM produces accurate results, even if sequences share only local homologies.
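For readers unfamiliar with the Jukes-Cantor distance mentioned in this abstract, the sketch below gives the classical Jukes-Cantor correction, which converts an observed proportion p of mismatching sites into an estimated number of substitutions per site, d = −(3/4) ln(1 − 4p/3). This is the textbook formula only. It is not the slope-based estimator of Slope-SpaM, whose function F is not specified in the abstract, and the function name `jukes_cantor_distance` is hypothetical.

```python
import math


def jukes_cantor_distance(p: float) -> float:
    """Classical Jukes-Cantor distance from an observed mismatch proportion p.

    Returns the estimated number of substitutions per site. The correction is
    only defined for 0 <= p < 0.75, the saturation point for four equally
    likely nucleotides.
    """
    if not 0.0 <= p < 0.75:
        raise ValueError("p must satisfy 0 <= p < 0.75")
    return -0.75 * math.log(1.0 - 4.0 * p / 3.0)


if __name__ == "__main__":
    # The corrected distance exceeds the raw proportion because it accounts
    # for multiple substitutions at the same site.
    print(jukes_cantor_distance(0.1))
```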
Affiliation(s)
- Sophie Röhling
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Alexander Linne
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Thomas Dencker
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Göttingen, Germany
|
14
|
Seo H, Song YJ, Cho K, Cho DH. Specificity Analysis of Genome Based on Statistically Identical K-Words With Same Base Combination. IEEE Open Journal of Engineering in Medicine and Biology 2020; 1:214-219. [PMID: 35402963 PMCID: PMC8983152 DOI: 10.1109/ojemb.2020.3009055] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2020] [Revised: 06/17/2020] [Accepted: 06/29/2020] [Indexed: 11/25/2022]
Abstract
Goal: Individual characteristics are determined through a genome consisting of a complex base combination. This base combination is reflected in the k-word profile, which represents the number of consecutive k bases. Therefore, it is important to analyze the genome-specific statistical specificity in the k-word profile to understand the characteristics of the genome. In this paper, we propose a new k-word-based method to analyze genome-specific properties. Methods: We define k-words consisting of the same number of bases as statistically identical k-words. The statistically identical k-words are estimated to appear at a similar frequency by statistical prediction. However, this may not be true in the genome because it is not a random list of bases. The ratio between frequencies of two statistically identical k-words can then be used to investigate the statistical specificity of the genome reflected in the k-word profile. In order to find important ratios representing genomic characteristics, a reference value is calculated that results in a minimum error when classifying data by ratio alone. Finally, we propose a genetic algorithm-based search algorithm to select a minimum set of ratios useful for classification. Results: The proposed method was applied to the full-length sequence of microorganisms for pathogenicity classification. The classification accuracy of the proposed algorithm was similar to that of conventional methods while using only a few features. Conclusions: We proposed a new method to investigate the genome-specific statistical specificity in the k-word profile which can be applied to find important properties of the genome and classify genome sequences.
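The idea of comparing k-words that share the same base composition can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: the function names are hypothetical, only k-words actually observed in the sequence are paired (which avoids division by zero), and the reference-value calculation and the genetic-algorithm search described in the abstract are not included.

```python
from collections import Counter, defaultdict
from itertools import combinations


def kword_profile(seq, k):
    """Count every contiguous k-word in seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def composition_classes(profile):
    """Group observed k-words by base composition (their sorted letters)."""
    classes = defaultdict(list)
    for word in profile:
        classes["".join(sorted(word))].append(word)
    return classes


def identical_word_ratios(seq, k):
    """Frequency ratio for each pair of observed k-words with equal composition."""
    profile = kword_profile(seq, k)
    ratios = {}
    for words in composition_classes(profile).values():
        for w1, w2 in combinations(sorted(words), 2):
            ratios[(w1, w2)] = profile[w1] / profile[w2]
    return ratios


print(identical_word_ratios("ACACAC", 2))
```

In a random sequence such ratios would be expected to sit near 1, so ratios that deviate strongly are the candidates for genome-specific features.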
Affiliation(s)
- Hyein Seo
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 300-010, South Korea
- Yong-Joon Song
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 300-010, South Korea
- Kiho Cho
- Department of Surgery, University of California, Sacramento, California 95064, USA
- Dong-Ho Cho
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology (KAIST), Daejeon 300-010, South Korea
|
15
|
Read-SpaM: assembly-free and alignment-free comparison of bacterial genomes with low sequencing coverage. BMC Bioinformatics 2019; 20:638. [PMID: 31842735 PMCID: PMC6916211 DOI: 10.1186/s12859-019-3205-7] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.2] [Indexed: 11/10/2022]
Abstract
BACKGROUND In many fields of biomedical research, it is important to estimate phylogenetic distances between taxa based on low-coverage sequencing reads. Major applications are, for example, phylogeny reconstruction, species identification from small sequencing samples, or bacterial strain typing in medical diagnostics. RESULTS We adapted our previously developed software program Filtered Spaced-Word Matches (FSWM) for alignment-free phylogeny reconstruction to take unassembled reads as input; we call this implementation Read-SpaM. CONCLUSIONS Test runs on simulated reads from semi-artificial and real-world bacterial genomes show that our approach can estimate phylogenetic distances with high accuracy, even for large evolutionary distances and for very low sequencing coverage.
|
16
|
Block alignment: New representation and comparison method to study evolution of genomes. Genomics 2019; 111:1590-1603. [DOI: 10.1016/j.ygeno.2018.11.003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 07/15/2018] [Revised: 10/13/2018] [Accepted: 11/05/2018] [Indexed: 01/22/2023]
|
17
|
Leimeister CA, Dencker T, Morgenstern B. Accurate multiple alignment of distantly related genome sequences using filtered spaced word matches as anchor points. Bioinformatics 2019; 35:211-218. [PMID: 29992260 PMCID: PMC6330006 DOI: 10.1093/bioinformatics/bty592] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Received: 05/23/2017] [Accepted: 07/09/2018] [Indexed: 01/30/2023]
Abstract
Motivation: Most methods for pairwise and multiple genome alignment use fast local homology search tools to identify anchor points, i.e. high-scoring local alignments of the input sequences. Sequence segments between those anchor points are then aligned with slower, more sensitive methods. Finding suitable anchor points is therefore crucial for genome sequence comparison; speed and sensitivity of genome alignment depend on the underlying anchoring methods. Results: In this article, we use filtered spaced word matches to generate anchor points for genome alignment. For a given binary pattern representing match and don't-care positions, we first search for spaced-word matches, i.e. ungapped local pairwise alignments with matching nucleotides at the match positions of the pattern and possible mismatches at the don't-care positions. Those spaced-word matches that have similarity scores above some threshold value are then extended using a standard X-drop algorithm; the resulting local alignments are used as anchor points. To evaluate this approach, we used the popular multiple-genome-alignment pipeline Mugsy and replaced the exact word matches that Mugsy uses as anchor points with our spaced-word-based anchor points. For closely related genome sequences, the two anchoring procedures lead to multiple alignments of similar quality. For distantly related genomes, however, alignments calculated with our filtered-spaced-word matches are superior to alignments produced with the original Mugsy program where exact word matches are used to find anchor points. Availability and implementation: http://spacedanchor.gobics.de. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics; Center for Computational Sciences, University of Goettingen, Goettingen, Germany
|
18
|
Kumar V, Vollbrecht T, Chernyshev M, Mohan S, Hanst B, Bavafa N, Lorenzo A, Kumar N, Ketteringham R, Eren K, Golden M, Oliveira MF, Murrell B. Long-read amplicon denoising. Nucleic Acids Res 2019; 47:e104. [PMID: 31418021 PMCID: PMC6765106 DOI: 10.1093/nar/gkz657] [Citation(s) in RCA: 22] [Impact Index Per Article: 3.7] [Received: 02/12/2019] [Revised: 07/03/2019] [Accepted: 07/17/2019] [Indexed: 01/03/2023]
Abstract
Long-read next-generation amplicon sequencing shows promise for studying complete genes or genomes from complex and diverse populations. Current long-read sequencing technologies have challenging error profiles, hindering data processing and incorporation into downstream analyses. Here we consider the problem of how to reconstruct, free of sequencing error, the true sequence variants and their associated frequencies from PacBio reads. Called 'amplicon denoising', this problem has been extensively studied for short-read sequencing technologies, but current solutions do not always successfully generalize to long reads with high indel error rates. We introduce two methods: one that runs nearly instantly and is very accurate for medium-length reads and high template coverage, and another, slower method that is more robust when reads are very long or coverage is lower. On two Mock Virus Community datasets with ground truth, each sequenced on a different PacBio instrument, and on a number of simulated datasets, we compare our two approaches to each other and to existing algorithms. We outperform all tested methods in accuracy, with competitive run times even for our slower method, successfully discriminating templates that differ by just a single nucleotide. Julia implementations of Fast Amplicon Denoising (FAD) and Robust Amplicon Denoising (RAD), and a webserver interface, are freely available.
Affiliation(s)
- Venkatesh Kumar
- Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm 17177, Sweden
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Thomas Vollbrecht
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Mark Chernyshev
- Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm 17177, Sweden
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Sanjay Mohan
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Brian Hanst
- Department of Biology, University of California, San Diego, La Jolla 92093, CA, USA
- Nicholas Bavafa
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Antonia Lorenzo
- Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm 17177, Sweden
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Nikesh Kumar
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Robert Ketteringham
- Department of Pathology, Institute of Infectious Diseases and Molecular Medicine, Faculty of Health Science, University of Cape Town, Cape Town 7925, South Africa
- Kemal Eren
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Michael Golden
- Department of Statistics, University of Oxford, Oxford OX1 3LB, UK
- Michelli F Oliveira
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
- Ben Murrell
- Department of Microbiology, Tumor and Cell Biology, Karolinska Institutet, Stockholm 17177, Sweden
- Department of Medicine, University of California, San Diego, La Jolla 92093, CA, USA
|
19
|
Zielezinski A, Girgis HZ, Bernard G, Leimeister CA, Tang K, Dencker T, Lau AK, Röhling S, Choi JJ, Waterman MS, Comin M, Kim SH, Vinga S, Almeida JS, Chan CX, James BT, Sun F, Morgenstern B, Karlowski WM. Benchmarking of alignment-free sequence comparison methods. Genome Biol 2019; 20:144. [PMID: 31345254 PMCID: PMC6659240 DOI: 10.1186/s13059-019-1755-7] [Citation(s) in RCA: 113] [Impact Index Per Article: 18.8] [Received: 05/16/2019] [Accepted: 07/03/2019] [Indexed: 11/22/2022]
Abstract
BACKGROUND Alignment-free (AF) sequence comparison is attracting persistent interest driven by data-intensive applications. Hence, many AF procedures have been proposed in recent years, but a lack of a clearly defined benchmarking consensus hampers their performance assessment. RESULTS Here, we present a community resource (http://afproject.org) to establish standards for comparing alignment-free approaches across different areas of sequence-based research. We characterize 74 AF methods available in 24 software tools for five research applications, namely, protein sequence classification, gene tree inference, regulatory element detection, genome-based phylogenetic inference, and reconstruction of species trees under horizontal gene transfer and recombination events. CONCLUSION The interactive web service allows researchers to explore the performance of alignment-free tools relevant to their data types and analytical goals. It also allows method developers to assess their own algorithms and compare them with current state-of-the-art tools, accelerating the development of new, more accurate AF solutions.
Affiliation(s)
- Andrzej Zielezinski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland
- Hani Z Girgis
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Chris-Andre Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Kujin Tang
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Thomas Dencker
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Anna Katharina Lau
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Sophie Röhling
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Jae Jin Choi
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Michael S Waterman
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Matteo Comin
- Department of Information Engineering, University of Padova, Padova, Italy
- Sung-Hou Kim
- Department of Chemistry, University of California, Berkeley, CA, 94720, USA
- Molecular Biophysics & Integrated Bioimaging Division, Lawrence Berkeley National Laboratory, Berkeley, CA, 94720, USA
- Susana Vinga
- INESC-ID, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- IDMEC, Instituto Superior Técnico, Universidade de Lisboa, Av. Rovisco Pais 1, 1049-001, Lisbon, Portugal
- Jonas S Almeida
- Division of Cancer Epidemiology and Genetics (DCEG), National Cancer Institute (NIH/NCI), Bethesda, USA
- Cheong Xin Chan
- Institute for Molecular Bioscience, and School of Chemistry and Molecular Biosciences, The University of Queensland, Brisbane, QLD, 4072, Australia
- Benjamin T James
- Tandy School of Computer Science, The University of Tulsa, 800 South Tucker Drive, Tulsa, OK, 74104, USA
- Fengzhu Sun
- Department of Biological Sciences, Quantitative and Computational Biology Program, University of Southern California, Los Angeles, CA, 90089, USA
- Centre for Computational Systems Biology, School of Mathematical Sciences, Fudan University, Shanghai, 200433, China
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Göttingen, Goldschmidtstr. 1, 37077, Göttingen, Germany
- Wojciech M Karlowski
- Department of Computational Biology, Faculty of Biology, Adam Mickiewicz University Poznan, Uniwersytetu Poznańskiego 6, 61-614, Poznan, Poland.
|
20
|
Kucherov G. Evolution of biosequence search algorithms: a brief survey. Bioinformatics 2019; 35:3547-3552. [DOI: 10.1093/bioinformatics/btz272] [Citation(s) in RCA: 23] [Impact Index Per Article: 3.8] [Received: 12/20/2018] [Revised: 04/01/2019] [Accepted: 04/11/2019] [Indexed: 11/14/2022]
Abstract
Motivation
Although modern high-throughput biomolecular technologies produce various types of data, biosequence data remain at the core of bioinformatic analyses. However, computational techniques for dealing with these data have evolved dramatically.
Results
In this bird’s-eye review, we survey the evolution of the main algorithmic techniques for comparing and searching biological sequences. We highlight key algorithmic ideas that emerged in response to several interconnected factors: shifts of biological analytical paradigm, the advent of new sequencing technologies, and a substantial increase in the size of the available data. We discuss the expansion of alignment-free techniques, which are replacing alignment-based algorithms in large-scale analyses. We further emphasize recently emerged and growing applications of sketching methods, which support comparison of massive datasets such as metagenomics samples. Finally, we focus on the transition to population genomics and outline associated algorithmic challenges.
Affiliation(s)
- Gregory Kucherov
- CNRS and LIGM/University of Paris-Est, Marne-la-Vallée, France
- SkolTech, Moscow, Russia
|
21
|
Saw AK, Raj G, Das M, Talukdar NC, Tripathy BC, Nandi S. Alignment-free method for DNA sequence clustering using Fuzzy integral similarity. Sci Rep 2019; 9:3753. [PMID: 30842590 PMCID: PMC6403383 DOI: 10.1038/s41598-019-40452-6] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 07/31/2018] [Accepted: 01/28/2019] [Indexed: 12/28/2022]
Abstract
The growing amount of sequence data in private and public databases produced by next-generation sequencing poses new challenges because of the limitations of alignment-based methods for sequence comparison. There is therefore a strong need for faster sequence analysis algorithms. In this study, we developed an alignment-free algorithm for faster sequence analysis. The novelty of our approach is the inclusion of a fuzzy integral with a Markov chain for sequence analysis in the alignment-free model. The method estimates the parameters of a Markov chain by considering the frequencies of occurrence of all possible nucleotide pairs from each DNA sequence. These estimated Markov chain parameters were used to calculate similarity among all pairwise combinations of DNA sequences based on a fuzzy integral algorithm. The resulting similarity matrix is used as input for the neighbor program in the PHYLIP package for phylogenetic tree construction. Our method was tested on eight benchmark datasets and on in-house generated datasets (18S rDNA sequences from 11 arbuscular mycorrhizal fungi (AMF) and 16S rDNA sequences of 40 bacterial isolates from plant interior). The results indicate that the fuzzy integral algorithm is an efficient and feasible alignment-free method for sequence analysis on the genomic scale.
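The first step described in this abstract, estimating Markov chain parameters from nucleotide-pair frequencies, can be sketched as below. The function name is hypothetical, a first-order chain over A, C, G and T is assumed, and the fuzzy integral similarity itself is not implemented; pairs containing any other character are simply ignored.

```python
from collections import Counter

BASES = "ACGT"


def transition_matrix(seq):
    """Estimate first-order transition probabilities P(next base | current base)."""
    pair_counts = Counter(zip(seq, seq[1:]))  # counts of adjacent nucleotide pairs
    matrix = {}
    for a in BASES:
        row_total = sum(pair_counts[(a, b)] for b in BASES)
        # A base that is never followed by A/C/G/T gets an all-zero row.
        matrix[a] = {
            b: (pair_counts[(a, b)] / row_total if row_total else 0.0)
            for b in BASES
        }
    return matrix


for base, row in transition_matrix("AACAGTTACG").items():
    print(base, row)
```

Each sequence is thereby reduced to a 4 x 4 table of parameters, which is the fixed-size representation that a downstream similarity measure can compare without aligning the sequences.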
Affiliation(s)
- Ajay Kumar Saw
- Institute of Advanced Study in Science and Technology, Mathematical Sciences Division, Guwahati, 781035, India
- Garima Raj
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Manashi Das
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Narayan Chandra Talukdar
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India
- Soumyadeep Nandi
- Institute of Advanced Study in Science and Technology, Life Science Division, Guwahati, 781035, India.
|
22
|
Leimeister CA, Schellhorn J, Dörrer S, Gerth M, Bleidorn C, Morgenstern B. Prot-SpaM: fast alignment-free phylogeny reconstruction based on whole-proteome sequences. Gigascience 2019; 8:giy148. [PMID: 30535314 PMCID: PMC6436989 DOI: 10.1093/gigascience/giy148] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.7] [Received: 05/10/2018] [Revised: 09/10/2018] [Accepted: 11/20/2018] [Indexed: 11/20/2022]
Abstract
Word-based or 'alignment-free' sequence comparison has become an active research area in bioinformatics. While previous word-frequency approaches calculated rough measures of sequence similarity or dissimilarity, some new alignment-free methods are able to accurately estimate phylogenetic distances between genomic sequences. One of these approaches is Filtered Spaced Word Matches. Here, we extend this approach to estimate evolutionary distances between complete or incomplete proteomes; our implementation of this approach is called Prot-SpaM. We compare the performance of Prot-SpaM to other alignment-free methods on simulated sequences and on various groups of eukaryotic and prokaryotic taxa. Prot-SpaM can be used to calculate high-quality phylogenetic trees for dozens of whole-proteome sequences in a matter of seconds or minutes and often outperforms other alignment-free approaches. The source code of our software is available through Github: https://github.com/jschellh/ProtSpaM.
Affiliation(s)
- Chris-Andre Leimeister
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Jendrik Schellhorn
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Svenja Dörrer
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Michael Gerth
- Institute for Integrative Biology, University of Liverpool, Biosciences Building, Crown Street, L69 7ZB Liverpool, UK
- Christoph Bleidorn
- University of Göttingen, Department of Animal Evolution and Biodiversity, Untere Karspüle 2, 37073 Göttingen, Germany
- Museo Nacional de Ciencias Naturales, Spanish National Research Council (CSIC), 28006 Madrid, Spain
- Burkhard Morgenstern
- University of Göttingen, Department of Bioinformatics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Göttingen Center of Molecular Biosciences (GZMB), Justus-von-Liebig-Weg 11, 37077 Göttingen
|
23
|
Sarmashghi S, Bohmann K, Gilbert MTP, Bafna V, Mirarab S. Skmer: assembly-free and alignment-free sample identification using genome skims. Genome Biol 2019; 20:34. [PMID: 30760303 PMCID: PMC6374904 DOI: 10.1186/s13059-019-1632-4] [Citation(s) in RCA: 61] [Impact Index Per Article: 10.2] [Received: 03/04/2018] [Accepted: 01/16/2019] [Indexed: 01/10/2023]
Abstract
The ability to inexpensively describe taxonomic diversity is critical in this era of rapid climate and biodiversity changes. The recent genome-skimming approach extends current barcoding practices beyond short markers by applying low-pass sequencing and recovering whole organelle genomes computationally. This approach discards the nuclear DNA, which constitutes the vast majority of the data. In contrast, we suggest using all unassembled reads. We introduce an assembly-free and alignment-free tool, Skmer, to compute genomic distances between the query and reference genome skims. Skmer shows excellent accuracy in estimating distances and identifying the closest match in reference datasets.
Affiliation(s)
- Shahab Sarmashghi
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
- Kristine Bohmann
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- School of Biological Sciences, University of East Anglia, Norwich, Norfolk UK
- M. Thomas P. Gilbert
- Evolutionary Genomics, Natural History Museum of Denmark, University of Copenhagen, Copenhagen, Denmark
- Norwegian University of Science and Technology, Trondheim, 7491 Norway
- Vineet Bafna
- Department of Computer Science & Engineering, University of California, San Diego, La Jolla, 92093 CA USA
- Siavash Mirarab
- Department of Electrical & Computer Engineering, University of California, San Diego, La Jolla, 92093 CA USA
|
24
|
Privacy-Preserving Similar Patient Queries for Combined Biomedical Data. Proceedings on Privacy Enhancing Technologies 2018. [DOI: 10.2478/popets-2019-0004] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Indexed: 11/20/2022]
Abstract
The decreasing costs of molecular profiling have fueled the biomedical research community with a plethora of new types of biomedical data, enabling a breakthrough towards more precise and personalized medicine. Naturally, the increasing availability of data also enables physicians to compare patients’ data and treatments easily and to find similar patients in order to propose the optimal therapy. Such similar patient queries (SPQs) are of utmost importance to medical practice and will be relied upon in future health information exchange systems. While privacy-preserving solutions have been previously studied, those are limited to genomic data, ignoring the different newly available types of biomedical data.
In this paper, we propose new cryptographic techniques for finding similar patients in a privacy-preserving manner with various types of biomedical data, including genomic, epigenomic and transcriptomic data as well as their combination. We design protocols for two of the most common similarity metrics in biomedicine: the Euclidean distance and Pearson correlation coefficient. Moreover, unlike previous approaches, we account for the fact that certain locations contribute differently to a given disease or phenotype by allowing to limit the query to the relevant locations and to assign them different weights. Our protocols are specifically designed to be highly efficient in terms of communication and bandwidth, requiring only one or two rounds of communication and thus enabling scalable parallel queries. We rigorously prove our protocols to be secure based on cryptographic games and instantiate our technique with three of the most important types of biomedical data – namely DNA, microRNA expression, and DNA methylation. Our experimental results show that our protocols can compute a similarity query over a typical number of positions against a database of 1,000 patients in a few seconds. Finally, we propose and formalize strategies to mitigate the threat of malicious users or hospitals.
|
25
|
Han GB, Cho DH. Genome classification improvements based on k-mer intervals in sequences. Genomics 2018; 111:1574-1582. [PMID: 30439480 DOI: 10.1016/j.ygeno.2018.11.001] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Received: 08/01/2018] [Revised: 10/13/2018] [Accepted: 11/05/2018] [Indexed: 10/27/2022]
Abstract
Given the vast amount of genomic data, alignment-free sequence comparison methods are required due to their low computational complexity. k-mer based methods can improve comparison accuracy by extracting an effective feature of the genome sequences. The aim of this paper is to extract k-mer intervals of a sequence as a feature of a genome for high comparison accuracy. In the proposed method, we calculated the distance between genome sequences by comparing the distribution of k-mer intervals. Then, we identified the classification results using phylogenetic trees. We used viral, mitochondrial (MT), microbial and mammalian genome sequences to perform classification for various genome sets. We confirmed that the proposed method provides a better classification result than other k-mer based methods. Furthermore, the proposed method could efficiently be applied to long sequences such as human and mouse genomes.
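One plausible reading of "k-mer intervals" is the distance between successive occurrences of the same k-mer within a sequence. The sketch below computes the distribution of those distances; this definition, the function name, and the pooling of all k-mers into a single distribution are assumptions for illustration and may differ from the feature the authors actually use.

```python
from collections import Counter


def kmer_interval_distribution(seq, k):
    """Relative frequency of gaps between consecutive occurrences of the same k-mer."""
    last_seen = {}        # k-mer -> start position of its most recent occurrence
    intervals = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in last_seen:
            intervals[i - last_seen[word]] += 1
        last_seen[word] = i
    total = sum(intervals.values())
    if total == 0:
        return {}         # no k-mer occurs twice
    return {gap: n / total for gap, n in sorted(intervals.items())}


print(kmer_interval_distribution("ACGACGTTACG", 3))
```

Two sequences could then be compared by any distance between their interval distributions, without aligning them.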
Affiliation(s)
- Gyu-Bum Han
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea
- Dong-Ho Cho
- School of Electrical Engineering, Korea Advanced Institute of Science and Technology, Daejeon, South Korea.
|
26
|
Morgenstern B, Schöbel S, Leimeister CA. Phylogeny reconstruction based on the length distribution of k-mismatch common substrings. Algorithms Mol Biol 2017; 12:27. [PMID: 29238399 PMCID: PMC5724348 DOI: 10.1186/s13015-017-0118-8] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Received: 10/04/2017] [Accepted: 11/28/2017] [Indexed: 11/10/2022]
Abstract
BACKGROUND Various approaches to alignment-free sequence comparison are based on the length of exact or inexact word matches between pairs of input sequences. Haubold et al. (J Comput Biol 16:1487-1500, 2009) showed how the average number of substitutions per position between two DNA sequences can be estimated based on the average length of exact common substrings. RESULTS In this paper, we study the length distribution of k-mismatch common substrings between two sequences. We show that the number of substitutions per position can be accurately estimated from the position of a local maximum in the length distribution of their k-mismatch common substrings.
Affiliation(s)
- Burkhard Morgenstern
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Svenja Schöbel
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Chris-André Leimeister
- Department of Bioinformatics, Institute of Microbiology and Genetics, University of Goettingen, Goldschmidtstr. 1, 37077 Göttingen, Germany
|
27
|
Murray KD, Webers C, Ong CS, Borevitz J, Warthmann N. kWIP: The k-mer weighted inner product, a de novo estimator of genetic similarity. PLoS Comput Biol 2017; 13:e1005727. [PMID: 28873405 PMCID: PMC5600398 DOI: 10.1371/journal.pcbi.1005727] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 10/04/2016] [Revised: 09/15/2017] [Accepted: 08/21/2017] [Indexed: 11/18/2022]
Abstract
Modern genomics techniques generate overwhelming quantities of data. Extracting population genetic variation demands computationally efficient methods to determine genetic relatedness between individuals (or "samples") in an unbiased manner, preferably de novo. Rapid estimation of genetic relatedness directly from sequencing data has the potential to overcome reference genome bias, and to verify that individuals belong to the correct genetic lineage before conclusions are drawn using mislabelled, or misidentified samples. We present the k-mer Weighted Inner Product (kWIP), an assembly-, and alignment-free estimator of genetic similarity. kWIP combines a probabilistic data structure with a novel metric, the weighted inner product (WIP), to efficiently calculate pairwise similarity between sequencing runs from their k-mer counts. It produces a distance matrix, which can then be further analysed and visualised. Our method does not require prior knowledge of the underlying genomes and applications include establishing sample identity and detecting mix-up, non-obvious genomic variation, and population structure. We show that kWIP can reconstruct the true relatedness between samples from simulated populations. By re-analysing several published datasets we show that our results are consistent with marker-based analyses. kWIP is written in C++, licensed under the GNU GPL, and is available from https://github.com/kdmurray91/kwip.
Affiliation(s)
- Kevin D. Murray
  - Research School of Biology, The Australian National University, Canberra, Australia
- Christfried Webers
  - Data61, CSIRO, Canberra, Australia
  - Research School of Computer Science, The Australian National University, Canberra, Australia
- Cheng Soon Ong
  - Data61, CSIRO, Canberra, Australia
  - Research School of Computer Science, The Australian National University, Canberra, Australia
- Justin Borevitz
  - Research School of Biology, The Australian National University, Canberra, Australia
- Norman Warthmann
  - Research School of Biology, The Australian National University, Canberra, Australia
|
28
|
Leimeister CA, Sohrabi-Jahromi S, Morgenstern B. Fast and accurate phylogeny reconstruction using filtered spaced-word matches. Bioinformatics 2017; 33:971-979. [PMID: 28073754 PMCID: PMC5409309 DOI: 10.1093/bioinformatics/btw776] [Citation(s) in RCA: 34] [Impact Index Per Article: 4.3] [Received: 08/02/2016] [Accepted: 12/02/2016] [Indexed: 11/13/2022]
Abstract
Motivation: Word-based or ‘alignment-free’ algorithms are increasingly used for phylogeny reconstruction and genome comparison, since they are much faster than traditional approaches that are based on full sequence alignments. Existing alignment-free programs, however, are less accurate than alignment-based methods.
Results: We propose Filtered Spaced Word Matches (FSWM), a fast alignment-free approach to estimate phylogenetic distances between large genomic sequences. For a pre-defined binary pattern of match and don’t-care positions, FSWM rapidly identifies spaced word-matches between input sequences, i.e. gap-free local alignments with matching nucleotides at the match positions and with mismatches allowed at the don’t-care positions. We then estimate the number of nucleotide substitutions per site by considering the nucleotides aligned at the don’t-care positions of the identified spaced-word matches. To reduce the noise from spurious random matches, we use a filtering procedure where we discard all spaced-word matches for which the overall similarity between the aligned segments is below a threshold. We show that our approach can accurately estimate substitution frequencies even for distantly related sequences that cannot be analyzed with existing alignment-free methods; phylogenetic trees constructed with FSWM distances are of high quality. A program run on a pair of eukaryotic genomes of a few hundred Mb each takes a few minutes.
Availability and Implementation: The program source code for FSWM including a documentation, as well as the software that we used to generate artificial genome sequences are freely available at http://fswm.gobics.de/
Supplementary information: Supplementary data are available at Bioinformatics online.
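The procedure described in this abstract can be approximated with a small sketch: index spaced words by their match positions, compare the don't-care positions of each match, filter low-scoring matches, and estimate a distance. This is not the FSWM program. The +1/-1 scoring of don't-care positions, the default threshold, the Jukes-Cantor correction, and the function names are all assumptions made for illustration.

```python
from collections import defaultdict
from math import log


def spaced_words(seq, pattern):
    """Index start positions by the characters found at the pattern's '1' positions."""
    match_pos = [i for i, c in enumerate(pattern) if c == "1"]
    index = defaultdict(list)
    for start in range(len(seq) - len(pattern) + 1):
        word = "".join(seq[start + i] for i in match_pos)
        index[word].append(start)
    return index


def fswm_like_distance(seq1, seq2, pattern, min_score=0):
    """Estimate substitutions per site from don't-care positions of filtered matches."""
    dont_care = [i for i, c in enumerate(pattern) if c == "0"]
    index1 = spaced_words(seq1, pattern)
    index2 = spaced_words(seq2, pattern)
    mismatches = 0
    compared = 0
    for word, starts1 in index1.items():
        for s1 in starts1:
            for s2 in index2.get(word, ()):
                diff = sum(seq1[s1 + i] != seq2[s2 + i] for i in dont_care)
                # Assumed scoring: +1 per matching and -1 per mismatching don't-care position.
                score = (len(dont_care) - diff) - diff
                if score < min_score:
                    continue  # discard likely spurious matches
                mismatches += diff
                compared += len(dont_care)
    if compared == 0:
        return None  # no spaced-word match survived the filter
    p = mismatches / compared
    if p >= 0.75:
        return float("inf")  # correction is undefined at saturation
    # Assumed Jukes-Cantor correction from mismatch fraction to substitutions per site.
    return -0.75 * log(1.0 - 4.0 * p / 3.0)
```

For example, `fswm_like_distance("ACGTTGCA", "ACATTGCA", "1101011")` finds one spaced-word match whose two don't-care positions contain one mismatch, so the uncorrected mismatch fraction is 0.5. Raising `min_score` to 1 discards that match and the function returns `None`.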
Affiliation(s)
- Chris-André Leimeister
  - Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Salma Sohrabi-Jahromi
  - Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
- Burkhard Morgenstern
  - Department of Bioinformatics, University of Göttingen, Institute of Microbiology and Genetics, Goldschmidtstr. 1, 37077 Göttingen, Germany
  - University of Göttingen, Center for Computational Sciences, Goldschmidtstr. 1, 37077 Göttingen, Germany
|
29
|
Noé L. Best hits of 11110110111: model-free selection and parameter-free sensitivity calculation of spaced seeds. Algorithms Mol Biol 2017; 12:1. [PMID: 28289437 PMCID: PMC5310094 DOI: 10.1186/s13015-017-0092-1] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 09/19/2016] [Accepted: 01/30/2017] [Indexed: 12/02/2022]
Abstract
Background: Spaced seeds, also named gapped q-grams, gapped k-mers, spaced q-grams, have been proven to be more sensitive than contiguous seeds (contiguous q-grams, contiguous k-mers) in nucleic and amino-acid sequences analysis. Initially proposed to detect sequence similarities and to anchor sequence alignments, spaced seeds have more recently been applied in several alignment-free related methods. Unfortunately, spaced seeds need to be initially designed. This task is known to be time-consuming due to the number of spaced seed candidates. Moreover, it can be altered by a set of arbitrary chosen parameters from the probabilistic alignment models used. In this general context, Dominant seeds have been introduced by Mak and Benson (Bioinformatics 25:302–308, 2009) on the Bernoulli model, in order to reduce the number of spaced seed candidates that are further processed in a parameter-free calculation of the sensitivity.
Results: We expand the scope of work of Mak and Benson on single and multiple seeds by considering the Hit Integration model of Chung and Park (BMC Bioinform 11:31, 2010), demonstrate that the same dominance definition can be applied, and that a parameter-free study can be performed without any significant additional cost. We also consider two new discrete models, namely the Heaviside and the Dirac models, where lossless seeds can be integrated. From a theoretical standpoint, we establish a generic framework on all the proposed models, by applying a counting semi-ring to quickly compute large polynomial coefficients needed by the dominance filter. From a practical standpoint, we confirm that dominant seeds reduce the set of, either single seeds to thoroughly analyse, or multiple seeds to store. Moreover, in http://bioinfo.cristal.univ-lille.fr/yass/iedera_dominance, we provide a full list of spaced seeds computed on the four aforementioned models, with one (continuous) parameter left free for each model, and with several (discrete) alignment lengths.
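To make the notions of a seed hit and of sensitivity under the Bernoulli model concrete, here is a minimal sketch. It treats an alignment as a string of `1` (match) and `0` (mismatch) columns and computes sensitivity by brute-force enumeration, which is practical only for short alignments. It is not the counting semi-ring method of the paper, and the function names are hypothetical.

```python
from itertools import product


def seed_hits(seed, alignment):
    """True if some window of the alignment has a match at every '1' position of the seed."""
    required = [i for i, c in enumerate(seed) if c == "1"]
    windows = range(len(alignment) - len(seed) + 1)
    return any(all(alignment[s + i] == "1" for i in required) for s in windows)


def sensitivity(seed, length, p):
    """Probability that the seed hits a random alignment of the given length.

    Bernoulli model: every column is a match independently with probability p.
    Enumerates all 2**length alignments, so keep length small.
    """
    total = 0.0
    for columns in product("01", repeat=length):
        if seed_hits(seed, columns):
            matches = columns.count("1")
            total += p ** matches * (1 - p) ** (length - matches)
    return total
```

As a usage example, the alignment `110110` is hit by the spaced seed `1101` but not by the contiguous seed `111` of the same weight. Which of two seeds is more sensitive depends on the alignment length and on `p`, so comparisons should be computed rather than assumed.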
|
30
|
Hahn L, Leimeister CA, Ounit R, Lonardi S, Morgenstern B. rasbhari: Optimizing Spaced Seeds for Database Searching, Read Mapping and Alignment-Free Sequence Comparison. PLoS Comput Biol 2016; 12:e1005107. [PMID: 27760124 PMCID: PMC5070788 DOI: 10.1371/journal.pcbi.1005107] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.4] [Received: 06/16/2016] [Accepted: 08/11/2016] [Indexed: 12/05/2022]
Abstract
Many algorithms for sequence analysis rely on word matching or word statistics. Often, these approaches can be improved if binary patterns representing match and don't-care positions are used as a filter, such that only those positions of words are considered that correspond to the match positions of the patterns. The performance of these approaches, however, depends on the underlying patterns. Herein, we show that the overlap complexity of a pattern set that was introduced by Ilie and Ilie is closely related to the variance of the number of matches between two evolutionarily related sequences with respect to this pattern set. We propose a modified hill-climbing algorithm to optimize pattern sets for database searching, read mapping and alignment-free sequence comparison of nucleic-acid sequences; our implementation of this algorithm is called rasbhari. Depending on the application at hand, rasbhari can either minimize the overlap complexity of pattern sets, maximize their sensitivity in database searching or minimize the variance of the number of pattern-based matches in alignment-free sequence comparison. We show that, for database searching, rasbhari generates pattern sets with slightly higher sensitivity than existing approaches. In our Spaced Words approach to alignment-free sequence comparison, pattern sets calculated with rasbhari led to more accurate estimates of phylogenetic distances than the randomly generated pattern sets that we previously used. Finally, we used rasbhari to generate patterns for short read classification with CLARK-S. Here too, the sensitivity of the results could be improved, compared to the default patterns of the program. We integrated rasbhari into Spaced Words; the source code of rasbhari is freely available at http://rasbhari.gobics.de/.
Affiliation(s)
- Lars Hahn
  - University of Göttingen, Department of Bioinformatics, Göttingen, Germany
- Rachid Ounit
  - University of California, Riverside, Department of Computer Science and Engineering, Riverside, California, United States of America
- Stefano Lonardi
  - University of California, Riverside, Department of Computer Science and Engineering, Riverside, California, United States of America
- Burkhard Morgenstern
  - University of Göttingen, Department of Bioinformatics, Göttingen, Germany
  - University of Göttingen, Center for Computational Sciences, Göttingen, Germany
|
31
|
Břinda K, Sykulski M, Kucherov G. Spaced seeds improve k-mer-based metagenomic classification. Bioinformatics 2015. [DOI: 10.1093/bioinformatics/btv419] [Citation(s) in RCA: 58] [Impact Index Per Article: 5.8] [Indexed: 01/16/2023]
|