1. Marçais G, DeBlasio D, Kingsford C. Sketching Methods with Small Window Guarantee Using Minimum Decycling Sets. J Comput Biol 2024; 31:597-615. PMID: 38980804; PMCID: PMC11304339; DOI: 10.1089/cmb.2024.0544.
Abstract
Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set of the de Bruijn graph, which is a set of unavoidable k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the unavoidable property). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties. 
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
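The unavoidability property described in the abstract can be checked directly for small k: a set of k-mers is a decycling set exactly when the de Bruijn graph restricted to the remaining k-mers is acyclic. The brute-force check below is an illustrative sketch under that definition, not the authors' mdsscope code; the function names are ours.

```python
from itertools import product

def is_decycling_set(kmers, k, alphabet="ACGT"):
    """True iff every long enough sequence over `alphabet` must contain a
    k-mer from `kmers`, i.e. the de Bruijn graph on the remaining k-mers
    is acyclic."""
    remaining = {"".join(p) for p in product(alphabet, repeat=k)} - set(kmers)
    # Iterative DFS with white/gray/black coloring to detect a cycle
    # among the remaining k-mers (edge u -> u[1:] + c).
    color = dict.fromkeys(remaining, 0)  # 0=white, 1=gray, 2=black
    for start in remaining:
        if color[start]:
            continue
        stack = [(start, iter(alphabet))]
        color[start] = 1
        while stack:
            u, succ = stack[-1]
            for c in succ:
                v = u[1:] + c
                if v not in remaining:
                    continue
                if color[v] == 1:  # back edge: a cycle avoids the set
                    return False
                if color[v] == 0:
                    color[v] = 1
                    stack.append((v, iter(alphabet)))
                    break
            else:
                color[u] = 2
                stack.pop()
    return True
```

On the binary alphabet with k = 2, for example, {00, 11, 01} is a decycling set (in fact a minimum one), while {00, 11} is not, because the sequence 010101… avoids it.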
Affiliation(s)
- Guillaume Marçais
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Dan DeBlasio
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
- Ray and Stephanie Lane Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
2. Hoang M, Marçais G, Kingsford C. Density and Conservation Optimization of the Generalized Masked-Minimizer Sketching Scheme. J Comput Biol 2024; 31:2-20. PMID: 37975802; PMCID: PMC10794853; DOI: 10.1089/cmb.2023.0212.
Abstract
Minimizers and syncmers are sketching methods that sample representative k-mer seeds from a long string. The minimizer scheme guarantees a well-spread k-mer sketch (high coverage) while seeking to minimize the sketch size (low density). The syncmer scheme yields sketches that are more robust to base substitutions (high conservation) on random sequences but lack the coverage guarantee of minimizers. These sketching metrics are generally adversarial to one another, especially in the context of sketch optimization for a specific sequence, and are thus difficult to achieve simultaneously. The parameterized syncmer scheme was recently introduced as a generalization of syncmers with more flexible sampling rules and empirically better coverage than the original syncmer variants. However, no approach exists to optimize parameterized syncmers. To address this shortcoming, we introduce a new scheme called masked minimizers that generalizes minimizers in a manner analogous to how parameterized syncmers generalize syncmers and allows us to extend existing optimization techniques developed for minimizers. This results in a practical algorithm to optimize the masked minimizer scheme with respect to both density and conservation. We evaluate the optimization algorithm on various benchmark genomes and show that it finds sketches that are overall more compact, well spread, and robust to substitutions than those found by previous methods. Our implementation is released at https://github.com/Kingsford-Group/maskedminimizer. This new technique will enable more efficient and robust genomic analyses in the many settings where minimizers and syncmers are used.
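The two metrics this abstract trades off can be made concrete for a plain random-order minimizer scheme. The sketch below is our illustration, not the maskedminimizer implementation, and the conservation measure here is a simplified position-overlap proxy rather than the paper's exact definition.

```python
import hashlib

def _stable_hash(s):
    """Deterministic k-mer hash (Python's built-in hash() is salted)."""
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

def minimizer_positions(seq, k, w, hash_fn=_stable_hash):
    """Positions selected by a (w, k)-minimizer: in every window of w
    consecutive k-mers, keep the position with the smallest hash
    (ties broken by leftmost position)."""
    n = len(seq) - k + 1
    picks = set()
    for i in range(n - w + 1):
        picks.add(min(range(i, i + w), key=lambda j: (hash_fn(seq[j:j+k]), j)))
    return picks

def density(seq, k, w, hash_fn=_stable_hash):
    """Selected positions divided by total k-mer positions (lower is better)."""
    return len(minimizer_positions(seq, k, w, hash_fn)) / (len(seq) - k + 1)

def conservation(seq, mutated, k, w, hash_fn=_stable_hash):
    """Simplified conservation proxy: the fraction of selected positions
    that remain selected in a mutated copy (higher is better)."""
    a = minimizer_positions(seq, k, w, hash_fn)
    b = minimizer_positions(mutated, k, w, hash_fn)
    return len(a & b) / len(a) if a else 1.0
```

By construction every window of w consecutive k-mers contains a selected position, which is the coverage guarantee the abstract refers to.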
Affiliation(s)
- Minh Hoang
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Guillaume Marçais
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
3. Marçais G, DeBlasio D, Kingsford C. Sketching methods with small window guarantee using minimum decycling sets. arXiv 2023; arXiv:2311.03592v1. PMID: 37986724; PMCID: PMC10659450.
Abstract
Most sequence sketching methods work by selecting specific k-mers from sequences so that the similarity between two sequences can be estimated using only the sketches. Because estimating sequence similarity is much faster using sketches than using sequence alignment, sketching methods are used to reduce the computational requirements of computational biology software packages. Applications using sketches often rely on properties of the k-mer selection procedure to ensure that using a sketch does not degrade the quality of the results compared with using sequence alignment. Two important examples of such properties are locality and window guarantees, the latter of which ensures that no long region of the sequence goes unrepresented in the sketch. A sketching method with a window guarantee, implicitly or explicitly, corresponds to a decycling set, an unavoidable set of k-mers. Any long enough sequence, by definition, must contain a k-mer from any decycling set (hence, the set is unavoidable). Conversely, a decycling set also defines a sketching method by choosing the k-mers from the set as representatives. Although current methods use one of a small number of sketching method families, the space of decycling sets is much larger and largely unexplored. Finding decycling sets with desirable characteristics (e.g., small remaining path length) is a promising approach to discovering new sketching methods with improved performance (e.g., with small window guarantee). The Minimum Decycling Sets (MDSs) are of particular interest because of their minimum size. Only two algorithms, by Mykkeltveit and Champarnaud, are previously known to generate two particular MDSs, although there are typically a vast number of alternative MDSs. We provide a simple method to enumerate MDSs. This method allows one to explore the space of MDSs and to find MDSs optimized for desirable properties.
We give evidence that the Mykkeltveit sets are close to optimal regarding one particular property, the remaining path length. A number of conjectures and computational and theoretical evidence to support them are presented. Code available at https://github.com/Kingsford-Group/mdsscope.
Affiliation(s)
- Guillaume Marçais
- Computational Biology Department, Carnegie Mellon University, Pittsburgh PA 15213, USA
- Dan DeBlasio
- Computational Biology Department, Carnegie Mellon University, Pittsburgh PA 15213, USA
- Carl Kingsford
- Computational Biology Department, Carnegie Mellon University, Pittsburgh PA 15213, USA
4. Greenberg G, Ravi AN, Shomorony I. LexicHash: sequence similarity estimation via lexicographic comparison of hashes. Bioinformatics 2023; 39:btad652. PMID: 37878809; PMCID: PMC10628434; DOI: 10.1093/bioinformatics/btad652.
Abstract
MOTIVATION Pairwise sequence alignment is a heavy computational burden, particularly in the context of third-generation sequencing technologies. This issue is commonly addressed by approximately estimating sequence similarities using a hash-based method such as MinHash. In MinHash, all k-mers in a read are hashed and the minimum hash value, the min-hash, is stored. Pairwise similarities can then be estimated by counting the number of min-hash matches between a pair of reads, across many distinct hash functions. The choice of the parameter k controls an important tradeoff in the task of identifying alignments: larger k-values give greater confidence in the identification of alignments (high precision) but can lead to many missing alignments (low recall), particularly in the presence of significant noise. RESULTS In this work, we introduce LexicHash, a new similarity estimation method that is effectively independent of the choice of k and attains the high precision of large-k and the high sensitivity of small-k MinHash. LexicHash is a variant of MinHash with a carefully designed hash function. When estimating the similarity between two reads, instead of simply checking whether min-hashes match (as in standard MinHash), one checks how "lexicographically similar" the LexicHash min-hashes are. In our experiments on 40 PacBio datasets, the area under the precision-recall curves obtained by LexicHash had an average improvement of 20.9% over MinHash. Additionally, the LexicHash framework lends itself naturally to an efficient search of the largest alignments, yielding an O(n) time algorithm and circumventing the seemingly fundamental O(n²) scaling associated with pairwise similarity search. AVAILABILITY AND IMPLEMENTATION LexicHash is available on GitHub at https://github.com/gcgreenberg/LexicHash.
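For readers unfamiliar with the baseline, the standard MinHash estimator that LexicHash refines can be sketched in a few lines. This is illustrative only; LexicHash itself compares lexicographic prefixes of specially constructed hashes rather than testing exact min-hash equality.

```python
import hashlib

def kmers(seq, k):
    """The set of k-mers of a read."""
    return {seq[i:i+k] for i in range(len(seq) - k + 1)}

def min_hash(seq, k, seed):
    """Smallest hash value over the read's k-mers, under one seeded hash."""
    def h(x):
        return int(hashlib.sha1(f"{seed}:{x}".encode()).hexdigest(), 16)
    return min(h(km) for km in kmers(seq, k))

def minhash_similarity(a, b, k, n_hashes=100):
    """Fraction of hash functions whose min-hashes agree; an estimator of
    the Jaccard similarity of the two k-mer sets."""
    matches = sum(min_hash(a, k, s) == min_hash(b, k, s) for s in range(n_hashes))
    return matches / n_hashes
```

For each hash function, the min-hashes of two reads agree exactly when the smallest-hash k-mer of the union lies in the intersection, which happens with probability equal to the Jaccard similarity.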
Affiliation(s)
- Grant Greenberg
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Aditya Narayan Ravi
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
- Ilan Shomorony
- Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign, Urbana, IL, United States
5. Chen K, Shao M. Locality-sensitive bucketing functions for the edit distance. Algorithms Mol Biol 2023; 18:7. PMID: 37488600; PMCID: PMC10367278; DOI: 10.1186/s13015-023-00234-2.
Abstract
BACKGROUND Many bioinformatics applications involve bucketing a set of sequences where each sequence is allowed to be assigned into multiple buckets. To achieve both high sensitivity and precision, bucketing methods are desired to assign similar sequences into the same bucket while assigning dissimilar sequences into distinct buckets. Existing k-mer-based bucketing methods have been efficient in processing sequencing data with low error rates, but encounter much reduced sensitivity on data with high error rates. Locality-sensitive hashing (LSH) schemes are able to mitigate this issue through tolerating the edits in similar sequences, but state-of-the-art methods still have large gaps. RESULTS In this paper, we generalize the LSH function by allowing it to hash one sequence into multiple buckets. Formally, a bucketing function, which maps a sequence (of fixed length) into a subset of buckets, is defined to be [Formula: see text]-sensitive if any two sequences within an edit distance of [Formula: see text] are mapped into at least one shared bucket, and any two sequences with distance at least [Formula: see text] are mapped into disjoint subsets of buckets. We construct locality-sensitive bucketing (LSB) functions with a variety of values of [Formula: see text] and analyze their efficiency with respect to the total number of buckets needed as well as the number of buckets that a specific sequence is mapped to. We also prove lower bounds of these two parameters in different settings and show that some of our constructed LSB functions are optimal. CONCLUSION These results lay the theoretical foundations for their practical use in analyzing sequences with high error rates while also providing insights for the hardness of designing ungapped LSH functions.
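A classic special case conveys the idea of mapping one sequence into several buckets: hashing a length-n string to all n of its single-character deletions guarantees that two equal-length strings differing by one substitution share at least one bucket. This is only an illustration of the bucketing concept, not one of the paper's optimal LSB constructions.

```python
def deletion_buckets(s):
    """Bucket ids for a sequence: every string obtained by deleting one
    character. A sequence of length n is hashed into up to n buckets."""
    return {s[:i] + s[i+1:] for i in range(len(s))}

def share_bucket(a, b):
    """True iff the two sequences land in at least one common bucket."""
    return bool(deletion_buckets(a) & deletion_buckets(b))
```

If b differs from a by a single substitution at position i, deleting position i from both yields the same string, so they collide in that bucket; sufficiently dissimilar strings have disjoint deletion sets.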
Affiliation(s)
- Ke Chen
- Department of Computer Science and Engineering, The Pennsylvania State University, State College, United States
- Mingfu Shao
- Department of Computer Science and Engineering, The Pennsylvania State University, State College, United States
- Huck Institutes of the Life Sciences, The Pennsylvania State University, State College, United States
6. Pellow D, Pu L, Ekim B, Kotlar L, Berger B, Shamir R, Orenstein Y. Efficient minimizer orders for large values of k using minimum decycling sets. Genome Res 2023; 33:1154-1161. PMID: 37558282; PMCID: PMC10538483; DOI: 10.1101/gr.277644.123.
Abstract
Minimizers are ubiquitously used in data structures and algorithms for efficient searching, mapping, and indexing of high-throughput DNA sequencing data. Minimizer schemes select a minimum k-mer in every L-long subsequence of the target sequence, where minimality is with respect to a predefined k-mer order. Commonly used minimizer orders select more k-mers than necessary and therefore provide limited improvement in runtime and memory usage of downstream analysis tasks. The recently introduced universal k-mer hitting sets produce minimizer orders with fewer selected k-mers. Generating compact universal k-mer hitting sets is currently infeasible for k > 13, and thus, they cannot help in the many applications that require minimizer orders for larger k. Here, we close the gap of efficient minimizer orders for large values of k by introducing decycling-set-based minimizer orders: new minimizer orders based on minimum decycling sets. We show that in practice these new minimizer orders select a number of k-mers comparable to that of minimizer orders based on universal k-mer hitting sets and can also scale to a larger k. Furthermore, we developed a method that computes the minimizers in a sequence on the fly without keeping the k-mers of a decycling set in memory. This enables the use of these minimizer orders for any value of k. We expect the new orders to improve the runtime and memory usage of algorithms and data structures in high-throughput DNA sequencing analysis.
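A decycling-set-based order can be pictured as a two-level minimizer order: every k-mer inside the decycling set compares smaller than every k-mer outside it. The sketch below is our illustration of that idea with the set held in memory, not the authors' on-the-fly implementation, which avoids storing the set.

```python
import hashlib

def order_key(kmer, decycling_set):
    """Two-level minimizer order: any k-mer inside the decycling set ranks
    before any k-mer outside it; ties broken by a deterministic hash."""
    h = int(hashlib.sha1(kmer.encode()).hexdigest(), 16)
    return (kmer not in decycling_set, h)

def select_minimizers(seq, k, w, decycling_set):
    """Positions picked by a (w, k)-minimizer under the two-level order."""
    picks = set()
    for i in range(len(seq) - k - w + 2):
        picks.add(min(range(i, i + w),
                      key=lambda j: order_key(seq[j:j+k], decycling_set)))
    return sorted(picks)
```

Whenever a window contains a k-mer from the decycling set, a set member is selected, which is what keeps the number of selected k-mers low.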
Affiliation(s)
- David Pellow
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
- Lianrong Pu
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
- Bariş Ekim
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Lior Kotlar
- Department of Computer Science, Ben-Gurion University of the Negev, Beer-Sheva 8410501, Israel
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts 02139, USA
- Ron Shamir
- Blavatnik School of Computer Science, Tel-Aviv University, Tel Aviv 6997801, Israel
- Yaron Orenstein
- Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
- The Mina and Everard Goodman Faculty of Life Sciences, Bar-Ilan University, Ramat-Gan 5290002, Israel
7. Berger B, Yu YW. Navigating bottlenecks and trade-offs in genomic data analysis. Nat Rev Genet 2023; 24:235-250. PMID: 36476810; DOI: 10.1038/s41576-022-00551-z.
Abstract
Genome sequencing and analysis allow researchers to decode the functional information hidden in DNA sequences as well as to study cell-to-cell variation within a cell population. Traditionally, the primary bottleneck in genomic analysis pipelines has been the sequencing itself, which has been much more expensive than the computational analyses that follow. However, an important consequence of the continued drive to expand the throughput of sequencing platforms at lower cost is that the analytical pipelines often struggle to keep up with the sheer amount of raw data produced. Computational cost and efficiency have thus become of ever-increasing importance. Recent methodological advances, such as data sketching, accelerators, and domain-specific libraries/languages, promise to address these modern computational challenges. However, despite being more efficient, these innovations come with a new set of trade-offs, both expected, such as accuracy versus memory and expense versus time, and more subtle, including the human expertise needed to use non-standard programming interfaces and set up complex infrastructure. In this Review, we discuss how to navigate these new methodological advances and their trade-offs.
Affiliation(s)
- Bonnie Berger
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA, USA
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Yun William Yu
- Department of Computer and Mathematical Sciences, University of Toronto Scarborough, Toronto, Ontario, Canada
- Tri-Campus Department of Mathematics, University of Toronto, Toronto, Ontario, Canada
8. Lonsdale A, Halman A, Brown L, Kosasih H, Ekert P, Oshlack A. Toblerone: detecting exon deletion events in cancer using RNA-seq. F1000Res 2023; 12:130. PMID: 37767021; PMCID: PMC10521068; DOI: 10.12688/f1000research.129490.1.
Abstract
Cancer is driven by mutations of the genome that can result in the activation of oncogenes or repression of tumour suppressor genes. In acute lymphoblastic leukemia (ALL), focal deletions in IKAROS family zinc finger 1 (IKZF1) result in the loss of zinc-finger DNA-binding domains and a dominant-negative isoform that is associated with higher rates of relapse and poorer patient outcomes. Clinically, the presence of IKZF1 deletions informs prognosis and treatment options. In this work, we developed a method for detecting exon deletions in genes using RNA-seq, with application to IKZF1. We developed a pipeline that first uses a custom transcriptome reference consisting of transcripts with exon deletions. Next, RNA-seq reads are mapped using a pseudoalignment algorithm to identify reads that uniquely support deletions. These are then evaluated for evidence of the deletion with respect to gene expression and other samples. We applied the algorithm, named Toblerone, to a cohort of 99 paediatric B-ALL samples including validated IKZF1 deletions. Furthermore, we developed a graphical desktop app for non-bioinformatics users that can quickly and easily identify and report deletions in IKZF1 from RNA-seq data with informative graphical outputs.
Affiliation(s)
- Andrew Lonsdale
- Sir Peter MacCallum Department of Oncology, University of Melbourne, Parkville, VIC, 3010, Australia
- Murdoch Children’s Research Institute, Parkville, VIC, 3052, Australia
- Peter MacCallum Cancer Centre, Parkville, VIC, 3052, Australia
- Andreas Halman
- Sir Peter MacCallum Department of Oncology, University of Melbourne, Parkville, VIC, 3010, Australia
- Peter MacCallum Cancer Centre, Parkville, VIC, 3052, Australia
- Lauren Brown
- Murdoch Children’s Research Institute, Parkville, VIC, 3052, Australia
- School of Women’s and Children’s Health, UNSW Sydney, Sydney, NSW, 2052, Australia
- Children's Cancer Institute Australia, Sydney, NSW, 2052, Australia
- Hansen Kosasih
- Murdoch Children’s Research Institute, Parkville, VIC, 3052, Australia
- Paul Ekert
- Sir Peter MacCallum Department of Oncology, University of Melbourne, Parkville, VIC, 3010, Australia
- Murdoch Children’s Research Institute, Parkville, VIC, 3052, Australia
- Peter MacCallum Cancer Centre, Parkville, VIC, 3052, Australia
- School of Women’s and Children’s Health, UNSW Sydney, Sydney, NSW, 2052, Australia
- Children's Cancer Institute Australia, Sydney, NSW, 2052, Australia
- Alicia Oshlack
- Sir Peter MacCallum Department of Oncology, University of Melbourne, Parkville, VIC, 3010, Australia
- Peter MacCallum Cancer Centre, Parkville, VIC, 3052, Australia
- School of Mathematics and Statistics, University of Melbourne, Parkville, VIC, 3010, Australia
9. Hoang M, Zheng H, Kingsford C. Differentiable Learning of Sequence-Specific Minimizer Schemes with DeepMinimizer. J Comput Biol 2022; 29:1288-1304. PMID: 36095142; PMCID: PMC9807081; DOI: 10.1089/cmb.2022.0275.
Abstract
Minimizers are widely used to sample representative k-mers from biological sequences in many applications, such as read mapping and taxonomy prediction. In most scenarios, having the minimizer scheme select as few k-mer positions as possible (i.e., having a low density) is desirable to reduce computation and memory cost. Despite the growing interest in minimizers, learning an effective scheme with optimal density is still an open question, as it requires solving an apparently challenging discrete optimization problem on the permutation space of k-mer orderings. Most existing schemes are designed to work well in expectation over random sequences, which have limited applicability to many practical tools. On the other hand, several methods have been proposed to construct minimizer schemes for a specific target sequence. These methods, however, only approximate the original objective with likewise discrete surrogate tasks that are not able to significantly improve the density performance. This article introduces the first continuous relaxation of the density minimizing objective, DeepMinimizer, which employs a novel Deep Learning twin architecture to simultaneously ensure both validity and performance of the minimizer scheme. Our surrogate objective is fully differentiable and, therefore, amenable to efficient gradient-based optimization using GPU computing. Finally, we demonstrate that DeepMinimizer discovers minimizer schemes that significantly outperform state-of-the-art constructions on human genomic sequences.
Affiliation(s)
- Minh Hoang
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Hongyu Zheng
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Carl Kingsford
- Computer Science Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
- Computational Biology Department, Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
10. Shaw J, Yu YW. Theory of local k-mer selection with applications to long-read alignment. Bioinformatics 2022; 38:4659-4669. PMID: 36124869; PMCID: PMC9563685; DOI: 10.1093/bioinformatics/btab790.
Abstract
MOTIVATION Selecting a subset of k-mers in a string in a local manner is a common task in bioinformatics tools for speeding up computation. Arguably the most well-known and common method is the minimizer technique, which selects the 'lowest-ordered' k-mer in a sliding window. Recently, it has been shown that minimizers may be a sub-optimal method for selecting subsets of k-mers when mutations are present. There is, however, a lack of understanding behind the theory of why certain methods perform well. RESULTS We first theoretically investigate the conservation metric for k-mer selection methods. We derive an exact expression for calculating the conservation of a k-mer selection method. This turns out to be tractable enough for us to prove closed-form expressions for a variety of methods, including (open and closed) syncmers, (a, b, n)-words, and an upper bound for minimizers. As a demonstration of our results, we modified the minimap2 read aligner to use a more conserved k-mer selection method and demonstrate that there is up to an 8.2% relative increase in number of mapped reads. However, we found that the k-mers selected by more conserved methods are also more repetitive, leading to a runtime increase during alignment. We give new insight into how one might use new k-mer selection methods as a reparameterization to optimize for speed and alignment quality. AVAILABILITY AND IMPLEMENTATION Simulations and supplementary methods are available at https://github.com/bluenote-1577/local-kmer-selection-results. os-minimap2 is a modified version of minimap2 and available at https://github.com/bluenote-1577/os-minimap2. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
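Of the selection methods analyzed, open syncmers are the easiest to state: a k-mer is selected based only on its own content, namely whether its minimal s-mer sits at a fixed offset t. The sketch below is our illustration under a hash-based s-mer order; parameter names follow common usage, not this paper's notation.

```python
import hashlib

def _h(x):
    """Deterministic s-mer hash standing in for a random order."""
    return int(hashlib.sha1(x.encode()).hexdigest(), 16)

def is_open_syncmer(kmer, s, t=0):
    """A k-mer is an open syncmer if the smallest of its (k - s + 1)
    s-mers occurs at offset t (leftmost smallest on ties)."""
    smers = [kmer[i:i+s] for i in range(len(kmer) - s + 1)]
    return min(range(len(smers)), key=lambda i: (_h(smers[i]), i)) == t

def syncmer_positions(seq, k, s, t=0):
    """All positions of seq whose k-mer is an open syncmer."""
    return [i for i in range(len(seq) - k + 1)
            if is_open_syncmer(seq[i:i+k], s, t)]
```

Because selection depends only on the k-mer itself, a k-mer selected once is selected at every occurrence; this context-free property is central to the conservation analysis, though it also means syncmers lack the minimizer window guarantee.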
Affiliation(s)
- Jim Shaw
- Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
- Yun William Yu
- Department of Mathematics, University of Toronto, Toronto, ON M5S 2E4, Canada
- Department of Computer and Mathematical Sciences, University of Toronto at Scarborough, Scarborough, ON M1C 1A4, Canada
11. de Sena Brandine G, Smith AD. Fast and memory-efficient mapping of short bisulfite sequencing reads using a two-letter alphabet. NAR Genom Bioinform 2022; 3:lqab115. PMID: 34988438; PMCID: PMC8693577; DOI: 10.1093/nargab/lqab115.
Abstract
DNA cytosine methylation is an important epigenomic mark with a wide range of functions in many organisms. Whole genome bisulfite sequencing is the gold standard to interrogate cytosine methylation genome-wide. Algorithms used to map bisulfite-converted reads often encode the four-base DNA alphabet with three letters by reducing two bases to a common letter. This encoding substantially reduces the entropy of nucleotide frequencies in the resulting reference genome. Within the paradigm of read mapping by first filtering possible candidate alignments, reduced entropy in the sequence space can increase the required computing effort. We introduce another bisulfite mapping algorithm (abismal), based on the idea of encoding a four-letter DNA sequence as only two letters, one for purines and one for pyrimidines. We show that this encoding can lead to greater specificity compared to existing encodings used to map bisulfite sequencing reads. Through the two-letter encoding, the abismal software tool maps reads in less time and using less memory than most bisulfite sequencing read mapping software tools, while attaining similar accuracy. This allows in silico methylation analysis to be performed in a wider range of computing machines with limited hardware settings.
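The core of the two-letter idea is that bisulfite conversion (an unmethylated C read as T) is invisible under a purine/pyrimidine encoding, so converted reads and the reference can be compared in the reduced alphabet. A minimal sketch of the encoding (our illustration, not the abismal code):

```python
def two_letter(seq):
    """Collapse DNA to purines (R: A/G) and pyrimidines (Y: C/T).
    Since C and T map to the same letter, a bisulfite C->T conversion
    does not change the encoded sequence."""
    table = str.maketrans("AGCT", "RRYY")
    return seq.translate(table)
```

With only two symbols the per-position entropy is lower than in the four-letter alphabet, but it is distributed evenly across positions, which is the specificity argument the abstract makes against three-letter encodings.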
Affiliation(s)
- Guilherme de Sena Brandine
- Quantitative and Computational Biology, University of Southern California, 1050 Child's Way, Los Angeles, CA 90007, USA
- Andrew D Smith
- Quantitative and Computational Biology, University of Southern California, 1050 Child's Way, Los Angeles, CA 90007, USA
12. Mining subgraph coverage patterns from graph transactions. International Journal of Data Science and Analytics 2021; 13:105-121. PMID: 34873579; PMCID: PMC8636072; DOI: 10.1007/s41060-021-00292-y.
Abstract
Pattern mining from graph transactional data (GTD) is an active area of research with applications in the domains of bioinformatics, chemical informatics and social networks. Existing works address the problem of mining frequent subgraphs from GTD. However, the knowledge concerning the coverage aspect of a set of subgraphs is also valuable for improving the performance of several applications. In this regard, we introduce the notion of subgraph coverage patterns (SCPs). Given a GTD, a subgraph coverage pattern is a set of subgraphs subject to relative frequency, coverage and overlap constraints provided by the user. We propose the Subgraph ID-based Flat Transactional (SIFT) framework for the efficient extraction of SCPs from a given GTD. Our performance evaluation using three real datasets demonstrates that our proposed SIFT framework is indeed capable of efficiently extracting SCPs from GTD. Furthermore, we demonstrate the effectiveness of SIFT through a case study in computer-aided drug design.
13. Sahlin K. Effective sequence similarity detection with strobemers. Genome Res 2021; 31:2080-2094. PMID: 34667119; PMCID: PMC8559714; DOI: 10.1101/gr.275648.121.
Abstract
k-mer-based methods are widely used in bioinformatics for various types of sequence comparisons. However, a single mutation will mutate k consecutive k-mers and make most k-mer-based applications for sequence comparison sensitive to variable mutation rates. Many techniques have been studied to overcome this sensitivity, for example, spaced k-mers and k-mer permutation techniques, but these techniques do not handle indels well. For indels, pairs or groups of small k-mers are commonly used, but these methods first produce k-mer matches, and only in a second step is a pairing or grouping of k-mers performed. Such techniques produce many redundant k-mer matches owing to the size of k. Here, we propose strobemers as an alternative to k-mers for sequence comparison. Intuitively, strobemers consist of two or more linked shorter k-mers, where the combination of linked k-mers is decided by a hash function. We use simulated data to show that strobemers provide more evenly distributed sequence matches and are less sensitive to different mutation rates than k-mers and spaced k-mers. Strobemers also produce higher match coverage across sequences. We further implement a proof-of-concept sequence-matching tool StrobeMap and use synthetic and biological Oxford Nanopore sequencing data to show the utility of using strobemers for sequence comparison in different contexts, such as sequence clustering and alignment scenarios.
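An order-2 strobemer can be sketched in a few lines. The linking rule below (choose the downstream strobe minimizing a hash of the concatenated pair) is one simple variant for illustration; the paper defines several concrete schemes (minstrobes, randstrobes, hybridstrobes) whose exact hash combinations differ.

```python
import hashlib

def _h(x):
    """Deterministic hash used to link strobes."""
    return int(hashlib.sha1(x.encode()).hexdigest(), 16)

def randstrobe2(seq, i, s, w_min, w_max):
    """Order-2 strobemer anchored at position i: the first strobe is the
    s-mer at i; the second is the s-mer in the downstream window
    [i + w_min, i + w_max] that minimizes a hash of the linked pair."""
    first = seq[i:i+s]
    lo, hi = i + w_min, min(i + w_max, len(seq) - s)
    j = min(range(lo, hi + 1), key=lambda p: _h(first + seq[p:p+s]))
    return first + seq[j:j+s], (i, j)
```

An indel between the two strobes shifts the sampling window by a few positions, but the same second strobe often still minimizes the hash, which is why strobemers tolerate indels better than a single contiguous 2s-mer.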
Affiliation(s)
- Kristoffer Sahlin
- Department of Mathematics, Science for Life Laboratory, Stockholm University, 10691 Stockholm, Sweden
14
Nyström-Persson J, Keeble-Gagnère G, Zawad N. Compact and evenly distributed k-mer binning for genomic sequences. Bioinformatics 2021; 37:2563-2569. [PMID: 33693556 PMCID: PMC8428581 DOI: 10.1093/bioinformatics/btab156] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/13/2020] [Revised: 02/15/2021] [Accepted: 03/03/2021] [Indexed: 11/17/2022] Open
Abstract
MOTIVATION The processing of k-mers (subsequences of length k) is at the foundation of many sequence processing algorithms in bioinformatics, including k-mer counting for genome size estimation, genome assembly, and taxonomic classification for metagenomics. Minimizers (ordered m-mers, where m < k) are often used to group k-mers into bins as a first step in such processing. However, minimizers are known to generate bins of very different sizes, which can pose challenges for distributed and parallel processing, as well as generally increase memory requirements. Furthermore, although various minimizer orderings have been proposed, their practical value for improving tool efficiency has not yet been fully explored. RESULTS We present Discount, a distributed k-mer counting tool based on Apache Spark, which we use to investigate the behaviour of various minimizer orderings in practice when applied to metagenomics data. Using this tool, we then introduce the universal frequency ordering, a new combination of frequency-sampled minimizers and universal k-mer hitting sets, which yields both evenly distributed binning and small bin sizes. We show that this ordering allows Discount to perform distributed k-mer counting on a large dataset in as little as 1/8 of the memory of comparable approaches, making it the most efficient out-of-core distributed k-mer counting method available. AVAILABILITY AND IMPLEMENTATION Discount is GPL licensed and available at https://github.com/jtnystrom/discount. The data underlying this article are available in the article and in its online supplementary material. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
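As a concrete illustration of the binning step described above (not of Discount's actual orderings), the sketch below assigns each k-mer to the bin labelled by its minimizer, using plain lexicographic ordering for simplicity:

```python
from collections import defaultdict

def minimizer(kmer: str, m: int) -> str:
    """Lexicographically smallest m-mer contained in kmer."""
    return min(kmer[i:i + m] for i in range(len(kmer) - m + 1))

def bin_kmers(seq: str, k: int, m: int) -> dict:
    """Group every k-mer of seq under its minimizer (its bin label)."""
    bins = defaultdict(list)
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        bins[minimizer(kmer, m)].append(kmer)
    return dict(bins)
```

Consecutive k-mers often share a minimizer, so k-mers land in bins in runs, which is what makes the partition convenient for distribution. With a lexicographic ordering, bins for low-value minimizers (e.g. AA-heavy m-mers) become disproportionately large; that imbalance is exactly what the universal frequency ordering above is designed to avoid.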
Affiliation(s)
- Johan Nyström-Persson
- JNP Solutions, Yokoami, Sumida-ku, Tokyo 130-0015, Japan
- Department of R&D, Lifematics Inc., Kanda Jinbocho, Chiyoda-ku, Tokyo 101-0051, Japan
- Gabriel Keeble-Gagnère
- Department of Agriculture Victoria Research, AgriBio, Centre for AgriBioscience, Bundoora, VIC 3083, Australia
- Niamat Zawad
- Department of R&D, Lifematics Inc., Kanda Jinbocho, Chiyoda-ku, Tokyo 101-0051, Japan
15
Abstract
Given the popularity and elegance of k-mer-based tools, finding a space-efficient way to represent a set of k-mers is important for improving the scalability of bioinformatics analyses. One popular approach is to convert the set of k-mers into the more compact set of unitigs. We generalize this approach and formulate it as the problem of finding a smallest spectrum-preserving string set (SPSS) representation. We show that this problem is equivalent to finding a smallest path cover in a compacted de Bruijn graph. Using this reduction, we prove a lower bound on the size of the optimal SPSS and propose a greedy method called UST (Unitig-STitch) that results in a smaller representation than unitigs and is nearly optimal with respect to our lower bound. We demonstrate the usefulness of the SPSS formulation with two applications of UST. The first one is a compression algorithm, UST-Compress, which, we show, can store a set of k-mers by using an order-of-magnitude less disk space than other lossless compression tools. The second one is an exact static k-mer membership index, UST-FM, which, we show, improves index size by 10%-44% compared with other state-of-the-art low-memory indices.
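A toy version of the greedy idea can make the SPSS notion concrete. The sketch below repeatedly starts from an unused k-mer and extends it while exactly one unused successor exists; it is a simplification for illustration only (UST works on a compacted de Bruijn graph and handles details, such as reverse complements, that are ignored here):

```python
def greedy_spss(kmers: set) -> list:
    """Toy greedy SPSS: cover all k-mers with overlap-merged strings."""
    k = len(next(iter(kmers)))
    unused = set(kmers)
    strings = []
    while unused:
        s = unused.pop()          # start a new string at any unused k-mer
        while True:
            # k-mers that extend s by one character on the right
            exts = [s[-(k - 1):] + c for c in "ACGT"]
            nxt = [e for e in exts if e in unused]
            if len(nxt) != 1:     # stop at branches or dead ends (toy rule)
                break
            unused.remove(nxt[0])
            s += nxt[0][-1]
        strings.append(s)
    return strings

def spectrum(strings: list, k: int) -> set:
    """The k-mer set represented by a string set."""
    return {s[i:i + k] for s in strings for i in range(len(s) - k + 1)}
```

Whatever strings it emits, their k-mer spectrum equals the input set, which is the defining (spectrum-preserving) property; fewer, longer strings store fewer repeated (k-1)-character overlaps, which is where the space saving comes from.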
Affiliation(s)
- Amatur Rahman
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
- Paul Medvedev
- Department of Computer Science and Engineering, Penn State, University Park, State College, PA, USA
- Department of Biochemistry and Molecular Biology, Penn State, University Park, State College, PA, USA
- Center for Computational Biology and Bioinformatics, Penn State, University Park, State College, PA, USA
16
Edgar R. Syncmers are more sensitive than minimizers for selecting conserved k-mers in biological sequences. PeerJ 2021; 9:e10805. [PMID: 33604186 PMCID: PMC7869670 DOI: 10.7717/peerj.10805] [Citation(s) in RCA: 37] [Impact Index Per Article: 12.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/15/2020] [Accepted: 12/30/2020] [Indexed: 12/19/2022] Open
Abstract
Minimizers are widely used to select subsets of fixed-length substrings (k-mers) from biological sequences in applications ranging from read mapping to taxonomy prediction and indexing of large datasets. The minimizer of a string of w consecutive k-mers is the k-mer with smallest value according to an ordering of all k-mers. Syncmers are defined here as a family of alternative methods which select k-mers by inspecting the position of the smallest-valued substring of length s < k within the k-mer. For example, a closed syncmer is selected if its smallest s-mer is at the start or end of the k-mer. At least one closed syncmer must be found in every window of length (k - s) k-mers. Unlike a minimizer, a syncmer is identified by its sequence alone, and is therefore synchronized in the following sense: if a given k-mer is selected from one sequence, it will also be selected from any other sequence. Also, minimizers can be deleted by mutations in flanking sequence, which cannot happen with syncmers. Experiments on minimizers with parameters used in the minimap2 read mapper and Kraken taxonomy prediction algorithm respectively show that syncmers can simultaneously achieve both lower density and higher conservation compared to minimizers.
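The closed-syncmer rule quoted above is simple enough to state directly in code. The sketch below uses plain lexicographic comparison of s-mers; practical implementations compare hashed values instead:

```python
def smallest_smer_pos(kmer: str, s: int) -> int:
    """Position of the lexicographically smallest s-mer inside kmer."""
    best = 0
    for i in range(1, len(kmer) - s + 1):
        if kmer[i:i + s] < kmer[best:best + s]:
            best = i
    return best

def is_closed_syncmer(kmer: str, s: int) -> bool:
    """Closed syncmer: smallest s-mer at the start or end of the k-mer."""
    pos = smallest_smer_pos(kmer, s)
    return pos == 0 or pos == len(kmer) - s

def select_syncmers(seq: str, k: int, s: int) -> list:
    """(position, k-mer) pairs whose k-mer is a closed syncmer."""
    return [(i, seq[i:i + k]) for i in range(len(seq) - k + 1)
            if is_closed_syncmer(seq[i:i + k], s)]
```

Since the test inspects only the k-mer itself, selection is context-free: a selected k-mer is selected in every sequence containing it, which is the synchronization property the abstract describes and which a window-dependent minimizer lacks.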
17
Orenstein Y. Improved Analysis of High-Throughput Sequencing Data Using Small Universal k-Mer Hitting Sets. Methods Mol Biol 2021; 2243:95-105. [PMID: 33606254 DOI: 10.1007/978-1-0716-1103-6_5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/12/2023]
Abstract
High-throughput sequencing machines can read millions of DNA molecules in parallel in a short time and at a relatively low cost. As a consequence, researchers have access to databases with millions of genomic samples. Searching and analyzing these large amounts of data require efficient algorithms. Universal hitting sets are sets of words that must be present in any long enough string. Using small universal hitting sets, it is possible to increase the efficiency of many high-throughput sequencing data analyses. But generating minimum-size universal hitting sets is a hard problem. In this chapter, we cover our algorithmic developments to produce compact universal hitting sets and some of their potential applications.
18
Frith MC, Noé L, Kucherov G. Minimally-overlapping words for sequence similarity search. Bioinformatics 2020; 36:5344-5350. [PMID: 33346833 PMCID: PMC8016470 DOI: 10.1093/bioinformatics/btaa1054] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/28/2020] [Revised: 11/03/2020] [Accepted: 12/08/2020] [Indexed: 11/16/2022] Open
Abstract
Motivation Analysis of genetic sequences is usually based on finding similar parts of sequences, e.g. DNA reads and/or genomes. For big data, this is typically done via ‘seeds’: simple similarities (e.g. exact matches) that can be found quickly. For huge data, sparse seeding is useful, where we only consider seeds at a subset of positions in a sequence. Results Here, we study a simple sparse-seeding method: using seeds at positions of certain ‘words’ (e.g. ac, at, gc or gt). Sensitivity is maximized by using words with minimal overlaps. That is because, in a random sequence, minimally overlapping words are anti-clumped. We provide evidence that this is often superior to acclaimed ‘minimizer’ sparse-seeding methods. Our approach can be unified with design of inexact (spaced and subset) seeds, further boosting sensitivity. Thus, we present a promising approach to sequence similarity search, with open questions on how to optimize it. Availability and implementation Software to design and test minimally overlapping words is freely available at https://gitlab.com/mcfrith/noverlap. Supplementary information Supplementary data are available at Bioinformatics online.
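The word-based sparse seeding described above reduces to a membership test at each position. In the sketch below, the word set {'ac', 'at', 'gc', 'gt'} is the example from the abstract; choosing words with minimal overlap, which is the paper's actual contribution, is not reproduced here:

```python
def seed_positions(seq: str, words: set) -> list:
    """Positions where a seed is placed: the sequence starts with one
    of the given (equal-length) words at that position."""
    wlen = len(next(iter(words)))  # assumes all words share one length
    return [i for i in range(len(seq) - wlen + 1)
            if seq[i:i + wlen] in words]
```

In a random sequence, minimally overlapping words cannot occur at adjacent positions too often, so the selected positions are anti-clumped, which is the sensitivity argument the abstract makes.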
Affiliation(s)
- Martin C Frith
- Artificial Intelligence Research Center, AIST, Tokyo, Japan
- Graduate School of Frontier Sciences, University of Tokyo, Chiba, Japan
- AIST-Waseda University CBBD-OIL, AIST, Tokyo, Japan
- Laurent Noé
- CRIStAL UMR9189, Université de Lille, Villeneuve d'Ascq, France
- Gregory Kucherov
- LIGM, CNRS, Université Gustave Eiffel, Marne-la-Vallée, France
- Skolkovo Institute of Science and Technology, Moscow, Russia
19
Zaiko A, Wood SA, Pochon X, Biessy L, Laroche O, Croot P, Garcia-Vazquez E. Elucidating Biodiversity Shifts in Ballast Water Tanks during a Cross-Latitudinal Transfer: Complementary Insights from Molecular Analyses. ENVIRONMENTAL SCIENCE & TECHNOLOGY 2020; 54:8443-8454. [PMID: 32436694 DOI: 10.1021/acs.est.0c01931] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/11/2023]
Abstract
In this study, the evolution of ballast water (BW) assemblages across different trophic levels was characterized over a 21 day cross-latitudinal vessel transit using a combination of molecular methods. Triplicate BW samples were collected every second day and size-fractionated (<2.7, 10, >50 μm). Measurements of adenosine triphosphate (ATP) and metabarcoding of environmental nucleic acid (DNA and RNA) analyses, complemented by microscopy and flow cytometry, were performed on each sample. Measured ATP concentrations exhibited high variance between replicates and a strong negative trend in the large (≥50 μm) fraction over the voyage. In concert with microscopy, the metabarcoding data indicated a die-off of larger metazoans during the first week of study and gradual reductions in dinoflagellates and ochrophytes. The ATP and metabarcoding data signaled persistent or increased cellular activity of heterotrophic bacteria and protists in the BW, which was supported by flow cytometry. The metabarcoding showed the presence of active bacteria in all size fractions, suggesting that the sequential filtration approach does not ensure taxonomical differentiation, which has implications for BW quality assessment. Although our data show that ATP and metabarcoding have potential for indicative BW screening for BW compliance monitoring, further research and technological development is needed to improve representativeness of sampling and deliver the unequivocal response criteria required by the international Ballast Water Management Convention.
Affiliation(s)
- Anastasija Zaiko
- Coastal and Freshwater Group, Cawthron Institute, Private Bag 2, Nelson 7042, New Zealand
- Institute of Marine Science, University of Auckland, Private Bag 349, Warkworth 0941, New Zealand
- Marine Research Institute, Klaipeda University, H.Manto 84, 92294 Klaipeda, Lithuania
- Susanna A Wood
- Coastal and Freshwater Group, Cawthron Institute, Private Bag 2, Nelson 7042, New Zealand
- Xavier Pochon
- Coastal and Freshwater Group, Cawthron Institute, Private Bag 2, Nelson 7042, New Zealand
- Institute of Marine Science, University of Auckland, Private Bag 349, Warkworth 0941, New Zealand
- Laura Biessy
- Coastal and Freshwater Group, Cawthron Institute, Private Bag 2, Nelson 7042, New Zealand
- Olivier Laroche
- Benthic Resources, The Norwegian Institute of Marine Research, Nordnesgaten 50, 5005 Bergen, Norway
- Peter Croot
- Irish Centre for Research in Applied Geoscience (iCRAG), Earth and Ocean Sciences, School of Natural Sciences, National University of Ireland Galway, Galway, Ireland
- Eva Garcia-Vazquez
- Department of Functional Biology, University of Oviedo, C/Julian Claveria s/n, 33006 Oviedo, Spain
20
Elworth RAL, Wang Q, Kota PK, Barberan CJ, Coleman B, Balaji A, Gupta G, Baraniuk RG, Shrivastava A, Treangen T. To Petabytes and beyond: recent advances in probabilistic and signal processing algorithms and their application to metagenomics. Nucleic Acids Res 2020; 48:5217-5234. [PMID: 32338745 PMCID: PMC7261164 DOI: 10.1093/nar/gkaa265] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2019] [Revised: 03/20/2020] [Accepted: 04/04/2020] [Indexed: 02/01/2023] Open
Abstract
As computational biologists continue to be inundated by ever increasing amounts of metagenomic data, the need for data analysis approaches that keep up with the pace of sequence archives has remained a challenge. In recent years, the accelerated pace of genomic data availability has been accompanied by the application of a wide array of highly efficient approaches from other fields to the field of metagenomics. For instance, sketching algorithms such as MinHash have seen a rapid and widespread adoption. These techniques handle increasingly large datasets with minimal sacrifices in quality for tasks such as sequence similarity calculations. Here, we briefly review the fundamentals of the most impactful probabilistic and signal processing algorithms. We also highlight more recent advances to augment previous reviews in these areas that have taken a broader approach. We then explore the application of these techniques to metagenomics, discuss their pros and cons, and speculate on their future directions.
Affiliation(s)
- Qi Wang
- Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Houston, TX 77005, USA
- Pavan K Kota
- Department of Bioengineering, Houston, TX 77005, USA
- C J Barberan
- Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Benjamin Coleman
- Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Advait Balaji
- Department of Computer Science, Houston, TX 77005, USA
- Gaurav Gupta
- Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Richard G Baraniuk
- Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Anshumali Shrivastava
- Department of Computer Science, Houston, TX 77005, USA
- Department of Electrical and Computer Engineering, Rice University, Houston, TX 77005, USA
- Todd J Treangen
- Department of Computer Science, Houston, TX 77005, USA
- Systems, Synthetic, and Physical Biology (SSPB) Graduate Program, Houston, TX 77005, USA
21
Ekim B, Berger B, Orenstein Y. A Randomized Parallel Algorithm for Efficiently Finding Near-Optimal Universal Hitting Sets. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2020; 12074:37-53. [PMID: 38835399 PMCID: PMC11148856 DOI: 10.1007/978-3-030-45257-5_3] [Citation(s) in RCA: 8] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
Abstract
As the volume of next-generation sequencing data increases, an urgent need arises for algorithms to process the data efficiently. Universal hitting sets (UHS) were recently introduced as an alternative to the central idea of minimizers in sequence analysis, with the hope that they could more efficiently address common tasks such as computing hash functions for read overlap, sparse suffix arrays, and Bloom filters. A UHS is a set of k-mers that hit every sequence of length L, and can thus serve as indices to L-long sequences. Unfortunately, methods for computing small UHSs are not yet practical for real-world sequencing instances due to their serial and deterministic nature, which leads to long runtimes and high memory demands when handling typical values of k (e.g., k > 13). To address this bottleneck, we present two algorithmic innovations to significantly decrease runtime while keeping memory usage low: (i) we leverage advanced theoretical and architectural techniques to parallelize and decrease memory usage in calculating k-mer hitting numbers; and (ii) we build upon techniques from randomized Set Cover to select universal k-mers much faster. We implemented these innovations in PASHA, the first randomized parallel algorithm for generating near-optimal UHSs, which newly handles k > 13. We demonstrate empirically that PASHA produces sets only slightly larger than those of serial deterministic algorithms; moreover, the set size is provably guaranteed to be within a small constant factor of the optimal size. PASHA improves on the runtime and memory usage of the current best algorithms by orders of magnitude. We expect our newly practical construction of UHSs to be adopted in many high-throughput sequence analysis pipelines.
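The defining property of a UHS, hitting every L-long sequence, can be checked directly. The sketch below verifies the property over the windows of one given sequence, not over all L-long strings, so it is a spot check rather than a proof of universality:

```python
def hits_window(window: str, uhs: set, k: int) -> bool:
    """True if any k-mer of window belongs to the hitting set."""
    return any(window[i:i + k] in uhs for i in range(len(window) - k + 1))

def is_hit_everywhere(seq: str, uhs: set, k: int, L: int) -> bool:
    """True if every L-long window of seq contains a k-mer from uhs."""
    return all(hits_window(seq[i:i + L], uhs, k)
               for i in range(len(seq) - L + 1))
```

For example, over the two-letter alphabet {A, C}, the set {AA, CC, AC} hits every sequence of length 3, since any such sequence either repeats a letter (AA or CC) or alternates through AC.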
Affiliation(s)
- Barış Ekim
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Bonnie Berger
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Department of Mathematics, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Yaron Orenstein
- School of Electrical and Computer Engineering, Ben-Gurion University of the Negev, 8410501 Beer-Sheva, Israel
22
Abstract
Motivation The minimizers technique is a method to sample k-mers that is used in many bioinformatics software tools to reduce computation, memory usage and run time. The number of applications using minimizers keeps growing steadily. Despite its many uses, the theoretical understanding of minimizers is still very limited. In many applications, selecting as few k-mers as possible (i.e. having a low density) is beneficial. The density is highly dependent on the choice of the order on the k-mers. Different applications use different orders, but none of these orders are optimal. A better understanding of minimizers schemes, and the related local and forward schemes, will allow designing schemes with lower density, thereby making existing and future bioinformatics tools even more efficient. Results From the analysis of the asymptotic behavior of minimizers, forward and local schemes, we show that the previously believed lower bound on minimizers schemes does not hold, and that schemes with density lower than thought possible actually exist. The proof is constructive and leads to an efficient algorithm to compare k-mers. These orders are the first known orders that are asymptotically optimal. Additionally, we give improved bounds on the density achievable by the three types of schemes.
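The density discussed above is the expected fraction of k-mer positions a scheme selects. The sketch below measures the empirical density of a plain lexicographic minimizer scheme on a random sequence; a random ordering gives density near 2/(w+1), and the constructions in this paper are designed to go below that:

```python
import random

def minimizer_positions(seq: str, k: int, w: int) -> set:
    """Positions selected by a lexicographic minimizer scheme: in each
    window of w consecutive k-mers, pick the smallest one (leftmost on
    ties)."""
    picked = set()
    n_kmers = len(seq) - k + 1
    for i in range(n_kmers - w + 1):
        j = min(range(i, i + w), key=lambda p: seq[p:p + k])
        picked.add(j)
    return picked

def empirical_density(k: int, w: int, n: int = 5000, seed: int = 0) -> float:
    """Fraction of k-mer positions selected on one random sequence."""
    rng = random.Random(seed)
    seq = "".join(rng.choice("ACGT") for _ in range(n))
    return len(minimizer_positions(seq, k, w)) / (n - k + 1)
```

Any scheme must select at least one position per window, so density is at least 1/w; the gap between that bound and what orderings actually achieve is the subject of the analysis above.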
Affiliation(s)
- Guillaume Marçais
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Dan DeBlasio
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
- Carl Kingsford
- Department of Computational Biology, Carnegie Mellon University, Pittsburgh, PA, USA
23
He W, Ju Y, Zeng X, Liu X, Zou Q. Sc-ncDNAPred: A Sequence-Based Predictor for Identifying Non-coding DNA in Saccharomyces cerevisiae. Front Microbiol 2018; 9:2174. [PMID: 30258427 PMCID: PMC6144933 DOI: 10.3389/fmicb.2018.02174] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2018] [Accepted: 08/24/2018] [Indexed: 12/22/2022] Open
Abstract
With the rapid development of high-speed sequencing technologies and the completion of many whole genome sequencing projects, research in genomics is advancing from genome sequencing to genome synthesis. Synthetic biology technologies such as DNA-based molecular assembly, genome editing, directed evolution, DNA storage, and other cutting-edge technologies are emerging in succession. In particular, the rapid development of DNA assembly technology may greatly advance the creation of artificial life. Meanwhile, DNA assembly technology needs a large number of target sequences of known information as data support. Non-coding DNA (ncDNA) sequences occupy most of an organism's genome, so recognizing them accurately is necessary. Although experimental methods can detect ncDNA sequences, they are expensive to apply genome-wide. Thus, it is necessary to develop machine-learning methods for predicting non-coding DNA sequences. In this study, we collected an ncDNA benchmark dataset of Saccharomyces cerevisiae and report a support vector machine-based predictor, called Sc-ncDNAPred, for predicting ncDNA sequences. The optimal feature extraction strategy was selected from a group including mononucleotide, dimer, trimer, tetramer, pentamer, and hexamer frequencies, using the support vector machine learning method. Sc-ncDNAPred achieved an overall accuracy of 0.98. For the convenience of users, an online web-server has been built at: http://server.malab.cn/Sc_ncDNAPred/index.jsp.
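The feature extraction described above reduces a sequence to k-mer frequency vectors. The sketch below computes such a vector for one choice of k; feeding these vectors to an SVM, and selecting the best k, is the part of the pipeline not shown:

```python
from itertools import product

def kmer_features(seq: str, k: int) -> list:
    """Normalized k-mer frequency vector over a fixed key order
    (all 4**k k-mers in lexicographic order)."""
    keys = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = {key: 0 for key in keys}
    for i in range(len(seq) - k + 1):
        counts[seq[i:i + k]] += 1
    total = max(len(seq) - k + 1, 1)
    return [counts[key] / total for key in keys]
```

Fixing the key order makes vectors from different sequences directly comparable, which is what an SVM (or any vector-space classifier) requires.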
Affiliation(s)
- Wenying He
- School of Computer Science and Technology, Tianjin University, Tianjin, China
- Ying Ju
- School of Information Science and Technology, Xiamen University, Xiamen, China
- Xiangxiang Zeng
- School of Information Science and Technology, Xiamen University, Xiamen, China
- Xiangrong Liu
- School of Information Science and Technology, Xiamen University, Xiamen, China
- Quan Zou
- School of Computer Science and Technology, Tianjin University, Tianjin, China
- Shandong Provincial Key Laboratory of Biophysics, Institute of Biophysics, Dezhou University, Dezhou, China