1
Gustafsson J, Norberg P, Qvick-Wester JR, Schliep A. Fast parallel construction of variable-length Markov chains. BMC Bioinformatics 2021; 22:487. [PMID: 34627154 PMCID: PMC8501649 DOI: 10.1186/s12859-021-04387-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/20/2021] [Accepted: 09/20/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Alignment-free methods are a popular approach for comparing biological sequences, including complete genomes. The methods range from probability distributions of sequence composition to first and higher-order Markov chains, where a k-th order Markov chain over DNA has [Formula: see text] formal parameters. To circumvent this exponential growth in parameters, variable-length Markov chains (VLMCs) have gained popularity for applications in molecular biology and other areas. VLMCs adapt the depth depending on sequence context and thus curtail excesses in the number of parameters. The scarcity of available fast, or even parallel software tools, prompted the development of a parallel implementation using lazy suffix trees and a hash-based alternative. RESULTS An extensive evaluation was performed on genomes ranging from 12Mbp to 22Gbp. Relevant learning parameters were chosen guided by the Bayesian Information Criterion (BIC) to avoid over-fitting. Our implementation greatly improves upon the state-of-the-art even in serial execution. It exhibits very good parallel scaling with speed-ups for long sequences close to the optimum indicated by Amdahl's law of 3 for 4 threads and about 6 for 16 threads, respectively. CONCLUSIONS Our parallel implementation released as open-source under the GPLv3 license provides a practically useful alternative to the state-of-the-art which allows the construction of VLMCs even for very large genomes significantly faster than previously possible. Additionally, our parameter selection based on BIC gives guidance to end-users comparing genomes.
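The speed-ups quoted in the RESULTS can be checked against Amdahl's law. The following is only a worked sketch: the parallel fraction p is not stated in the abstract and is inferred here by assuming that the 4-thread optimum of 3 holds exactly.

```latex
S(n) = \frac{1}{(1-p) + \frac{p}{n}}
\qquad
S(4) = 3 \;\Rightarrow\; 1 - \tfrac{3}{4}p = \tfrac{1}{3} \;\Rightarrow\; p = \tfrac{8}{9}
\qquad
S(16) = \frac{1}{\tfrac{1}{9} + \tfrac{8/9}{16}} = \frac{1}{\tfrac{1}{9} + \tfrac{1}{18}} = 6
```

Under that assumption the two quoted figures are mutually consistent, and the limiting speed-up for arbitrarily many threads would be 1/(1-p) = 9.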
Affiliation(s)
- Joel Gustafsson
- Institute of Biomedicine, Department of Infectious Diseases, University of Gothenburg, Gothenburg, Sweden.
- Peter Norberg
- Institute of Biomedicine, Department of Infectious Diseases, University of Gothenburg, Gothenburg, Sweden
- Jan R Qvick-Wester
- Department of Computer Science and Engineering, University of Gothenburg - Chalmers University of Technology, Gothenburg, Sweden
- Alexander Schliep
- Department of Computer Science and Engineering, University of Gothenburg - Chalmers University of Technology, Gothenburg, Sweden
2
Karimi R, Hajdu A. HTSFinder: Powerful Pipeline of DNA Signature Discovery by Parallel and Distributed Computing. Evol Bioinform Online 2016; 12:73-85. [PMID: 26884678 PMCID: PMC4750899 DOI: 10.4137/ebo.s35545] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/28/2015] [Revised: 11/05/2015] [Accepted: 12/05/2015] [Indexed: 11/06/2022]
Abstract
Comprehensive efforts toward low-cost sequencing in the past few years have led to the growth of complete genome databases. In parallel with these efforts, and driven by a strong need, fast and cost-effective methods and applications have been developed to accelerate sequence analysis. Identification is the very first step of this task. Due to the difficulties, high costs, and computational challenges of alignment-based approaches, an alternative universal identification method is highly required. As an alignment-free approach, DNA signatures have provided new opportunities for the rapid identification of species. In this paper, we present an effective pipeline HTSFinder (high-throughput signature finder) with a corresponding k-mer generator GkmerG (genome k-mers generator). Using this pipeline, we determine the frequency of k-mers from the available complete genome databases for the detection of extensive DNA signatures in a reasonably short time. Our application can detect both unique and common signatures in the arbitrarily selected target and nontarget databases. Hadoop and MapReduce as parallel and distributed computing tools with commodity hardware are used in this pipeline. This approach brings the power of high-performance computing to ordinary desktop personal computers for discovering DNA signatures in large databases such as bacterial genomes. The considerable number of unique and common DNA signatures detected in the target database offers opportunities to improve the identification process not only for polymerase chain reaction and microarray assays but also for more complex scenarios such as metagenomics and next-generation sequencing analysis.
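The idea of unique and common k-mer signatures can be illustrated with a small in-memory sketch. This is not the HTSFinder pipeline (which the abstract says runs on Hadoop and MapReduce); the function names and the exact definitions of "unique" and "common" used below are assumptions made for illustration.

```python
from collections import defaultdict


def kmers(seq, k):
    """Return the set of distinct k-mers in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def find_signatures(targets, nontargets, k):
    """Classify k-mers of the target genomes (assumed definitions).

    unique: occurs in exactly one target genome and in no non-target genome
    common: occurs in every target genome and in no non-target genome
    """
    # Record which target genomes contain each k-mer.
    owners = defaultdict(set)
    for name, seq in targets.items():
        for km in kmers(seq, k):
            owners[km].add(name)

    # Every k-mer seen in any non-target genome disqualifies a signature.
    background = set()
    for seq in nontargets.values():
        background |= kmers(seq, k)

    unique, common = {}, set()
    for km, names in owners.items():
        if km in background:
            continue
        if len(names) == 1:
            unique[km] = next(iter(names))
        if len(names) == len(targets):
            common.add(km)
    return unique, common
```

A call such as `find_signatures({"a": "ACGTAC", "b": "CGTACC"}, {"n": "GTAC"}, 3)` returns a dictionary mapping each unique k-mer to its genome and a set of common k-mers. Holding all k-mers in memory is exactly what limits this approach on large databases, which is the motivation for a distributed implementation.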
Affiliation(s)
- Ramin Karimi
- Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary
- Andras Hajdu
- Faculty of Informatics, Department of Computer Graphics and Image Processing, University of Debrecen, Debrecen, Hungary; Bioinformatics Research Group, University of Debrecen, Debrecen, Hungary
3
Solntsev LA, Starikova VD, Sakharnov NA, Knyazev DI, Utkin OV. Strategy of probe selection for studying mRNAs that participate in receptor-mediated apoptosis signaling. Mol Biol 2015. [DOI: 10.1134/s0026893315030164] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Indexed: 11/23/2022]
4
Lee HP, Sheu TF. An algorithm of discovering signatures from DNA databases on a computer cluster. BMC Bioinformatics 2014; 15:339. [PMID: 25282047 PMCID: PMC4286918 DOI: 10.1186/1471-2105-15-339] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/22/2014] [Accepted: 09/29/2014] [Indexed: 11/18/2022]
Abstract
Background Signatures are short sequences that are unique and not similar to any other sequence in a database, and that can be used as the basis to identify different species. Even though several signature discovery algorithms have been proposed in the past, these algorithms require the entirety of databases to be loaded in the memory, which restricts the amount of data that they can process and makes them unable to handle large databases. Also, those algorithms use sequential models and have slower discovery speeds, meaning that the efficiency can be improved. Results In this research, we are debuting the utilization of a divide-and-conquer strategy in signature discovery and have proposed a parallel signature discovery algorithm on a computer cluster. The algorithm applies the divide-and-conquer strategy to overcome the inability of existing algorithms to process large databases, and uses a parallel computing mechanism to effectively improve the efficiency of signature discovery. Even when run with just the memory of regular personal computers, the algorithm can still process large databases such as the human whole-genome EST database, which could not previously be processed by the existing algorithms. Conclusions The algorithm proposed in this research is not limited by the amount of usable memory and can rapidly find signatures in large databases, making it useful in applications such as Next Generation Sequencing and other large database analysis and processing. The implementation of the proposed algorithm is available at http://www.cs.pu.edu.tw/~fang/DDCSDPrograms/DDCSD.htm.
Affiliation(s)
- Tzu-Fang Sheu
- Department of Computer Science and Communication Engineering, Providence University, 200, Sec. 7, Taiwan Boulevard, 43301 Shalu Dist., Taichung, Taiwan.
5
Zahariev M, Dahl V, Chen W, Lévesque CA. Efficient algorithms for the discovery of DNA oligonucleotide barcodes from sequence databases. Mol Ecol Resour 2013; 9 Suppl s1:58-64. [PMID: 21564965 DOI: 10.1111/j.1755-0998.2009.02651.x] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Indexed: 11/28/2022]
Abstract
Efficient design of barcode oligonucleotides can lead to significant cost reductions in the manufacturing of DNA arrays. Previous methods are based on either a preliminary alignment, which reduces their efficiency for intron-rich regions, or on a brute force approach, not feasible for large-scale problems or on data structures with very poor performance in the worst case. One of the algorithms we propose uses 'oligonucleotide sorting' for the discovery of oligonucleotide barcodes of given sizes, with good asymptotic performance. Specific barcode oligonucleotides with at least one base difference from other sequences in a database are found for each individual sequence. With another algorithm, specific oligonucleotides can also be found for groups or clades in the database, which have 100% homology for all oligonucleotide sequences within the group or clade while having differences with the rest of the data. By re-organizing the sequences/groups in the database, oligonucleotides for different hierarchical levels can be found. The oligonucleotides or polymorphism locations identified as species or clade specific by the new algorithm are refined and screened further for hybridization thermodynamic properties with third party software.
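A sort-based search for sequence-specific oligonucleotides can be sketched as follows. This is a minimal illustration of the general idea of sorting oligonucleotides so that identical ones become adjacent; it is not the authors' algorithm, and the function name and the exact-match notion of specificity are assumptions.

```python
from itertools import groupby


def specific_oligos(db, k):
    """Map each sequence id to the k-mers that occur in no other sequence."""
    # One (oligo, sequence id) pair per occurrence; sorting places
    # identical oligos next to each other.
    pairs = sorted(
        (seq[i:i + k], sid)
        for sid, seq in db.items()
        for i in range(len(seq) - k + 1)
    )
    result = {sid: [] for sid in db}
    # Scan each run of identical oligos and keep those owned by one sequence.
    for oligo, run in groupby(pairs, key=lambda p: p[0]):
        owners = {sid for _, sid in run}
        if len(owners) == 1:
            result[owners.pop()].append(oligo)
    return result
```

For example, `specific_oligos({"s1": "ACGTT", "s2": "CGTTA"}, 3)` keeps only the 3-mers that appear in a single sequence. The cost is dominated by the sort, which is one way to obtain predictable worst-case behaviour without a preliminary alignment.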
Affiliation(s)
- M Zahariev
- School of Computing Science, Simon Fraser University, 8888 University Drive, Burnaby, BC, Canada V5A 1S6, Agriculture & Agri-Food Canada, Ottawa, ON, Canada K1A 0C6, Department of Biology, Carleton University, Ottawa, Ontario, Canada, K1S 5B6
6
Tulpan D, Ghiggi A, Montemanni R. Computational Sequence Design Techniques for DNA Microarray Technologies. Bioinformatics 2013. [DOI: 10.4018/978-1-4666-3604-0.ch048] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
In systems biology and biomedical research, microarray technology is a method of choice that enables the complete quantitative and qualitative ascertainment of gene expression patterns for whole genomes. The selection of high quality oligonucleotide sequences that behave consistently across multiple experiments is a key step in the design, fabrication and experimental performance of DNA microarrays. The aim of this chapter is to outline recent algorithmic developments in microarray probe design, evaluate existing probe sequences used in commercial arrays, and suggest methodologies that have the potential to improve on existing design techniques.
Affiliation(s)
- Dan Tulpan
- National Research Council of Canada, Canada
- Roberto Montemanni
- Istituto Dalle Molle di Studi sull’Intelligenza Artificiale, Switzerland
7
Whole-genome thermodynamic analysis reduces siRNA off-target effects. PLoS One 2013; 8:e58326. [PMID: 23484018 PMCID: PMC3590146 DOI: 10.1371/journal.pone.0058326] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 10/22/2012] [Accepted: 02/01/2013] [Indexed: 11/19/2022]
Abstract
Small interfering RNAs (siRNAs) are important tools for knocking down targeted genes, and have been widely applied to biological and biomedical research. To design siRNAs, two important aspects must be considered: the potency in knocking down target genes and the off-target effect on any nontarget genes. Although many studies have produced useful tools to design potent siRNAs, off-target prevention has mostly been delegated to sequence-level alignment tools such as BLAST. We hypothesize that whole-genome thermodynamic analysis can identify potential off-targets with higher precision and help us avoid siRNAs that may have strong off-target effects. To validate this hypothesis, two siRNA sets were designed to target three human genes IDH1, ITPR2 and TRIM28. They were selected from the output of two popular siRNA design tools, siDirect and siDesign. Both siRNA design tools have incorporated sequence-level screening to avoid off-targets, thus their output is believed to be optimal. However, one of the sets we tested has off-target genes predicted by Picky, a whole-genome thermodynamic analysis tool. Picky can identify off-target genes that may hybridize to a siRNA within a user-specified melting temperature range. Our experiments validated that some off-target genes predicted by Picky can indeed be inhibited by siRNAs. Similar experiments were performed using commercially available siRNAs and a few off-target genes were also found to be inhibited as predicted by Picky. In summary, we demonstrate that whole-genome thermodynamic analysis can identify off-target genes that are missed in sequence-level screening. Because Picky prediction is deterministic according to thermodynamics, if a siRNA candidate has no Picky predicted off-targets, it is unlikely to cause off-target effects. Therefore, we recommend including Picky as an additional screening step in siRNA design.
8
Ilie L, Mohamadi H, Golding GB, Smyth WF. BOND: Basic OligoNucleotide Design. BMC Bioinformatics 2013; 14:69. [PMID: 23444904 PMCID: PMC3648450 DOI: 10.1186/1471-2105-14-69] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.4] [Received: 08/08/2012] [Accepted: 02/21/2013] [Indexed: 11/18/2022]
Abstract
Background DNA microarrays have become ubiquitous in biological and medical research. The most difficult problem that needs to be solved is the design of DNA oligonucleotides that (i) are highly specific, that is, bind only to the intended target, (ii) cover the highest possible number of genes, that is, all genes that allow such unique regions, and (iii) are computed fast. None of the existing programs meet all these criteria. Results We introduce a new approach with our software program BOND (Basic OligoNucleotide Design). According to Kane’s criteria for oligo design, BOND computes highly specific DNA oligonucleotides, for all the genes that admit unique probes, while running orders of magnitude faster than the existing programs. The same approach also enables us to introduce an evaluation procedure that correctly measures the quality of the oligonucleotides. Extensive comparison is performed to prove our claims. BOND is flexible, easy to use, requires no additional software, and is freely available for non-commercial use from http://www.csd.uwo.ca/~ilie/BOND/. Conclusions We provide an improved solution to the important problem of oligonucleotide design, including a thorough evaluation of oligo design programs. We hope BOND will become a useful tool for researchers in biological and medical sciences by making the microarray procedures faster and more accurate.
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, London, ON, Canada.
9
Yadav BS, Ronda V, Vashista DP, Sharma B. Sequencing and computational approaches to identification and characterization of microbial organisms. Biomed Eng Comput Biol 2013; 5:43-9. [PMID: 25288901 PMCID: PMC4147756 DOI: 10.4137/becb.s10886] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 11/05/2022]
Abstract
The recent advances in sequencing technologies and computational approaches are propelling scientists ever closer towards complete understanding of human-microbial interactions. The powerful sequencing platforms are rapidly producing huge amounts of nucleotide sequence data which are compiled into huge databases. This sequence data can be retrieved, assembled, and analyzed for identification of microbial pathogens and diagnosis of diseases. In this article, we present a commentary on how the metagenomics incorporated with microarray and new sequencing techniques are helping microbial detection and characterization.
Affiliation(s)
- Brijesh Singh Yadav
- Department of Botany and Microbiology, H.N.B. Garhwal University, Srinagar, India; Division of Biochemistry, Indian Veterinary Research Institute, Izatnagar, India
- Venkateswarlu Ronda
- Central Institute of Fisheries Technology, Cochin, India; Division of Biochemistry, Indian Veterinary Research Institute, Izatnagar, India
- Dinesh P Vashista
- Department of Botany and Microbiology, H.N.B. Garhwal University, Srinagar, India
- Bhaskar Sharma
- Division of Biochemistry, Indian Veterinary Research Institute, Izatnagar, India
10
Gans JD, Dunbar J, Eichorst SA, Gallegos-Graves LV, Wolinsky M, Kuske CR. A robust PCR primer design platform applied to the detection of Acidobacteria Group 1 in soil. Nucleic Acids Res 2012; 40:e96. [PMID: 22434885 PMCID: PMC3384349 DOI: 10.1093/nar/gks238] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 11/18/2011] [Revised: 01/18/2012] [Accepted: 02/29/2012] [Indexed: 01/17/2023]
Abstract
Environmental biosurveillance and microbial ecology studies use PCR-based assays to detect and quantify microbial taxa and gene sequences within a complex background of microorganisms. However, the fragmentary nature and growing quantity of DNA-sequence data make group-specific assay design challenging. We solved this problem by developing a software platform that enables PCR-assay design at an unprecedented scale. As a demonstration, we developed quantitative PCR assays for a globally widespread, ecologically important bacterial group in soil, Acidobacteria Group 1. A total of 33,684 Acidobacteria 16S rRNA gene sequences were used for assay design. Following 1 week of computation on a 376-core cluster, 83 assays were obtained. We validated the specificity of the top three assays, collectively predicted to detect 42% of the Acidobacteria Group 1 sequences, by PCR amplification and sequencing of DNA from soil. Based on previous analyses of 16S rRNA gene sequencing, Acidobacteria Group 1 species were expected to decrease in response to elevated atmospheric CO2. Quantitative PCR results, using the Acidobacteria Group 1-specific PCR assays, confirmed the expected decrease and provided higher statistical confidence than the 16S rRNA gene-sequencing data. These results demonstrate a powerful capacity to address previously intractable assay design challenges.
Affiliation(s)
- Jason D Gans
- Bioscience Division, Los Alamos National Laboratory, Los Alamos, NM, USA.
11
Tulpan D, Ghiggi A, Montemanni R. Computational Sequence Design Techniques for DNA Microarray Technologies. Systemic Approaches in Bioinformatics and Computational Systems Biology 2011. [DOI: 10.4018/978-1-61350-435-2.ch003] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 01/29/2023]
Abstract
In systems biology and biomedical research, microarray technology is a method of choice that enables the complete quantitative and qualitative ascertainment of gene expression patterns for whole genomes. The selection of high quality oligonucleotide sequences that behave consistently across multiple experiments is a key step in the design, fabrication and experimental performance of DNA microarrays. The aim of this chapter is to outline recent algorithmic developments in microarray probe design, evaluate existing probe sequences used in commercial arrays, and suggest methodologies that have the potential to improve on existing design techniques.
Affiliation(s)
- Dan Tulpan
- National Research Council of Canada, Canada
- Roberto Montemanni
- Istituto Dalle Molle di Studi sull’Intelligenza Artificiale (IDSIA), Switzerland
12
Hafemeister C, Krause R, Schliep A. Selecting oligonucleotide probes for whole-genome tiling arrays with a cross-hybridization potential. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2011; 8:1642-1652. [PMID: 21358006 DOI: 10.1109/tcbb.2011.39] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/30/2023]
Abstract
For designing oligonucleotide tiling arrays, popular current methods still rely on simple criteria like Hamming distance or longest common factors, neglecting base stacking effects which strongly contribute to binding energies. Consequently, probes are often prone to cross-hybridization, which reduces the signal-to-noise ratio and complicates downstream analysis. We propose the first computationally efficient method using hybridization energy to identify specific oligonucleotide probes. Our Cross-Hybridization Potential (CHP) is computed with a Nearest Neighbor Alignment, which efficiently estimates a lower bound for the Gibbs free energy of the duplex formed by two DNA sequences of bounded length. It is derived from our simplified reformulation of t-gap insertion-deletion-like metrics. The computations are accelerated by a filter using weighted ungapped q-grams to arrive at seeds. The computation of the CHP is implemented in our software OSProbes, available under the GPL, which computes sets of viable probe candidates. The user can choose a trade-off between running time and quality of probes selected. We obtain very favorable results in comparison with prior approaches with respect to specificity and sensitivity for cross-hybridization and genome coverage with high-specificity probes. The combination of OSProbes and our Tileomatic method, which computes optimal tiling paths from candidate sets, yields globally optimal tiling arrays, balancing probe distance, hybridization conditions, and uniqueness of hybridization.
Affiliation(s)
- Christoph Hafemeister
- Department of Biology, New York University, 100 Washington Square East, Rm 1009, New York, NY 10003-6688, USA.
13
A rational approach in probe design for nucleic acid-based biosensing. Biosens Bioelectron 2011; 26:4785-90. [DOI: 10.1016/j.bios.2011.06.004] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 03/21/2011] [Revised: 05/16/2011] [Accepted: 06/07/2011] [Indexed: 11/18/2022]
14
Ilie L, Ilie S, Khoshraftar S, Bigvand AM. Seeds for effective oligonucleotide design. BMC Genomics 2011; 12:280. [PMID: 21627845 PMCID: PMC3128067 DOI: 10.1186/1471-2164-12-280] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Received: 01/21/2011] [Accepted: 06/01/2011] [Indexed: 11/10/2022]
Abstract
BACKGROUND DNA oligonucleotides are a very useful tool in biology. The best algorithms for designing good DNA oligonucleotides filter out unsuitable regions using a seeding approach. Determining the quality of the seeds is crucial for the performance of these algorithms. RESULTS We present a sound framework for evaluating the quality of seeds for oligonucleotide design. The F-score is used to measure the accuracy of each seed. A number of natural candidates are tested: contiguous (BLAST-like), spaced, transitions-constrained, and multiple spaced seeds. Multiple spaced seeds are the best, with more seeds providing better accuracy. Single spaced and transition seeds are very close whereas, as expected, contiguous seeds come last. Increased accuracy comes at the price of reduced efficiency. An exception is that single spaced and transitions-constrained seeds are both more accurate and more efficient than contiguous ones. CONCLUSIONS Our work confirms another application where multiple spaced seeds perform the best. It will be useful in improving the algorithms for oligonucleotide design.
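The difference between contiguous and spaced seeds, and the way an F-score summarizes seed accuracy, can be sketched briefly. The abstract does not say which F-score variant is used or how true and false positives are counted, so the balanced F1 below and both function names are assumptions for illustration only.

```python
def seed_hits(a, b, seed):
    """Return True if the seed matches a and b at some shared offset.

    seed is a string over {'1', '0'}: positions marked '1' must agree,
    positions marked '0' are ignored. a and b are equal-length strings
    compared without gaps.
    """
    care = [i for i, c in enumerate(seed) if c == "1"]
    for off in range(len(a) - len(seed) + 1):
        if all(a[off + i] == b[off + i] for i in care):
            return True
    return False


def f_score(tp, fp, fn):
    """Balanced F-score (F1) from true positive, false positive, false negative counts."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)
```

With `a = "ACGTAC"` and `b = "ACCTAC"`, the contiguous seed `"1111"` finds no hit because the single mismatch falls inside every window, while the spaced seed `"1101"` hits at offset 0 by ignoring the mismatching position.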
Affiliation(s)
- Lucian Ilie
- Department of Computer Science, University of Western Ontario, N6A 5B7, London, ON, Canada
- Silvana Ilie
- Department of Mathematics, Ryerson University, M5B 2K3, Toronto, ON, Canada
- Shima Khoshraftar
- Department of Computer Science, University of Western Ontario, N6A 5B7, London, ON, Canada
15
Bader KC, Grothoff C, Meier H. Comprehensive and relaxed search for oligonucleotide signatures in hierarchically clustered sequence datasets. Bioinformatics 2011; 27:1546-54. [PMID: 21471017 DOI: 10.1093/bioinformatics/btr161] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Indexed: 11/13/2022]
Abstract
MOTIVATION PCR, hybridization, DNA sequencing and other important methods in molecular diagnostics rely on both sequence-specific and sequence group-specific oligonucleotide primers and probes. Their design depends on the identification of oligonucleotide signatures in whole genome or marker gene sequences. Although genome and gene databases are generally available and regularly updated, collections of valuable signatures are rare. Even for single requests, the search for signatures becomes computationally expensive when working with large collections of target (and non-target) sequences. Moreover, with growing dataset sizes, the chance of finding exact group-matching signatures decreases, necessitating the application of relaxed search methods. The resultant substantial increase in complexity is exacerbated by the dearth of algorithms able to solve these problems efficiently. RESULTS We have developed CaSSiS, a fast and scalable method for computing comprehensive collections of sequence- and sequence group-specific oligonucleotide signatures from large sets of hierarchically clustered nucleic acid sequence data. Based on the ARB Positional Tree (PT-)Server and a newly developed BGRT data structure, CaSSiS not only determines sequence-specific signatures and perfect group-covering signatures for every node within the cluster (i.e. target groups), but also signatures with maximal group coverage (sensitivity) within a user-defined range of non-target hits (specificity) for groups lacking a perfect common signature. An upper limit of tolerated mismatches within the target group, as well as the minimum number of mismatches with non-target sequences, can be predefined. Test runs with one of the largest phylogenetic gene sequence datasets available indicate good runtime and memory performance, and in silico spot tests have shown the usefulness of the resulting signature sequences as blueprints for group-specific oligonucleotide probes. 
AVAILABILITY Software and Supplementary Material are available at http://cassis.in.tum.de/.
Affiliation(s)
- Kai Christian Bader
- Services Department of Informatics, Technische Universität München, Boltzmannstrasse 3, 85748 Garching, Germany
16
Vijaya Satya R, Kumar K, Zavaljevski N, Reifman J. A high-throughput pipeline for the design of real-time PCR signatures. BMC Bioinformatics 2010; 11:340. [PMID: 20573238 PMCID: PMC2905370 DOI: 10.1186/1471-2105-11-340] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Received: 01/28/2010] [Accepted: 06/23/2010] [Indexed: 11/26/2022]
Abstract
Background Pathogen diagnostic assays based on polymerase chain reaction (PCR) technology provide high sensitivity and specificity. However, the design of these diagnostic assays is computationally intensive, requiring high-throughput methods to identify unique PCR signatures in the presence of an ever increasing availability of sequenced genomes. Results We present the Tool for PCR Signature Identification (TOPSI), a high-performance computing pipeline for the design of PCR-based pathogen diagnostic assays. The TOPSI pipeline efficiently designs PCR signatures common to multiple bacterial genomes by obtaining the shared regions through pairwise alignments between the input genomes. TOPSI successfully designed PCR signatures common to 18 Staphylococcus aureus genomes in less than 14 hours using 98 cores on a high-performance computing system. Conclusions TOPSI is a computationally efficient, fully integrated tool for high-throughput design of PCR signatures common to multiple bacterial genomes. TOPSI is freely available for download at http://www.bhsai.org/downloads/topsi.tar.gz.
Affiliation(s)
- Ravi Vijaya Satya
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA
17
A parallel and incremental algorithm for efficient unique signature discovery on DNA databases. BMC Bioinformatics 2010; 11:132. [PMID: 20230647 PMCID: PMC2848650 DOI: 10.1186/1471-2105-11-132] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Received: 11/18/2009] [Accepted: 03/16/2010] [Indexed: 11/15/2022]
Abstract
Background DNA signatures are distinct short nucleotide sequences that provide valuable information that is used for various purposes, such as the design of Polymerase Chain Reaction primers and microarray experiments. Biologists usually use a discovery algorithm to find unique signatures from DNA databases, and then apply the signatures to microarray experiments. Such discovery algorithms require users to set some input factors, such as signature length l and mismatch tolerance d, which affect the discovery results. However, suggestions about how to select proper factor values are rare, especially when an unfamiliar DNA database is used. In most cases, biologists select factor values based on experience, or even by guessing. If the discovered result is unsatisfactory, biologists change the input factors of the algorithm to obtain a new result. This process is repeated until a proper result is obtained. Implicit signatures under the discovery condition (l, d) are defined as the signatures of length ≤ l with mismatch tolerance ≥ d. A discovery algorithm that could discover all implicit signatures, so that those meeting the requirements on the results can then be selected, would be more helpful than one that depends on trial and error. However, existing discovery algorithms do not address the need to discover all implicit signatures. Results This work proposes two discovery algorithms - the consecutive multiple discovery (CMD) algorithm and the parallel and incremental signature discovery (PISD) algorithm. The PISD algorithm is designed for efficiently discovering signatures under a certain discovery condition. The algorithm finds new results by using previously discovered results as candidates, rather than by using the whole database. The PISD algorithm further increases discovery efficiency by applying parallel computing. The CMD algorithm is designed to discover implicit signatures efficiently: it uses the PISD algorithm as a kernel routine under every feasible discovery condition. Conclusions The proposed algorithms discover implicit signatures efficiently. In experiments on the discovery of implicit signatures, the presented CMD algorithm takes up to 97% less execution time than typical sequential discovery algorithms when eight processing cores are used.
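To make the discovery condition (l, d) concrete, the following is a minimal brute-force sketch, not the PISD or CMD algorithm. It assumes a common reading of the terms that the abstract does not spell out: a signature is an l-mer whose Hamming distance to every other l-mer occurrence in the database is greater than d. The function names are hypothetical.

```python
def hamming(a: str, b: str) -> int:
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))


def unique_signatures(sequences, l, d):
    """Return (sequence index, offset, l-mer) for every l-mer occurrence that
    differs from all other l-mer occurrences in more than d positions.

    Assumed definition for illustration only; quadratic in database size.
    """
    occurrences = [
        (seq_id, pos, seq[pos:pos + l])
        for seq_id, seq in enumerate(sequences)
        for pos in range(len(seq) - l + 1)
    ]
    signatures = []
    for seq_id, pos, kmer in occurrences:
        if all(
            hamming(kmer, other) > d
            for other_id, other_pos, other in occurrences
            if (other_id, other_pos) != (seq_id, pos)
        ):
            signatures.append((seq_id, pos, kmer))
    return signatures


if __name__ == "__main__":
    database = ["AAAA", "AAAT"]
    print(unique_signatures(database, l=4, d=0))  # both 4-mers qualify
    print(unique_signatures(database, l=4, d=1))  # none qualify
```

Under this assumed definition, raising d can only shrink the result set, so the signatures found for a smaller d are a valid candidate pool for a larger d. That is the kind of reuse the abstract describes when it says new results are found from previously discovered ones rather than from the whole database.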
18
|
Yu W, Lee JS, Johnson C, Kim JW, Deaton R. Independent sets of DNA oligonucleotides for nanotechnology applications. IEEE Trans Nanobioscience 2009; 9:38-43. [PMID: 19906601 DOI: 10.1109/tnb.2009.2035446] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Independent sets of DNA oligonucleotides, which only bind with their Watson-Crick complements, have potential use in self-assembly of nanostructures, since they minimize errors and inefficiency from unwanted binding. A software tool implemented a thermodynamic model for DNA duplex formation and was used to generate large independent sets of DNA oligonucleotides. The principle of the approach was experimentally verified on a sample set of oligonucleotides.
Affiliation(s)
- Weixia Yu
- Department of Computer Science and Computer Engineering, University of Arkansas, Fayetteville, AR 72701, USA.
19
|
Abstract
Peptide microarrays (peptide arrays) have increasingly become an important research tool for studying protein detection, profiling, and protein-protein interactions, and they have the potential to foster high-throughput protein analysis as DNA arrays did for genomics research a decade ago. Recently, technologies have emerged that allow flexible synthesis of high-density peptide arrays based on specific application needs (e.g., phosphopeptide microarrays). To fully unleash the power of this promising research tool, significant efforts are required to develop computational and informatics resources that facilitate the experimental design and data analysis for a wide range of peptide array-based applications. The design of peptide arrays is inherently more complex than that of DNA arrays. We herein introduce microPepArray Pro, a Web-based general-purpose peptide array design program. microPepArray Pro features strong content design capabilities and maximized user control. The program suits the needs of a diversity of design tasks, works with a variety of peptide array configurations, and is highly expandable: new functionalities can be developed and added to microPepArray Pro with relative ease.
20
|
Frech C, Breuer K, Ronacher B, Kern T, Sohn C, Gebauer G. hybseek: pathogen primer design tool for diagnostic multi-analyte assays. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2009; 94:152-160. [PMID: 19201047 DOI: 10.1016/j.cmpb.2008.12.007] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 06/18/2008] [Revised: 11/12/2008] [Accepted: 12/17/2008] [Indexed: 05/27/2023]
Abstract
Due to recent advances in genome sequencing, the detection of pathogens by DNA signatures, i.e. by oligonucleotide sequences that uniquely identify a specific genome, is becoming increasingly popular in modern clinical diagnostics. However, currently available screening methods, such as PCR and microarrays, lack multiplexing and sensitivity, respectively. Solid-phase amplification (SPA) is an emerging approach with the potential to overcome these limitations. SPA-based diagnostic assays require both pathogen-specific and compatible primer pairs for many, often closely related pathogens. Currently, none of the available tools supports an automated design of such primer sets, making it an iterative, labor-intensive, and often difficult procedure. Here we describe hybseek, a Web interface for efficient design of both pathogen-specific and compatible primer pairs for DNA-based diagnostic multi-analyte assays. hybseek achieves pathogen-specificity by selecting only candidates with a unique 3′ subsequence, and the degree of this uniqueness is quantitatively expressed by a specificity score. qPCR experimental data confirm the feasibility of our design strategy. The service is freely available at https://www.hybseek.com.
Affiliation(s)
- Christian Frech
- University of Applied Sciences, Softwarepark 11, 4232 Hagenberg, Austria
21
|
Lemoine S, Combes F, Le Crom S. An evaluation of custom microarray applications: the oligonucleotide design challenge. Nucleic Acids Res 2009; 37:1726-39. [PMID: 19208645 PMCID: PMC2665234 DOI: 10.1093/nar/gkp053] [Citation(s) in RCA: 37] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
The increase in feature resolution and the availability of multipack formats from microarray providers have opened the way to various custom genomic applications. However, oligonucleotide design and selection remains a bottleneck of the microarray workflow. Several tools are available to perform this work, and choosing the best one is not an easy task, nor are the choices obvious. Here we review the oligonucleotide design field to help users make their choice. We first performed a comparative evaluation of the available solutions based on a set of criteria including ease of installation, user-friendly access, and the number of parameters and settings available. In a second step, we chose to submit two real cases to a selection of programs. Finally, we used a set of tests for the in silico benchmark of the oligo sets obtained from each type of software. We show that the design software must be selected according to the goal of the scientist, depending on factors such as the organism used, the number of probes required and their localization on the target sequence. The present work provides keys to the choice of the most relevant software, according to the various parameters we tested.
Affiliation(s)
- Sophie Lemoine
- INSERM, CNRS, IFR36, Plate-forme Transcriptome, Paris, France
22
|
Vijaya Satya R, Zavaljevski N, Kumar K, Bode E, Padilla S, Wasieloski L, Geyer J, Reifman J. In silico microarray probe design for diagnosis of multiple pathogens. BMC Genomics 2008; 9:496. [PMID: 18940003 PMCID: PMC2596143 DOI: 10.1186/1471-2164-9-496] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/09/2008] [Accepted: 10/21/2008] [Indexed: 12/05/2022] Open
Abstract
Background With multiple strains of various pathogens being sequenced, it is necessary to develop high-throughput methods that can simultaneously process multiple bacterial or viral genomes to find common fingerprints as well as fingerprints that are unique to each individual genome. We present algorithmic enhancements to an existing single-genome pipeline that allow for efficient design of microarray probes common to groups of target genomes. The enhanced pipeline takes advantage of the similarities in the input genomes to narrow the search to short, nonredundant regions of the target genomes and, thereby, significantly reduces the computation time. The pipeline also computes a three-state hybridization matrix, which gives the expected hybridization of each probe with each target. Results Design of microarray probes for eight pathogenic Burkholderia genomes shows that the multiple-genome pipeline is nearly four times faster than the single-genome pipeline for this application. The probes designed for these eight genomes were experimentally tested with one non-target and three target genomes. Hybridization experiments show that less than 10% of the designed probes cross-hybridize with non-targets. Also, more than 65% of the probes designed to identify all Burkholderia mallei and B. pseudomallei strains successfully hybridize with a B. pseudomallei strain not used for probe design. Conclusion The savings in runtime suggest that the enhanced pipeline can be used to design fingerprints for tens or even hundreds of related genomes in a single run. Hybridization results with an unsequenced B. pseudomallei strain indicate that the designed probes might be useful in identifying unsequenced strains of B. mallei and B. pseudomallei.
Affiliation(s)
- Ravi Vijaya Satya
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, U.S. Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA.
23
|
Gans JD, Wolinsky M. Improved assay-dependent searching of nucleic acid sequence databases. Nucleic Acids Res 2008; 36:e74. [PMID: 18515842 PMCID: PMC2475610 DOI: 10.1093/nar/gkn301] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022] Open
Abstract
Nucleic acid-based biochemical assays are crucial to modern biology. Key applications, such as detection of bacterial, viral and fungal pathogens, require detailed knowledge of assay sensitivity and specificity to obtain reliable results. Improved methods to predict assay performance are needed for exploiting the exponentially growing amount of DNA sequence data and for reducing the experimental effort required to develop robust detection assays. Toward this goal, we present an algorithm for the calculation of sequence similarity based on DNA thermodynamics. In our approach, search queries consist of one to three oligonucleotide sequences representing either a hybridization probe, a pair of Padlock probes or a pair of PCR primers with an optional TaqMan™ probe (i.e. in silico or ‘virtual’ PCR). Matches are reported if the query and target satisfy both the thermodynamics of the assay (binding at a specified hybridization temperature and/or change in free energy) and the relevant biological constraints (assay sequences binding to the correct target duplex strands in the required orientations). The sensitivity and specificity of our method are evaluated by comparing predicted to known sequence tagged sites in the human genome. Free energy is shown to be a more sensitive and specific match criterion than hybridization temperature.
Affiliation(s)
- Jason D Gans
- Biosciences Division, Los Alamos National Laboratory, Los Alamos, NM, USA.
24
|
Vijaya Satya R, Zavaljevski N, Kumar K, Reifman J. A high-throughput pipeline for designing microarray-based pathogen diagnostic assays. BMC Bioinformatics 2008; 9:185. [PMID: 18402679 PMCID: PMC2375140 DOI: 10.1186/1471-2105-9-185] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/21/2007] [Accepted: 04/10/2008] [Indexed: 11/21/2022] Open
Abstract
Background We present a methodology for high-throughput design of oligonucleotide fingerprints for microarray-based pathogen diagnostic assays. The oligonucleotide fingerprints, or DNA microarray probes, are designed for identifying target organisms in environmental or clinical samples. The design process is implemented in a high-performance computing software pipeline that incorporates major algorithmic improvements over a previous version to both reduce computation time and improve specificity assessment. Results The algorithmic improvements result in significant reduction in runtimes, with the updated pipeline being up to nearly five times faster than the previous version. The improvements in specificity assessment, based on multiple specificity criteria, result in robust and consistent evaluation of cross-hybridization with nontarget sequences. In addition, the multiple criteria provide finer control on the number of resulting fingerprints, which helps in obtaining a larger number of fingerprints with high specificity. Simulation tests for Francisella tularensis and Yersinia pestis, using a well-established hybridization model to estimate cross-hybridization with nontarget sequences, show that the improved specificity criteria yield a larger number of fingerprints as compared to using a single specificity criterion. Conclusion The faster runtimes, achieved as the result of algorithmic improvements, are critical for extending the pipeline to process multiple target genomes. The larger numbers of identified fingerprints, obtained by considering broader specificity criteria, are essential for designing probes for hard-to-distinguish target sequences.
Affiliation(s)
- Ravi Vijaya Satya
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Fort Detrick, MD 21702, USA.
25
|
Christen R. Global Sequencing: A Review of Current Molecular Data and New Methods Available to Assess Microbial Diversity. Microbes Environ 2008; 23:253-68. [DOI: 10.1264/jsme2.me08525] [Citation(s) in RCA: 61] [Impact Index Per Article: 3.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Affiliation(s)
- Richard Christen
- Université de Nice et CNRS UMR 6543, Laboratoire de Biologie Virtuelle, Centre de Biochimie, Parc Valrose, Faculté des Sciences
26
|
Affiliation(s)
- Audrey Sassolas
- Laboratoire de Génie Enzymatique et Biomoléculaire, Institut de Chimie et Biochimie Moléculaires et Supramoléculaires, 43 Boulevard du 11 Novembre 1918, Villeurbanne F-69622, France, UMR5246, Centre National de La Recherche Scientifique, Villeurbanne F-69622, France, Université de Lyon, Lyon F-69622, France, Université Lyon 1, Lyon F-69622, France, Institut National des Sciences Appliquées de Lyon, École d'Ingénieurs, Villeurbanne F-69621, France, and École Supérieure Chimie Physique Électronique de Lyon,
- Béatrice D. Leca-Bouvier
- Laboratoire de Génie Enzymatique et Biomoléculaire, Institut de Chimie et Biochimie Moléculaires et Supramoléculaires, 43 Boulevard du 11 Novembre 1918, Villeurbanne F-69622, France, UMR5246, Centre National de La Recherche Scientifique, Villeurbanne F-69622, France, Université de Lyon, Lyon F-69622, France, Université Lyon 1, Lyon F-69622, France, Institut National des Sciences Appliquées de Lyon, École d'Ingénieurs, Villeurbanne F-69621, France, and École Supérieure Chimie Physique Électronique de Lyon,
- Loïc J. Blum
- Laboratoire de Génie Enzymatique et Biomoléculaire, Institut de Chimie et Biochimie Moléculaires et Supramoléculaires, 43 Boulevard du 11 Novembre 1918, Villeurbanne F-69622, France, UMR5246, Centre National de La Recherche Scientifique, Villeurbanne F-69622, France, Université de Lyon, Lyon F-69622, France, Université Lyon 1, Lyon F-69622, France, Institut National des Sciences Appliquées de Lyon, École d'Ingénieurs, Villeurbanne F-69621, France, and École Supérieure Chimie Physique Électronique de Lyon,
27
|
Abstract
Single-nucleotide polymorphism (SNP) genotyping can be carried out by annealing an oligonucleotide primer directly adjacent to the polymorphism and carrying out a single base extension using a polymerase reaction with labeled dideoxynucleotide triphosphates. This can be multiplexed by attaching a unique tag at the 5'-end of each oligonucleotide primer and binding the corresponding antitag to a DNA microarray or microbead. After the polymerase reaction, the tag-antitag system can be used to demultiplex the experiment. However, such an assay requires careful primer and tag design to avoid any crossreactivity among the primers, tags, antitags, and template sequence. A procedure for designing the primers is described in this chapter.
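The tag-antitag relationship and a very crude cross-reactivity screen can be sketched as follows. This is an illustration only: it assumes that an antitag is the reverse complement of its tag and that a perfectly complementary stretch of k bases is enough to flag a pair. The chapter's actual design procedure is not given in the abstract, a realistic screen would use a thermodynamic model, and the function names and the threshold k are hypothetical.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Watson-Crick reverse complement of a DNA string over A, C, G, T."""
    return seq.translate(COMPLEMENT)[::-1]


def can_pair(a: str, b: str, k: int) -> bool:
    """True if some k-base stretch of `a` is perfectly complementary to a
    stretch of `b` (both written 5'->3')."""
    return any(
        reverse_complement(a[i:i + k]) in b
        for i in range(len(a) - k + 1)
    )


def cross_reactive_pairs(tags, k):
    """Return (i, j), i != j, where tag i could bind the antitag of tag j."""
    antitags = [reverse_complement(t) for t in tags]
    return [
        (i, j)
        for i, tag in enumerate(tags)
        for j, antitag in enumerate(antitags)
        if i != j and can_pair(tag, antitag, k)
    ]


if __name__ == "__main__":
    tags = ["AAAAAAAA", "CCCCCCCC", "AAAACCCC"]
    # The third tag shares 4-base stretches with both of the others.
    print(cross_reactive_pairs(tags, k=4))
```

The same `can_pair` check could be run between tags and primers or between primers and the template, which are the other interactions the abstract lists as needing care.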
28
|
Paredes CJ, Senger RS, Spath IS, Borden JR, Sillers R, Papoutsakis ET. A general framework for designing and validating oligomer-based DNA microarrays and its application to Clostridium acetobutylicum. Appl Environ Microbiol 2007; 73:4631-8. [PMID: 17526797 PMCID: PMC1932840 DOI: 10.1128/aem.00144-07] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/19/2007] [Accepted: 05/15/2007] [Indexed: 11/20/2022] Open
Abstract
While DNA microarray analysis is widely accepted as an essential tool for modern biology, its use still eludes many researchers for several reasons, especially when microarrays are not commercially available. In that case, the design, construction, and use of microarrays for a sequenced organism constitute substantial, time-consuming, and expensive tasks. Recently, it has become possible to construct custom microarrays using industrial manufacturing processes, which offer several advantages, including speed of manufacturing, quality control, no up-front setup costs, and need-based microarray ordering. Here, we describe a strategy for designing and validating DNA microarrays manufactured using a commercial process. The 22K microarrays for the solvent producer Clostridium acetobutylicum ATCC 824 are based on in situ-synthesized 60-mers employing the Agilent technology. The strategy involves designing a large library of possible oligomer probes for each target (i.e., gene or DNA sequence) and experimentally testing and selecting the best probes for each target. The degenerate C. acetobutylicum strain M5 lacking the pSOL1 megaplasmid (with 178 annotated open reading frames [genes]) was used to estimate the level of probe cross-hybridization in the new microarrays and to establish the minimum intensity for a gene to be considered expressed. Results obtained using this microarray design were consistent with previously reported results from spotted cDNA-based microarrays. The proposed strategy is applicable to any sequenced organism.
Affiliation(s)
- Carlos J Paredes
- Department of Chemical and Biological Engineering, Northwestern University, Evanston, IL 60208, USA
29
|
Gasieniec L, Li CY, Sant P, Wong PWH. Randomized probe selection algorithm for microarray design. J Theor Biol 2007; 248:512-21. [PMID: 17628606 DOI: 10.1016/j.jtbi.2007.05.036] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/28/2007] [Revised: 05/11/2007] [Accepted: 05/29/2007] [Indexed: 11/18/2022]
Abstract
DNA microarray technology, originally developed to measure the level of gene expression, has become one of the most widely used tools in genomic study. The crux of microarray design lies in how to select a unique probe that distinguishes a given genomic sequence from other sequences. Due to its significance, probe selection attracts a lot of attention. Various probe selection algorithms have been developed in recent years. Good probe selection algorithms should produce a small number of candidate probes. Efficiency is also crucial because the data involved are usually huge. Most existing algorithms are not sufficiently selective, and quite a large number of probes are returned. We propose a new direction to tackle the problem and give an efficient algorithm based on randomization to select a small set of probes and demonstrate that such a small set of probes is sufficient to distinguish each sequence from all the other sequences. Based on the algorithm, we have developed probe selection software RandPS, which runs efficiently in practice. The software is available on our website (http://www.csc.liv.ac.uk/~cindy/RandPS/RandPS.htm). We test our algorithm via experiments on different genomes (Escherichia coli, Saccharomyces cerevisiae, etc.) and our algorithm is able to output unique probes for most of the genes efficiently. The other genes can be identified by a combination of at most two probes.
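The general idea of picking probes by randomization can be illustrated with a small sketch. It is not the RandPS algorithm, whose details the abstract does not give: it simply samples random windows of one sequence until it finds one that occurs in no other sequence. The function name, the `max_tries` cap, and the fixed seed are assumptions made for the example.

```python
import random


def pick_unique_probe(sequences, target_index, probe_len, max_tries=100, seed=0):
    """Sample random windows of the target sequence and return the first one
    that is not a substring of any other sequence, or None if none is found."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    target = sequences[target_index]
    others = [s for i, s in enumerate(sequences) if i != target_index]
    n_starts = len(target) - probe_len + 1
    if n_starts <= 0:
        return None  # target is shorter than the requested probe length
    for _ in range(max_tries):
        start = rng.randrange(n_starts)
        probe = target[start:start + probe_len]
        if all(probe not in other for other in others):
            return probe
    return None


if __name__ == "__main__":
    genes = ["AAAACGTAAAA", "AAAATTTAAAA"]
    print(pick_unique_probe(genes, target_index=0, probe_len=5))
```

When no single window is unique, this sketch returns `None`. The abstract reports that such genes can instead be identified by a combination of at most two probes, which this example does not attempt.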
Affiliation(s)
- Leszek Gasieniec
- Department of Computer Science, The University of Liverpool, Ashton Building, Ashton Street, Liverpool, L69 3BX, UK.
30
|
Genome-wide identification of specific oligonucleotides using artificial neural network and computational genomic analysis. BMC Bioinformatics 2007; 8:164. [PMID: 17518996 PMCID: PMC1892811 DOI: 10.1186/1471-2105-8-164] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2006] [Accepted: 05/22/2007] [Indexed: 11/26/2022] Open
Abstract
Background Genome-wide identification of specific oligonucleotides (oligos) is a computationally-intensive task and is a requirement for designing microarray probes, primers, and siRNAs. An artificial neural network (ANN) is a machine learning technique that can effectively process complex and high-noise data. Here, ANNs are applied to process the unique subsequence distribution for prediction of specific oligos. Results We present a novel and efficient algorithm, named the integration of ANN and BLAST (IAB) algorithm, to identify specific oligos. We establish the unique marker database for human and rat gene index databases using the hash table algorithm. We then create the input vectors, via the unique marker database, to train and test the ANN. The trained ANN predicted the specific oligos with high efficiency, and these oligos were subsequently verified by BLAST. To improve the prediction performance, the ANN over-fitting issue was avoided by early stopping with the best observed error and a k-fold validation was also applied. The performance of the IAB algorithm was about 5.2, 7.1, and 6.7 times faster than the BLAST search without ANN for experimental results of 70-mer, 50-mer, and 25-mer specific oligos, respectively. In addition, the results of polymerase chain reactions showed that the primers predicted by the IAB algorithm could specifically amplify the corresponding genes. The IAB algorithm has been integrated into a previously published comprehensive web server to support microarray analysis and genome-wide iterative enrichment analysis, through which users can identify a group of desired genes and then discover the specific oligos of these genes. Conclusion The IAB algorithm has been developed to construct SpecificDB, a web server that provides a specific and valid oligo database of the probe, siRNA, and primer design for the human genome. We also demonstrate the ability of the IAB algorithm to predict specific oligos through polymerase chain reaction experiments. SpecificDB provides comprehensive information and a user-friendly interface.
31
|
Phillippy AM, Mason JA, Ayanbule K, Sommer DD, Taviani E, Huq A, Colwell RR, Knight IT, Salzberg SL. Comprehensive DNA signature discovery and validation. PLoS Comput Biol 2007; 3:e98. [PMID: 17511514 PMCID: PMC1868776 DOI: 10.1371/journal.pcbi.0030098] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/05/2007] [Accepted: 04/18/2007] [Indexed: 11/27/2022] Open
Abstract
DNA signatures are nucleotide sequences that can be used to detect the presence of an organism and to distinguish that organism from all other species. Here we describe Insignia, a new, comprehensive system for the rapid identification of signatures in the genomes of bacteria and viruses. With the availability of hundreds of complete bacterial and viral genome sequences, it is now possible to use computational methods to identify signature sequences in all of these species, and to use these signatures as the basis for diagnostic assays to detect and genotype microbes in both environmental and clinical samples. The success of such assays critically depends on the methods used to identify signatures that properly differentiate between the target genomes and the sample background. We have used Insignia to compute accurate signatures for most bacterial genomes and made them available through our Web site. A sample of these signatures has been successfully tested on a set of 46 Vibrio cholerae strains, and the results indicate that the signatures are highly sensitive for detection as well as specific for discrimination between these strains and their near relatives. Our approach, whereby the entire genomic complement of organisms is compared to identify probe targets, is a promising method for diagnostic assay development, and it provides assay designers with the flexibility to choose probes from the most relevant genes or genomic regions. The Insignia system is freely accessible via a Web interface and has been released as open source software at: http://insignia.cbcb.umd.edu.
Now that the genome sequences of hundreds of bacteria and viruses are known, we can design tests that will rapidly detect the presence of these species based solely on their DNA. Such tests have a wide range of applications, from diagnosing infections to detecting harmful microbes in a water supply. These tests can detect a pathogen in a complex mixture of organic material by recognizing short, distinguishing sequences, called DNA signatures, that occur in the pathogen and not in any other species. We present Insignia, a new computational system that identifies DNA signatures of any length in bacterial and viral genomes. Insignia uses highly efficient algorithms to compare sequenced bacterial and viral genomes against each other and to additional background genomes including plants, animals, and human. These comparisons are stored in a database and used to rapidly compute signatures for any particular target species. To maximize its utility for the community, we have made Insignia available as free, open-source software and as a Web application. We have also validated 50 Insignia-designed assays on a panel of 46 strains of Vibrio cholerae, and our results show that the signatures are both sensitive and specific.
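The notion of a signature as a sequence that occurs in the target and not in the background can be illustrated with a naive k-mer set computation. This is a sketch of the concept only and makes no claim about how Insignia is implemented; the function names are hypothetical, and exact matching of fixed-length k-mers is a simplifying assumption.

```python
def kmers(seq: str, k: int) -> set:
    """All distinct substrings of length k in seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def signature_kmers(targets, background, k):
    """k-mers present in every target genome and absent from every
    background genome. `targets` must be non-empty."""
    shared = set.intersection(*(kmers(t, k) for t in targets))
    seen_in_background = set().union(*(kmers(b, k) for b in background))
    return shared - seen_in_background


if __name__ == "__main__":
    targets = ["ACGTAC", "TACGTA"]      # toy strains of the target species
    background = ["CGTACC"]             # toy near relative or host sequence
    print(signature_kmers(targets, background, k=3))  # {'ACG'}
```

Requiring a k-mer to occur in every target corresponds to sensitivity across strains, and removing anything seen in the background corresponds to specificity, the two properties the validation on Vibrio cholerae strains is said to confirm.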
Affiliation(s)
- Adam M Phillippy
- Center for Bioinformatics and Computational Biology, University of Maryland, College Park, Maryland, United States of America.
32
|
Feng S, Tillier ERM. A fast and flexible approach to oligonucleotide probe design for genomes and gene families. Bioinformatics 2007; 23:1195-202. [PMID: 17392329 DOI: 10.1093/bioinformatics/btm114] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION With hundreds of completely sequenced microbial genomes available, and advancements in DNA microarray technology, the detection of genes in microbial communities consisting of hundreds of thousands of sequences may be possible. The existing strategies developed for DNA probe design, geared toward identifying specific sequences, are not suitable due to the lack of coverage, flexibility and efficiency necessary for applications in metagenomics. METHODS ProDesign is a tool developed for the selection of oligonucleotide probes to detect members of gene families present in environmental samples. Gene family-specific probe sequences are generated based on specific and shared words, which are found with the spaced seed hashing algorithm. To detect more sequences, those sharing some common words are re-clustered into new families, then probes specific for the new families are generated. RESULTS The program is very flexible in that it can be used for designing probes for detecting many gene families simultaneously and specifically in one or more genomes. Neither the length nor the melting temperature of the probes needs to be predefined. We have found that ProDesign provides more flexibility, coverage and speed than other software programs used in the selection of probes for genomic and gene family arrays. AVAILABILITY ProDesign is licensed free of charge to academic users. ProDesign and Supplementary Material can be obtained by contacting the authors. A web server for ProDesign is available at http://www.uhnresearch.ca/labs/tillier/ProDesign/ProDesign.html. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
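Spaced seed hashing, as generally understood, indexes each window of a sequence by only the positions marked "1" in a seed pattern, so windows that differ at a "0" position hash to the same word. The sketch below shows that idea and how shared and specific words fall out of the index. The seed pattern, the function names, and the use of a plain dictionary are assumptions for illustration, not ProDesign's implementation.

```python
from collections import defaultdict


def spaced_words(seq: str, seed: str):
    """Yield (offset, word) for each window of len(seed), keeping only the
    characters at positions where the seed has '1'."""
    keep = [i for i, flag in enumerate(seed) if flag == "1"]
    for pos in range(len(seq) - len(seed) + 1):
        yield pos, "".join(seq[pos + i] for i in keep)


def index_by_word(sequences, seed: str):
    """Map each spaced word to the set of sequence indices containing it."""
    index = defaultdict(set)
    for seq_id, seq in enumerate(sequences):
        for _, word in spaced_words(seq, seed):
            index[word].add(seq_id)
    return index


def split_shared_and_specific(index):
    """Separate words seen in several sequences from words seen in one."""
    shared = {w for w, ids in index.items() if len(ids) > 1}
    specific = {w: next(iter(ids)) for w, ids in index.items() if len(ids) == 1}
    return shared, specific


if __name__ == "__main__":
    family = ["ACGTA", "ATGTA"]  # differ only at the second base
    shared, specific = split_shared_and_specific(index_by_word(family, "101"))
    print(sorted(shared))    # ['AG', 'GA']
    print(specific)          # word -> index of the only sequence containing it
```

In the toy example the windows ACG and ATG both yield the word AG because the mismatch falls on the ignored middle position, which is what lets related family members share words despite point differences.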
Affiliation(s)
- Shengzhong Feng
- Institute of Computing Technology, Chinese Academy of Sciences, China
33
|
Atlas M, Hundewale N, Perelygina L, Zelikovsky A. Consolidating software tools for DNA microarray design and manufacturing. CONFERENCE PROCEEDINGS : ... ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL CONFERENCE 2007; 2006:172-5. [PMID: 17271633 DOI: 10.1109/iembs.2004.1403119] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/13/2023]
Abstract
As the human genome project progresses and some microbial and eukaryotic genomes are recognized, a novel technology, DNA microarray (also called gene chip, biochip, gene microarray, and DNA chip) technology, has recently attracted an increasing number of biologists, bioengineers and computer scientists. This technology promises to monitor the whole genome at once, so that researchers can study the whole genome on the global level and have a better picture of the expressions among millions of genes simultaneously. Today, it is widely used in many fields - disease diagnosis, gene classification, gene regulatory network, and drug discovery. We present a concatenated software solution for the entire DNA array flow exploring all steps of a consolidated software tool. The proposed software tool has been tested on Herpes B virus as well as simulated data. Our experiments show that the genomic data follow the pattern predicted by simulated data although the number of border conflicts (quality of the DNA array design) is several times smaller than for simulated data. We also report a trade-off between the number of border conflicts and the running time for several proposed algorithmic techniques employed in the physical design of DNA arrays.
Affiliation(s)
- M Atlas
- Dept. of Comput. Sci., Georgia State Univ., Atlanta, GA, USA
34
|
Lin FM, Huang HD, Chang YC, Tsou AP, Chan PL, Wu LC, Tsai MF, Horng JT. Database to dynamically aid probe design for virus identification. IEEE Trans Inf Technol Biomed 2006; 10:705-13. [PMID: 17044404 DOI: 10.1109/titb.2006.874202] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/06/2022]
Abstract
Viral infection poses a major problem for public health, horticulture, and animal husbandry, possibly causing severe health crises and economic losses. Viral infections can be identified by the specific detection of viral sequences in many ways. The microarray approach not only tolerates sequence variations of newly evolved virus strains, but can also simultaneously diagnose many viral sequences. Many chips have so far been designed for clinical use. Most are designed for special purposes, such as typing enterovirus infection, and compare fewer than 30 different viral sequences. None considers primer design, increasing the likelihood of cross hybridization to similar sequences from other viruses. To prevent this possibility, this work establishes a platform and database that provides users with specific probes of all known viral genome sequences to facilitate the design of diagnostic chips. This work develops a system for designing probes online. A user can select any number of different viruses and set the experimental conditions such as melting temperature and length of probe. The system then returns the optimal sequences from the database. We have also developed a heuristic algorithm to calculate the probe correctness and show the correctness of the algorithm. (The system that supports probe design for identifying viruses has been published on our web page http://bioinfo.csie.ncu.edu.tw/.)
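Selecting probes by user-specified length and melting temperature can be sketched as a simple filter. The abstract does not say how the system estimates melting temperature, so this example swaps in the Wallace rule, Tm ≈ 2·(A+T) + 4·(G+C) °C, a rough approximation usually applied only to short oligonucleotides. The function names and thresholds are hypothetical.

```python
def wallace_tm(probe: str) -> float:
    """Rough melting temperature in degrees Celsius by the Wallace rule.

    Assumed stand-in for whatever model the described system uses.
    """
    at = probe.count("A") + probe.count("T")
    gc = probe.count("G") + probe.count("C")
    return 2.0 * at + 4.0 * gc


def select_probes(candidates, min_len, max_len, tm_low, tm_high):
    """Keep candidates whose length and estimated Tm fall in the given ranges."""
    return [
        p for p in candidates
        if min_len <= len(p) <= max_len and tm_low <= wallace_tm(p) <= tm_high
    ]


if __name__ == "__main__":
    candidates = ["ACGTACGTACGTACGT", "AAAAAAAAAAAAAAAA", "GCGCGCGCGCGCGCGC"]
    print(select_probes(candidates, 14, 20, 45.0, 55.0))
    # ['ACGTACGTACGTACGT']
```

A filter like this only narrows the candidates by experimental conditions. Specificity against other viral sequences, which is the main concern of the abstract, would have to be checked separately.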
Affiliation(s)
- Feng-Mao Lin
- Department of Computer Science and Information Engineering, National Central University, Jhongli City, Taiwan, ROC.
|
35
|
PathogenMIPer: a tool for the design of molecular inversion probes to detect multiple pathogens. BMC Bioinformatics 2006; 7:500. [PMID: 17105657 PMCID: PMC1657037 DOI: 10.1186/1471-2105-7-500] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/26/2006] [Accepted: 11/14/2006] [Indexed: 12/29/2022] Open
Abstract
BACKGROUND Here we describe PathogenMIPer, a software program for designing molecular inversion probe (MIP) oligonucleotides for use in pathogen identification and detection. The software designs unique and specific oligonucleotide probes targeting microbial or other genomes. The tool tailors all probe sequence components (including target-specific sequences, barcode sequences, universal primers and restriction sites) and combines these components into ready-to-order probes for use in a MIP assay. The system can harness the genetic variability available in an entire genome in designing specific probes for the detection of multiple co-infections in a single tube using a MIP assay. RESULTS PathogenMIPer can accept sequence data in FASTA file format, and other parameter inputs from the user through a graphical user interface. It can design MIPs not only for pathogens, but for any genome for use in parallel genomic analyses. The software was validated experimentally by applying it to the detection of human papilloma virus (HPV) as a model system, which is associated with various human malignancies including cervical and skin cancers. Initial tests of laboratory samples using the MIPs developed by PathogenMIPer to recognize 24 different types of HPVs gave very promising results, detecting even a small viral load of single as well as multiple infections (Akhras et al, personal communication). CONCLUSION PathogenMIPer is software for designing molecular inversion probes for detection of multiple target DNAs in a sample using MIP assays. It enables broader use of MIP technology in the detection through genotyping of pathogens that are complex, difficult-to-amplify, or present in multiple subtypes in a sample.
|
36
|
Kahng AB, Măndoiu II, Reda S, Xu X, Zelikovsky AZ. COMPUTER-AIDED OPTIMIZATION OF DNA ARRAY DESIGN AND MANUFACTURING. DESIGN AUTOMATION METHODS AND TOOLS FOR MICROFLUIDICS-BASED BIOCHIPS 2006:235-269. [DOI: 10.1007/1-4020-5123-9_10] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 09/01/2023]
|
37
|
Tembe W, Zavaljevski N, Bode E, Chase C, Geyer J, Wasieloski L, Benson G, Reifman J. Oligonucleotide fingerprint identification for microarray-based pathogen diagnostic assays. Bioinformatics 2006; 23:5-13. [PMID: 17068088 DOI: 10.1093/bioinformatics/btl549] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
MOTIVATION Advances in DNA microarray technology and computational methods have unlocked new opportunities to identify 'DNA fingerprints', i.e. oligonucleotide sequences that uniquely identify a specific genome. We present an integrated approach for the computational identification of DNA fingerprints for design of microarray-based pathogen diagnostic assays. We provide a quantifiable definition of a DNA fingerprint stated both from a computational as well as an experimental point of view, and the analytical proof that all in silico fingerprints satisfying the stated definition are found using our approach. RESULTS The presented computational approach is implemented in an integrated high-performance computing (HPC) software tool for oligonucleotide fingerprint identification termed TOFI. We employed TOFI to identify in silico DNA fingerprints for several bacteria and plasmid sequences, which were then experimentally evaluated as potential probes for microarray-based diagnostic assays. Results and analysis of approximately 150 in silico DNA fingerprints for Yersinia pestis and 250 fingerprints for Francisella tularensis are presented. AVAILABILITY The implemented algorithm is available upon request.
Affiliation(s)
- Waibhav Tembe
- Biotechnology HPC Software Applications Institute, Telemedicine and Advanced Technology Research Center, US Army Medical Research and Materiel Command, Ft. Detrick, MD Boston, MA, USA
|
38
|
Yamada T, Soma H, Morishita S. PrimerStation: a highly specific multiplex genomic PCR primer design server for the human genome. Nucleic Acids Res 2006; 34:W665-9. [PMID: 16845094 PMCID: PMC1538814 DOI: 10.1093/nar/gkl297] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
PrimerStation (http://ps.cb.k.u-tokyo.ac.jp) is a web service that calculates primer sets guaranteeing high specificity against the entire human genome. To achieve high accuracy, we used the hybridization ratio of primers in liquid solution. Calculating the status of sequence hybridization in terms of the stringent hybridization ratio is computationally costly, and no web service checks the entire human genome and returns a highly specific primer set calculated using a precise physicochemical model. To shorten the response time, we precomputed candidates for specific primers using a massively parallel computer with 100 CPUs (SunFire 15 K) about 3 months in advance. This enables PrimerStation to search and output qualified primers interactively. PrimerStation can select highly specific primers suitable for multiplex PCR by seeking a wider temperature range that minimizes the possibility of cross-reaction. It also allows users to add heuristic rules to the primer design, e.g. the exclusion of single nucleotide polymorphisms (SNPs) in primers, the avoidance of poly(A) and CA-repeats in the PCR products, and the elimination of defective primers using the secondary structure prediction. We performed several tests to verify the PCR amplification of randomly selected primers for ChrX, and we confirmed that the primers amplify specific PCR products perfectly.
Affiliation(s)
- Tomoyuki Yamada
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, 5-1-5 Kashiwanoha, Kashiwa, Chiba 277-8562, Japan.
|
39
|
Chou CC, Lee TT, Chen CH, Hsiao HY, Lin YL, Ho MS, Yang PC, Peck K. Design of microarray probes for virus identification and detection of emerging viruses at the genus level. BMC Bioinformatics 2006; 7:232. [PMID: 16643672 PMCID: PMC1523220 DOI: 10.1186/1471-2105-7-232] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2005] [Accepted: 04/28/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Most virus detection methods are geared towards the detection of specific single viruses or just a few known targets, and lack the capability to uncover the novel viruses that cause emerging viral infections. To address this issue, we developed a computational method that identifies the conserved viral sequences at the genus level for all viral genomes available in GenBank, and established a virus probe library. The virus probes are used not only to identify known viruses but also for discerning the genera of emerging or uncharacterized ones. RESULTS Using the microarray approach, the identity of the virus in a test sample is determined by the signals of both genus and species-specific probes. The genera of emerging and uncharacterized viruses are determined based on hybridization of the viral sequences to the conserved probes for the existing viral genera. A detection and classification procedure to determine the identity of a virus directly from detection signals results in the rapid identification of the virus. CONCLUSION We have demonstrated the validity and feasibility of the above strategy with a small number of viral samples. The probe design algorithm can be applied to any publicly available viral sequence database. The strategy of using separate genus and species probe sets enables the use of a straightforward virus identity calculation directly based on the hybridization signals. Our virus identification strategy has great potential in the diagnosis of viral infections. The virus genus and specific probe database and the associated summary tables are available at http://genestamp.sinica.edu.tw/virus/index.htm.
Affiliation(s)
- Cheng-Chung Chou
- Center for Genomic Medicine, National Taiwan University, Taipei, 100, ROC
- Te-Tsui Lee
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, 115, ROC
- Chun-Houh Chen
- Institute of Statistical Science, Academia Sinica, Taipei, Taiwan, 115, ROC
- Hsiang-Yun Hsiao
- Center for Genomic Medicine, National Taiwan University, Taipei, 100, ROC
- Yi-Ling Lin
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, 115, ROC
- Mei-Shang Ho
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, 115, ROC
- Pan-Chyr Yang
- Center for Genomic Medicine, National Taiwan University, Taipei, 100, ROC
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, 115, ROC
- Konan Peck
- Institute of Biomedical Sciences, Academia Sinica, Taipei, Taiwan, 115, ROC
|
40
|
Lehner A, Loy A, Behr T, Gaenge H, Ludwig W, Wagner M, Schleifer KH. Oligonucleotide microarray for identification of Enterococcus species. FEMS Microbiol Lett 2005; 246:133-42. [PMID: 15869972 DOI: 10.1016/j.femsle.2005.04.002] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/13/2005] [Revised: 03/30/2005] [Accepted: 04/01/2005] [Indexed: 11/17/2022] Open
Abstract
For detection of most members of the Enterococcaceae, the specificity of a novel oligonucleotide microarray (ECC-PhyloChip) consisting of 41 hierarchically nested 16S or 23S rRNA gene-targeted probes was evaluated with 23 pure cultures (including 19 Enterococcus species). Target nucleic acids were prepared by PCR amplification of a 4.5-kb DNA fragment containing large parts of the 16S and 23S rRNA genes and were subsequently labeled fluorescently by random priming. Each tested member of the Enterococcaceae was correctly identified on the basis of its unique microarray hybridization pattern. The evaluated ECC-PhyloChip was successfully applied for identification of Enterococcus faecium and Enterococcus faecalis in artificially contaminated milk samples demonstrating the utility of the ECC-PhyloChip for parallel identification and differentiation of Enterococcus species in food samples.
Affiliation(s)
- Angelika Lehner
- Department of Microbiology, Technical University of Munich, D-85350 Freising, Germany
|
41
|
Stenberg J, Nilsson M, Landegren U. ProbeMaker: an extensible framework for design of sets of oligonucleotide probes. BMC Bioinformatics 2005; 6:229. [PMID: 16171527 PMCID: PMC1239912 DOI: 10.1186/1471-2105-6-229] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/17/2005] [Accepted: 09/19/2005] [Indexed: 11/24/2022] Open
Abstract
Background Procedures for genetic analyses based on oligonucleotide probes are powerful tools that can allow highly parallel investigations of genetic material. Such procedures require the design of large sets of probes using application-specific design constraints. Results ProbeMaker is a software framework for computer-assisted design and analysis of sets of oligonucleotide probe sequences. The tool assists in the design of probes for sets of target sequences, incorporating sequence motifs for purposes such as amplification, visualization, or identification. An extension system allows the framework to be equipped with application-specific components for evaluation of probe sequences, and provides the possibility to include support for importing sequence data from a variety of file formats. Conclusion ProbeMaker is a suitable tool for many different oligonucleotide design and analysis tasks, including the design of probe sets for various types of parallel genetic analyses, experimental validation of design parameters, and in silico testing of probe sequence evaluation algorithms.
Affiliation(s)
- Johan Stenberg
- Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, Se-751 85, Uppsala, Sweden
- Mats Nilsson
- Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, Se-751 85, Uppsala, Sweden
- Ulf Landegren
- Department of Genetics and Pathology, Rudbeck Laboratory, Uppsala University, Se-751 85, Uppsala, Sweden
|
42
|
Abstract
Functional genomics methods are used to investigate the huge amount of information contained in genomes. Numerous experimental methods rely on the use of oligo- or polynucleotides. Nucleotide strand hybridization forms the underlying principle for these methods. For all these techniques, the probes should be unique for analyzed genes. In addition to being unique for the studied genes, the probes should fulfill a large number of criteria to be usable and valid. The criteria include for example, avoidance of self-annealing, suitable melting temperature and nucleotide composition. We developed a method for searching unique and valid oligonucleotides or probes for genes so that there is not even a similar (approximate) occurrence in any other location of the whole genome. By using probe size 25, we analyzed 17 complete genomes representing a wide range of both prokaryotic and eukaryotic organisms. More than 92% of all the genes in the investigated genomes contained valid oligonucleotides. Extensive statistical tests were performed to characterize the properties of unique and valid oligonucleotides. Unique and valid oligonucleotides were relatively evenly distributed in genes except for the beginning and end, which were somewhat overrepresented. The flanking regions in eukaryotes were clearly underrepresented among suitable oligonucleotides. In addition to distributions within genes, the effects on codon and amino acid usage were also studied.
Affiliation(s)
- Mauno Vihinen
- Institute of Medical Technology, FI-33014 University of Tampere, Finland
- Research Unit, Tampere University Hospital, FI-33520 Tampere, Finland
- To whom correspondence should be addressed. Tel: +358 3 35517735; Fax: +358 3 35517710;
|
43
|
Cao Y, Wang L, Xu K, Kou C, Zhang Y, Wei G, He J, Wang Y, Zhao L. Information theory-based algorithm for in silico prediction of PCR products with whole genomic sequences as templates. BMC Bioinformatics 2005; 6:190. [PMID: 16042814 PMCID: PMC1183192 DOI: 10.1186/1471-2105-6-190] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/18/2005] [Accepted: 07/26/2005] [Indexed: 11/10/2022] Open
Abstract
Background A new algorithm for assessing similarity between primer and template has been developed based on the hypothesis that annealing of primer to template is an information transfer process. Results Primer sequence is converted to a vector of the full potential hydrogen numbers (3 for G or C, 2 for A or T), while template sequence is converted to a vector of the actual hydrogen bond numbers formed after primer annealing. The former is considered as source information and the latter destination information. An information coefficient is calculated as a measure for fidelity of this information transfer process and thus a measure of similarity between primer and potential annealing site on template. Conclusion Successful prediction of PCR products from whole genomic sequences with a computer program based on the algorithm demonstrated the potential of this new algorithm in areas like in silico PCR and gene finding.
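The vector encoding described in this abstract can be illustrated with a minimal Python sketch. The bond counts (3 for G or C, 2 for A or T) come from the abstract. Everything else is an assumption made for illustration: the function names are hypothetical, a mismatched position is assumed to contribute zero bonds, and `bond_ratio` is a plain ratio used as a stand-in, not the information coefficient defined in the paper.

```python
# Hydrogen bonds per Watson-Crick pair, as stated in the abstract.
H_BONDS = {"A": 2, "T": 2, "G": 3, "C": 3}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def potential_vector(primer):
    """Full potential hydrogen bond numbers of the primer (source information)."""
    return [H_BONDS[base] for base in primer]


def actual_vector(primer, site):
    """Bonds actually formed when primer[i] faces site[i] (destination information).

    Assumption: a non-complementary pair forms no hydrogen bonds.
    """
    if len(primer) != len(site):
        raise ValueError("primer and site must have the same length")
    return [H_BONDS[p] if COMPLEMENT[p] == s else 0 for p, s in zip(primer, site)]


def bond_ratio(primer, site):
    """Hypothetical similarity score: fraction of potential bonds that are formed."""
    return sum(actual_vector(primer, site)) / sum(potential_vector(primer))
```

For example, `potential_vector("GATC")` gives `[3, 2, 2, 3]`, and a site that mispairs only at the last position yields `[3, 2, 2, 0]`.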
Affiliation(s)
- Youfang Cao
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Lianjie Wang
- Department of Food Science and Engineering, Northwest Institute of Light Industry, Xianyang 712081, Shaanxi, China
- Kexue Xu
- Institute of Botany, Chinese Academy of Sciences, Beijing 100093, China
- Chunhai Kou
- Department of Computer Science, Shanghai Jiao Tong University, Shanghai 200240, China
- Yulei Zhang
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Guifang Wei
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Junjian He
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Yunfang Wang
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
- Liping Zhao
- School of Life Science and Biotechnology, Shanghai Jiao Tong University, Shanghai 200240, China
|
44
|
Hashsham SA, Wick LM, Rouillard JM, Gulari E, Tiedje JM. Potential of DNA microarrays for developing parallel detection tools (PDTs) for microorganisms relevant to biodefense and related research needs. Biosens Bioelectron 2005; 20:668-83. [PMID: 15522582 DOI: 10.1016/j.bios.2004.06.032] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
Development of parallel detection tools using microarrays is critically reviewed in view of the need for screening multiple microorganisms in a single test. Potential research needs with respect to probe design and specificity, validation, sample concentration, selective target enrichment and amplification, and data analysis are discussed. Data illustrating selected probe design issues for detecting multiple targets in mixed microbial systems are presented. Challenges with respect to cost, time, and ease of use compared to other methods are also summarized.
Affiliation(s)
- Syed A Hashsham
- Department of Civil and Environmental Engineering, Michigan State University, A 126 Research Complex-Engineering, East Lansing, MI 48824, USA.
|
45
|
Leber M, Kaderali L, Schönhuth A, Schrader R. A fractional programming approach to efficient DNA melting temperature calculation. Bioinformatics 2005; 21:2375-82. [PMID: 15769839 DOI: 10.1093/bioinformatics/bti379] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION In a wide range of experimental techniques in biology, there is a need for an efficient method to calculate the melting temperature of pairings of two single DNA strands. Avoiding cross-hybridization when choosing primers for the polymerase chain reaction or selecting probes for large-scale DNA assays are examples where the exact determination of melting temperatures is important. Beyond being exact, the method has to be efficient, as these techniques often require the simultaneous calculation of melting temperatures of up to millions of possible pairings. The problem is to simultaneously determine the most stable alignment of two sequences, including potential loops and bulges, and calculate the corresponding melting temperature. RESULTS As the melting temperature can be expressed as a fraction in terms of enthalpy and entropy differences of the corresponding annealing reaction, we propose to use a fractional programming algorithm, the Dinkelbach algorithm, to solve the problem. To calculate the required differences of enthalpy and entropy, the Nearest Neighbor model is applied. Using this model, the substeps of the Dinkelbach algorithm in our problem setting turn out to be calculations of alignments which optimize an additive score function. Thus, the usual dynamic programming techniques can be applied. The result is an efficient algorithm to determine melting temperatures of two DNA strands, suitable for large-scale applications such as primer or probe design. AVAILABILITY The software is available for academic purposes from the authors. A web interface is provided at http://www.zaik.uni-koeln.de/bioinformatik/fptm.html
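The Dinkelbach iteration named in this abstract can be sketched in Python for a generic ratio maximization. This is only an illustration under stated assumptions: each candidate is a precomputed `(numerator, denominator)` pair with a positive denominator, and the inner maximization is done by explicit enumeration, whereas the abstract describes that substep as an alignment computed by dynamic programming. The function name and the toy numbers are hypothetical.

```python
def dinkelbach_max_ratio(candidates, tol=1e-12, max_iter=100):
    """Maximize n/d over (n, d) pairs with d > 0 using Dinkelbach's method.

    Each round solves the additive problem max(n - lam * d) and then updates
    lam to the ratio of the maximizer. The optimum is reached when that
    additive maximum is zero.
    """
    current = candidates[0]
    lam = current[0] / current[1]
    for _ in range(max_iter):
        # Additive subproblem (enumeration here, an assumption of this sketch).
        best = max(candidates, key=lambda c: c[0] - lam * c[1])
        if best[0] - lam * best[1] <= tol:
            return lam, current
        current = best
        lam = current[0] / current[1]
    return lam, current


# Invented toy candidates; their ratios are 0.5, 0.75, 0.5 and 0.8.
toy = [(1.0, 2.0), (3.0, 4.0), (5.0, 10.0), (2.0, 2.5)]
best_ratio, best_pair = dinkelbach_max_ratio(toy)
```

The point of the design is that each round only needs an additive score, which is what makes standard alignment-style dynamic programming applicable in the setting the abstract describes.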
Affiliation(s)
- Markus Leber
- Institute for Biochemistry, University of Cologne, Zülpicher Strasse 47, Köln, D-50674, Germany
|
46
|
Rahmann S. Fast large scale oligonucleotide selection using the longest common factor approach. J Bioinform Comput Biol 2005; 1:343-61. [PMID: 15290776 DOI: 10.1142/s0219720003000125] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/15/2002] [Revised: 12/07/2002] [Accepted: 01/15/2003] [Indexed: 11/18/2022]
Abstract
We present a fast method that selects oligonucleotide probes (such as DNA 25-mers) for microarray experiments on a truly large scale. For example, reliable oligos for human genes can be found within four days, a speedup of one to two orders of magnitude compared to previous approaches. This speed is attained by using the longest common substring as a specificity measure for candidate oligos. We present a space- and time-efficient algorithm, based on a suffix array with additional information, to compute matching statistics (lengths of longest matches) between all candidate oligos and all remaining sequences. With the matching statistics available, we show how to incorporate constraints such as oligo length, melting temperature, and self-complementarity into the selection process at a postprocessing stage. As a result, we can now design custom oligos for any sequenced genome, just as the technology for on-site chip synthesis is becoming increasingly mature.
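The idea of using the longest common substring as a specificity measure can be sketched as follows. This is a simplified illustration, not the suffix-array algorithm from the abstract: it uses a quadratic dynamic program per sequence pair, and the function names, the `max_match` threshold, and the toy sequences in the usage note are assumptions.

```python
def longest_common_substring_length(a, b):
    """Length of the longest contiguous string occurring in both a and b."""
    best = 0
    prev = [0] * (len(b) + 1)
    for i in range(1, len(a) + 1):
        cur = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            if a[i - 1] == b[j - 1]:
                # Extend the common run that ended at the previous characters.
                cur[j] = prev[j - 1] + 1
                best = max(best, cur[j])
        prev = cur
    return best


def select_oligos(target, others, length, max_match):
    """Keep oligos of `target` whose longest match to any other sequence is short.

    Returns (start, oligo, longest_match) tuples. Further constraints such as
    melting temperature would be applied afterwards and are not modelled here.
    """
    selected = []
    for start in range(len(target) - length + 1):
        oligo = target[start:start + length]
        worst = max(
            (longest_common_substring_length(oligo, o) for o in others),
            default=0,
        )
        if worst <= max_match:
            selected.append((start, oligo, worst))
    return selected
```

With the toy call `select_oligos("ACGTTTGA", ["ACGTAAAA"], 4, 2)`, the windows starting at positions 2, 3 and 4 are kept, because the first two windows share a run of three or more bases with the other sequence.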
Affiliation(s)
- Sven Rahmann
- Max-Planck-Institute for Molecular Genetics, Ihnestrasse 63-73, D-14195 Berlin, Germany.
|
47
|
Kucherov G, Noé L, Roytberg M. Multiseed lossless filtration. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2005; 2:51-61. [PMID: 17044164 DOI: 10.1109/tcbb.2005.12] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/12/2023]
Abstract
We study a method of seed-based lossless filtration for approximate string matching and related bioinformatics applications. The method is based on a simultaneous use of several spaced seeds rather than a single seed as studied by Burkhardt and Kärkkäinen. We present algorithms to compute several important parameters of seed families, study their combinatorial properties, and describe several techniques to construct efficient families. We also report a large-scale application of the proposed technique to the problem of oligonucleotide selection for an EST sequence database.
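What "lossless" means for a family of spaced seeds can be made concrete with a brute-force check in Python. This is only a sketch and not one of the algorithms from the paper: it enumerates every placement of `k` mismatches in an alignment of length `m` and verifies that some seed still fits on matching positions only. The `#` (must match) and `-` (don't care) seed notation and the function names are assumptions.

```python
from itertools import combinations


def seed_hits(seed, match_mask):
    """True if the seed fits at some offset with every '#' on a matching position."""
    span = len(seed)
    for offset in range(len(match_mask) - span + 1):
        if all(match_mask[offset + i] for i, c in enumerate(seed) if c == "#"):
            return True
    return False


def is_lossless(seeds, m, k):
    """Check that the seed family detects every length-m alignment with k mismatches.

    Checking exactly k mismatches is enough (for k <= m): turning a mismatch
    into a match can never destroy an existing seed hit.
    """
    for mismatches in combinations(range(m), k):
        mask = [True] * m
        for pos in mismatches:
            mask[pos] = False
        if not any(seed_hits(seed, mask) for seed in seeds):
            return False
    return True
```

For instance, the contiguous seed `###` is lossless for `m = 11, k = 2`, since two mismatches leave nine matches in at most three runs, while `####` is not (mismatches at positions 3 and 7 leave only runs of three). The enumeration is exponential in `k`, so it is only practical for small parameters.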
Affiliation(s)
- Gregory Kucherov
- INRIA/LORIA, 615, rue du Jardin Botanique, B.P. 101, 54602 Villers-lès-Nancy, France.
|
48
|
Kaplinski L, Andreson R, Puurand T, Remm M. MultiPLX: automatic grouping and evaluation of PCR primers. Bioinformatics 2004; 21:1701-2. [PMID: 15598831 DOI: 10.1093/bioinformatics/bti219] [Citation(s) in RCA: 53] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
UNLABELLED MultiPLX is a new program for automatic grouping of PCR primers. It can use many different parameters to estimate the compatibility of primers, such as primer-primer interactions, primer-product interactions, difference in melting temperatures, difference in product length and the risk of generating alternative products from the template. A unique feature of MultiPLX is the ability to automatically group a large number (thousands) of primer pairs. AVAILABILITY Binaries for Windows, Linux and Solaris are available from http://bioinfo.ebc.ee/download/. A graphical version with limited capabilities can be used through a web interface at http://bioinfo.ebc.ee/multiplx/. The source code of the program is available on request for academic users. CONTACT maido.remm@ut.ee.
Affiliation(s)
- Lauris Kaplinski
- Department of Bioinformatics, University of Tartu, Tartu, Estonia
|
49
|
Abstract
MOTIVATION Selecting oligonucleotide probes for use in microarray design, and other applications requiring signature sequences, involves identifying sequences which will bind strongly to their intended target, while binding only weakly (or preferably, not at all) to non-target sequences which may be present in the hybridization reaction. While many tools to assist in selection of such sequences exist, all the ones we examined lack important oligo design and software features. RESULTS YODA is an application for assisting biological researchers in selecting signature sequences. It incorporates a custom sequence similarity search to find potential cross-hybridizing non-target sequences. For this task, most oligo design tools rely on BLAST, which is ill suited for it due to an unacceptable risk of false negatives. YODA supports multiple probe design goals including single-genome, multiple-genome, pathogen-host and species/strain-identification. A graphical interface is provided as well as a command-line interface, both of which support many user-controlled parameters. YODA is easy to install and use and runs on Windows, Mac OS X and Linux platforms. AVAILABILITY Freely available (LGPL) along with source code and additional documentation at http://pathport.vbi.vt.edu/YODA CONTACT: enordber@vbi.vt.edu.
Affiliation(s)
- Eric K Nordberg
- Virginia Bioinformatics Institute, Virginia Polytechnic Institute and State University, Blacksburg, VA 24061, USA.
|
50
|
Abstract
MOTIVATION Designing highly effective short interfering RNA (siRNA) sequences with maximum target-specificity for mammalian RNA interference (RNAi) is one of the hottest topics in molecular biology. The relationship between siRNA sequences and RNAi activity has been studied extensively to establish rules for selecting highly effective sequences. However, there is a pressing need to compute siRNA sequences that minimize off-target silencing effects efficiently and to match any non-targeted sequences with mismatches. RESULTS The enumeration of potential cross-hybridization candidates is non-trivial, because siRNA sequences are short, ca. 19 nt in length, and at least three mismatches with non-targets are required. With at least three mismatches, there are typically four or five contiguous matches, so that a BLAST search frequently overlooks off-target candidates. By contrast, existing accurate approaches are expensive to execute; thus we need to develop an accurate, efficient algorithm that uses seed hashing, the pigeonhole principle, and combinatorics to identify mismatch patterns. Tests show that our method can list potential cross-hybridization candidates for any siRNA sequence of selected human gene rapidly, outperforming traditional methods by orders of magnitude in terms of computational performance. AVAILABILITY http://design.RNAi.jp CONTACT yamada@cb.k.u-tokyo.ac.jp.
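The combination of seed hashing and the pigeonhole principle mentioned in this abstract can be illustrated with a short Python sketch. It is a generic textbook-style filter and not the authors' algorithm: if a query is cut into `max_mismatches + 1` blocks, any site with at most `max_mismatches` substitutions must contain at least one block unchanged, so exact block lookups in a hash index yield every candidate, and each candidate is then verified. All names and the toy sequences are assumptions.

```python
from collections import defaultdict


def split_blocks(length, parts):
    """Cut range(length) into `parts` consecutive blocks of near-equal size."""
    base, extra = divmod(length, parts)
    bounds, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        bounds.append((start, start + size))
        start += size
    return bounds


def find_near_matches(query, text, max_mismatches):
    """Return sorted (start, mismatches) for sites within max_mismatches substitutions."""
    n = len(query)
    if n <= max_mismatches:
        raise ValueError("query must be longer than max_mismatches")
    blocks = split_blocks(n, max_mismatches + 1)
    # Hash index: for every block length in use, map each substring to its positions.
    index = defaultdict(list)
    for size in {end - start for start, end in blocks}:
        for pos in range(len(text) - size + 1):
            index[text[pos:pos + size]].append(pos)
    hits = {}
    for b_start, b_end in blocks:
        for pos in index.get(query[b_start:b_end], []):
            start = pos - b_start  # where the whole query would begin
            if start < 0 or start + n > len(text) or start in hits:
                continue
            # Verification step: count substitutions over the full site.
            mismatches = sum(1 for a, b in zip(query, text[start:start + n]) if a != b)
            if mismatches <= max_mismatches:
                hits[start] = mismatches
    return sorted(hits.items())
```

On the toy input `find_near_matches("ACGTACGTAC", "GGACGAACGTTCGG", 2)`, the site starting at position 2 is reported with two mismatches, found through its unchanged middle block `ACG`.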
Affiliation(s)
- Tomoyuki Yamada
- Department of Computational Biology, Graduate School of Frontier Sciences, University of Tokyo, Japan.
|