1
|
Fostier J. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. BMC Bioinformatics 2020; 21:81. [PMID: 32164557 PMCID: PMC7068855 DOI: 10.1186/s12859-020-3348-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND The identification of all matches of a large set of position weight matrices (PWMs) in long DNA sequences requires significant computational resources for which a number of efficient yet complex algorithms have been proposed. RESULTS We propose BLAMM, a simple and efficient tool inspired by high performance computing techniques. The workload is expressed in terms of matrix-matrix products that are evaluated with high efficiency using optimized BLAS library implementations. The algorithm is easy to parallelize and implement on CPUs and GPUs and has a runtime that is independent of the selected p-value. In terms of single-core performance, it is competitive with state-of-the-art software for PWM matching while being much more efficient when using multithreading. Additionally, BLAMM requires negligible memory. For example, both strands of the entire human genome can be scanned for 1404 PWMs in the JASPAR database in 13 min with a p-value of 10-4 using a 36-core machine. On a dual GPU system, the same task can be performed in under 5 min. CONCLUSIONS BLAMM is an efficient tool for identifying PWM matches in large DNA sequences. Its C++ source code is available under the GNU General Public License Version 3 at https://github.com/biointec/blamm.
Collapse
Affiliation(s)
- Jan Fostier
- Department of Information Technology - IDLab, Ghent University - imec, Technologiepark 126, Ghent (Zwijnaarde), B-9052, Belgium.
| |
Collapse
|
2
|
Gao L, Bao W, Zhang H, Yuan CA, Huang DS. Fast sequence analysis based on diamond sampling. PLoS One 2018; 13:e0198922. [PMID: 29953448 PMCID: PMC6023231 DOI: 10.1371/journal.pone.0198922] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2018] [Accepted: 05/29/2018] [Indexed: 12/02/2022] Open
Abstract
Both in DNA and protein contexts, an important method for modelling motifs is to utilize position weight matrix (PWM) in biological sequences. With the development of genome sequencing technology, the quantity of the sequence data is increasing explosively, so the faster searching algorithms which have the ability to meet the increasingly need are desired to develop. In this paper, we proposed a method for speeding up the searching process of candidate transcription factor binding sites (TFBS), and the users can be allowed to specify p threshold to get the desired trade-off between speed and sensitivity for a particular sequence analysis. Moreover, the proposed method can also be generalized to large-scale annotation and sequence projects.
Collapse
Affiliation(s)
- Liangxin Gao
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Wenzhen Bao
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Hongbo Zhang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| | - Chang-An Yuan
- Science Computing and Intelligent Information Processing of GuangXi Higher Education Key Laboratory, Guangxi Teachers Education University, Nanning, Guangxi, China
| | - De-Shuang Huang
- Institute of Machine Learning and Systems Biology, School of Electronics and Information Engineering, Tongji University, Shanghai, China
| |
Collapse
|
3
|
Korhonen JH, Palin K, Taipale J, Ukkonen E. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 2017; 33:514-521. [PMID: 28011774 DOI: 10.1093/bioinformatics/btw683] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2016] [Accepted: 10/27/2016] [Indexed: 01/09/2023] Open
Abstract
Motivation While the position weight matrix (PWM) is the most popular model for sequence motifs, there is growing evidence of the usefulness of more advanced models such as first-order Markov representations, and such models are also becoming available in well-known motif databases. There has been lots of research of how to learn these models from training data but the problem of predicting putative sites of the learned motifs by matching the model against new sequences has been given less attention. Moreover, motif site analysis is often concerned about how different variants in the sequence affect the sites. So far, though, the corresponding efficient software tools for motif matching have been lacking. Results We develop fast motif matching algorithms for the aforementioned tasks. First, we formalize a framework based on high-order position weight matrices for generic representation of motif models with dinucleotide or general q -mer dependencies, and adapt fast PWM matching algorithms to the high-order PWM framework. Second, we show how to incorporate different types of sequence variants , such as SNPs and indels, and their combined effects into efficient PWM matching workflows. Benchmark results show that our algorithms perform well in practice on genome-sized sequence sets and are for multiple motif search much faster than the basic sliding window algorithm. Availability and Implementation Implementations are available as a part of the MOODS software package under the GNU General Public License v3.0 and the Biopython license ( http://www.cs.helsinki.fi/group/pssmfind ). Contact janne.h.korhonen@gmail.com.
Collapse
Affiliation(s)
- Janne H Korhonen
- School of Computer Science, Reykjavík University, Reykjavík, Iceland.,Helsinki Institute for Information Technology HIIT, Helsinki, Finland.,Department of Computer Science
| | - Kimmo Palin
- Genome-Scale Biology Research Program, Research Programs Unit
| | - Jussi Taipale
- Department of Biosciences and Nutrition, Karolinska Institutet, Genome Scale Biology Program, Faculty of Medicine, University of Helsinki, Helsinki, Finland
| | - Esko Ukkonen
- Helsinki Institute for Information Technology HIIT, Helsinki, Finland.,Department of Computer Science
| |
Collapse
|
4
|
Jain S, Bader GD. Predicting physiologically relevant SH3 domain mediated protein-protein interactions in yeast. Bioinformatics 2016; 32:1865-72. [PMID: 26861823 PMCID: PMC4908317 DOI: 10.1093/bioinformatics/btw045] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2015] [Revised: 12/05/2015] [Accepted: 01/20/2016] [Indexed: 12/02/2022] Open
Abstract
MOTIVATION Many intracellular signaling processes are mediated by interactions involving peptide recognition modules such as SH3 domains. These domains bind to small, linear protein sequence motifs which can be identified using high-throughput experimental screens such as phage display. Binding motif patterns can then be used to computationally predict protein interactions mediated by these domains. While many protein-protein interaction prediction methods exist, most do not work with peptide recognition module mediated interactions or do not consider many of the known constraints governing physiologically relevant interactions between two proteins. RESULTS A novel method for predicting physiologically relevant SH3 domain-peptide mediated protein-protein interactions in S. cerevisae using phage display data is presented. Like some previous similar methods, this method uses position weight matrix models of protein linear motif preference for individual SH3 domains to scan the proteome for potential hits and then filters these hits using a range of evidence sources related to sequence-based and cellular constraints on protein interactions. The novelty of this approach is the large number of evidence sources used and the method of combination of sequence based and protein pair based evidence sources. By combining different peptide and protein features using multiple Bayesian models we are able to predict high confidence interactions with an overall accuracy of 0.97. AVAILABILITY AND IMPLEMENTATION Domain-Motif Mediated Interaction Prediction (DoMo-Pred) command line tool and all relevant datasets are available under GNU LGPL license for download from http://www.baderlab.org/Software/DoMo-Pred The DoMo-Pred command line tool is implemented using Python 2.7 and C ++. CONTACT gary.bader@utoronto.ca SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Shobhit Jain
- Department of Computer Science and The Donnelly Centre, University of Toronto, Toronto, ON, Canada
| | - Gary D Bader
- Department of Computer Science and The Donnelly Centre, University of Toronto, Toronto, ON, Canada
| |
Collapse
|
5
|
Pizzi C, Rastas P, Ukkonen E. Finding significant matches of position weight matrices in linear time. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011; 8:69-79. [PMID: 21071798 DOI: 10.1109/tcbb.2009.35] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]
Abstract
Position weight matrices are an important method for modeling signals or motifs in biological sequences, both in DNA and protein contexts. In this paper, we present fast algorithms for the problem of finding significant matches of such matrices. Our algorithms are of the online type, and they generalize classical multipattern matching, filtering, and superalphabet techniques of combinatorial string matching to the problem of weight matrix matching. Several variants of the algorithms are developed, including multiple matrix extensions that perform the search for several matrices in one scan through the sequence database. Experimental performance evaluation is provided to compare the new techniques against each other as well as against some other online and index-based algorithms proposed in the literature. Compared to the brute-force O(mn) approach, our solutions can be faster by a factor that is proportional to the matrix length m. Our multiple-matrix filtration algorithm had the best performance in the experiments. On a current PC, this algorithm finds significant matches (p = 0.0001) of the 123 JASPAR matrices in the human genome in about 18 minutes.
Collapse
Affiliation(s)
- Cinzia Pizzi
- Department of Information Engineering, Università degli Studi di Padova, via Gradenigo 6/b, 35131 Padova, Italy.
| | | | | |
Collapse
|
6
|
Cornman RS. The distribution of GYR- and YLP-like motifs in Drosophila suggests a general role in cuticle assembly and other protein-protein interactions. PLoS One 2010; 5. [PMID: 20824096 PMCID: PMC2932725 DOI: 10.1371/journal.pone.0012536] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2010] [Accepted: 08/02/2010] [Indexed: 01/26/2023] Open
Abstract
Background Arthropod cuticle is composed predominantly of a self-assembling matrix of chitin and protein. Genes encoding structural cuticular proteins are remarkably abundant in arthropod genomes, yet there has been no systematic survey of conserved motifs across cuticular protein families. Methodology/Principal Findings Two short sequence motifs with conserved tyrosines were identified in Drosophila cuticular proteins that were similar to the GYR and YLP Interpro domains. These motifs were found in members of the CPR, Tweedle, CPF/CPFL, and (in Anopheles gambiae) CPLCG cuticular protein families, and the Dusky/Miniature family of cuticle-associated proteins. Tweedle proteins have a characteristic motif architecture that is shared with the Drosophila protein GCR1 and its orthologs in other species, suggesting that GCR1 is also cuticular. A resilin repeat, which has been shown to confer elasticity, matched one of the motifs; a number of other Drosophila proteins of unknown function exhibit a motif architecture similar to that of resilin. The motifs were also present in some proteins of the peritrophic matrix and the eggshell, suggesting molecular convergence among distinct extracellular matrices. More surprisingly, gene regulation, development, and proteolysis were statistically over-represented ontology terms for all non-cuticular matches in Drosophila. Searches against other arthropod genomes indicate that the motifs are taxonomically widespread. Conclusions This survey suggests a more general definition for GYR and YLP motifs and reveals their contribution to several types of extracellular matrix. They may define sites of protein interaction with DNA or other proteins, based on ontology analysis. These results can help guide experimental studies on the biochemistry of cuticle assembly.
Collapse
Affiliation(s)
- R Scott Cornman
- Department of Cellular Biology, University of Georgia, Athens, Georgia, United States of America.
| |
Collapse
|
7
|
Beckstette M, Homann R, Giegerich R, Kurtz S. Significant speedup of database searches with HMMs by search space reduction with PSSM family models. ACTA ACUST UNITED AC 2009; 25:3251-8. [PMID: 19828575 PMCID: PMC2788931 DOI: 10.1093/bioinformatics/btp593] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
MOTIVATION Profile hidden Markov models (pHMMs) are currently the most popular modeling concept for protein families. They provide sensitive family descriptors, and sequence database searching with pHMMs has become a standard task in today's genome annotation pipelines. On the downside, searching with pHMMs is computationally expensive. RESULTS We propose a new method for efficient protein family classification and for speeding up database searches with pHMMs as is necessary for large-scale analysis scenarios. We employ simpler models of protein families called position-specific scoring matrices family models (PSSM-FMs). For fast database search, we combine full-text indexing, efficient exact p-value computation of PSSM match scores and fast fragment chaining. The resulting method is well suited to prefilter the set of sequences to be searched for subsequent database searches with pHMMs. We achieved a classification performance only marginally inferior to hmmsearch, yet, results could be obtained in a fraction of runtime with a speedup of >64-fold. In experiments addressing the method's ability to prefilter the sequence space for subsequent database searches with pHMMs, our method reduces the number of sequences to be searched with hmmsearch to only 0.80% of all sequences. The filter is very fast and leads to a total speedup of factor 43 over the unfiltered search, while retaining >99.5% of the original results. In a lossless filter setup for hmmsearch on UniProtKB/Swiss-Prot, we observed a speedup of factor 92. AVAILABILITY The presented algorithms are implemented in the program PoSSuMsearch2, available for download at http://bibiserv.techfak.uni-bielefeld.de/possumsearch2/. CONTACT beckstette@zbh.uni-hamburg.de SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Michael Beckstette
- Center for Bioinformatics, University of Hamburg, Bundesstrasse 43, 20146 Hamburg, Germany.
| | | | | | | |
Collapse
|
8
|
Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 2009; 25:3181-2. [PMID: 19773334 PMCID: PMC2778336 DOI: 10.1093/bioinformatics/btp554] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
UNLABELLED MOODS (MOtif Occurrence Detection Suite) is a software package for matching position weight matrices against DNA sequences. MOODS implements state-of-the-art online matching algorithms, achieving considerably faster scanning speed than with a simple brute-force search. MOODS is written in C++, with bindings for the popular BioPerl and Biopython toolkits. It can easily be adapted for different purposes and integrated into existing workflows. It can also be used as a C++ library. AVAILABILITY The package with documentation and examples of usage is available at http://www.cs.helsinki.fi/group/pssmfind. The source code is also available under the terms of a GNU General Public License (GPL).
Collapse
Affiliation(s)
- Janne Korhonen
- Department of Computer Science and Helsinki Institute for Information Technology, University of Helsinki, Helsinki, Finland.
| | | | | | | | | |
Collapse
|
9
|
Pape UJ, Rahmann S, Sun F, Vingron M. Compound poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. J Comput Biol 2008; 15:547-64. [PMID: 18631020 DOI: 10.1089/cmb.2007.0084] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
Transcription factors play a key role in gene regulation by interacting with specific binding sites or motifs. Therefore, enrichment of binding motifs is important for genome annotation and efficient computation of the statistical significance, the p-value, of the enrichment of motifs is crucial. We propose an efficient approximation to compute the significance. Due to the incorporation of both strands of the DNA molecules and explicit modeling of dependencies between overlapping hits, we achieve accurate results for any DNA motif based on its Position Frequency Matrix (PFM) representation. The accuracy of the p-value approximation is shown by comparison with the simulated count distribution. Furthermore, we compare the approach with a binomial approximation, (compound) Poisson approximation, and a normal approximation. In general, our approach outperforms these approximations or is equally good but significantly faster. An implementation of our approach is available at http://mosta.molgen.mpg.de.
Collapse
Affiliation(s)
- Utz J Pape
- Computational Biology, Max Planck Institute for Molecular Genetics, Berlin, Germany.
| | | | | | | |
Collapse
|
10
|
Lähdesmäki H, Rust AG, Shmulevich I. Probabilistic inference of transcription factor binding from multiple data sources. PLoS One 2008; 3:e1820. [PMID: 18364997 PMCID: PMC2268002 DOI: 10.1371/journal.pone.0001820] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2007] [Accepted: 02/04/2008] [Indexed: 11/21/2022] Open
Abstract
An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as, multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. Test data set, a web tool, source codes and supplementary data are available at: http://www.probtf.org.
Collapse
Affiliation(s)
- Harri Lähdesmäki
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - Alistair G. Rust
- Institute for Systems Biology, Seattle, Washington, United States of America
| | - Ilya Shmulevich
- Institute for Systems Biology, Seattle, Washington, United States of America
| |
Collapse
|
11
|
Pape UJ, Rahmann S, Vingron M. Natural similarity measures between position frequency matrices with an application to clustering. ACTA ACUST UNITED AC 2008; 24:350-7. [PMID: 18174183 DOI: 10.1093/bioinformatics/btm610] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
MOTIVATION Transcription factors (TFs) play a key role in gene regulation by binding to target sequences. In silico prediction of potential binding of a TF to a binding site is a well-studied problem in computational biology. The binding sites for one TF are represented by a position frequency matrix (PFM). The discovery of new PFMs requires the comparison to known PFMs to avoid redundancies. In general, two PFMs are similar if they occur at overlapping positions under a null model. Still, most existing methods compute similarity according to probabilistic distances of the PFMs. Here we propose a natural similarity measure based on the asymptotic covariance between the number of PFM hits incorporating both strands. Furthermore, we introduce a second measure based on the same idea to cluster a set of the Jaspar PFMs. RESULTS We show that the asymptotic covariance can be efficiently computed by a two dimensional convolution of the score distributions. The asymptotic covariance approach shows strong correlation with simulated data. It outperforms three alternative methods. The Jaspar clustering yields distinct groups of TFs of the same class. Furthermore, a representative PFM is given for each class. In contrast to most other clustering methods, PFMs with low similarity automatically remain singletons. AVAILABILITY A website to compute the similarity and to perform clustering, the source code and Supplementary Material are available at http://mosta.molgen.mpg.de.
Collapse
Affiliation(s)
- Utz J Pape
- Computational Biology, Max Planck Institute f. Molecular Genetics, Ihnestr. 73, 14195 Berlin, Germany.
| | | | | |
Collapse
|
12
|
Touzet H, Varré JS. Efficient and accurate P-value computation for Position Weight Matrices. Algorithms Mol Biol 2007; 2:15. [PMID: 18072973 PMCID: PMC2238751 DOI: 10.1186/1748-7188-2-15] [Citation(s) in RCA: 86] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2007] [Accepted: 12/11/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Position Weight Matrices (PWMs) are probabilistic representations of signals in sequences. They are widely used to model approximate patterns in DNA or in protein sequences. The usage of PWMs needs as a prerequisite to knowing the statistical significance of a word according to its score. This is done by defining the P-value of a score, which is the probability that the background model can achieve a score larger than or equal to the observed value. This gives rise to the following problem: Given a P-value, find the corresponding score threshold. Existing methods rely on dynamic programming or probability generating functions. For many examples of PWMs, they fail to give accurate results in a reasonable amount of time. RESULTS The contribution of this paper is two fold. First, we study the theoretical complexity of the problem, and we prove that it is NP-hard. Then, we describe a novel algorithm that solves the P-value problem efficiently. The main idea is to use a series of discretized score distributions that improves the final result step by step until some convergence criterion is met. Moreover, the algorithm is capable of calculating the exact P-value without any error, even for matrices with non-integer coefficient values. The same approach is also used to devise an accurate algorithm for the reverse problem: finding the P-value for a given score. Both methods are implemented in a software called TFM-PVALUE, that is freely available. CONCLUSION We have tested TFM-PVALUE on a large set of PWMs representing transcription factor binding sites. Experimental results show that it achieves better performance in terms of computational time and precision than existing tools.
Collapse
|
13
|
Koczyk G, Wyrwicz LS, Rychlewski L. LigProf: a simple tool for in silico prediction of ligand-binding sites. J Mol Model 2007; 13:445-55. [PMID: 17200839 DOI: 10.1007/s00894-006-0165-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2006] [Accepted: 10/25/2006] [Indexed: 10/23/2022]
Abstract
With the increasing amount of data provided by both high-throughput sequencing and structural genomics studies, there is a growing need for tools to augment functional predictions for protein sequences. Broad descriptions of function can be provided by establishing the presence of protein domains associated with a particular function. To extend the domain-based annotation, LigProf provides predictions of potential ligands that bind to a protein, as well as critical residues that stabilize ligands. A P-value statistic for estimating the significance of motif occurrence is provided for all sites. Although the usefulness of the method will rise with increasing numbers of crystallographically solved molecules deposited in the PDB database, we show that it can already be applied successfully to the highly represented ligand-bound protein kinase domains of viral and human origin. The LigProf webserver is freely available at: http://www.cropnet.pl/ligprof . At present, LigProf descriptors annotate and extend major protein families from the PfamA database.
Collapse
Affiliation(s)
- Grzegorz Koczyk
- Institute of Plant Genetics, Strzeszyńska 34, 60-479, Poznań, Poland.
| | | | | |
Collapse
|
14
|
Abstract
The level of conservation between two homologous sequences often varies among sequence regions; functionally important domains are more conserved than the remaining regions. Thus, multiple parameter sets should be used in alignment of homologous sequences with a stringent parameter set for highly conserved regions and a moderate parameter set for weakly conserved regions. We describe an alignment algorithm to allow dynamic use of multiple parameter sets with different levels of stringency in computation of an optimal alignment of two sequences. The algorithm dynamically considers various candidate alignments, partitions each candidate alignment into sections, and determines the most appropriate set of parameter values for each section of the alignment. The algorithm and its local alignment version are implemented in a computer program named GAP4. The local alignment algorithm in GAP4, that in its predecessor GAP3, and an ordinary local alignment program SIM were evaluated on 257 716 pairs of homologous sequences from 100 protein families. On 168 475 of the 257 716 pairs (a rate of 65.4%), alignments from GAP4 were more statistically significant than alignments from GAP3 and SIM.
Collapse
Affiliation(s)
- Xiaoqiu Huang
- Department of Computer Science, Iowa State University, Ames, IA 50011-1040, USA.
| | | |
Collapse
|
15
|
Zhang Y, Zaki MJ. SMOTIF: efficient structured pattern and profile motif search. Algorithms Mol Biol 2006; 1:22. [PMID: 17118189 PMCID: PMC1679804 DOI: 10.1186/1748-7188-1-22] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2006] [Accepted: 11/21/2006] [Indexed: 11/29/2022] Open
Abstract
BACKGROUND A structured motif allows variable length gaps between several components, where each component is a simple motif, which allows either no gaps or only fixed length gaps. The motif can either be represented as a pattern or a profile (also called positional weight matrix). We propose an efficient algorithm, called SMOTIF, to solve the structured motif search problem, i.e., given one or more sequences and a structured motif, SMOTIF searches the sequences for all occurrences of the motif. Potential applications include searching for long terminal repeat (LTR) retrotransposons and composite regulatory binding sites in DNA sequences. RESULTS SMOTIF can search for both pattern and profile motifs, and it is efficient in terms of both time and space; it outperforms SMARTFINDER, a state-of-the-art algorithm for structured motif search. Experimental results show that SMOTIF is about 7 times faster and consumes 100 times less memory than SMARTFINDER. It can effectively search for LTR retrotransposons and is well suited to searching for motifs with long range gaps. It is also successful in finding potential composite transcription factor binding sites. CONCLUSION SMOTIF is a useful and efficient tool in searching for structured pattern and profile motifs. The algorithm is available as open-source at: http://www.cs.rpi.edu/~zaki/software/sMotif/.
Collapse
Affiliation(s)
- Yongqiang Zhang
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
| | - Mohammed J Zaki
- Department of Computer Science, Rensselaer Polytechnic Institute, Troy, New York 12180, USA
| |
Collapse
|
16
|
Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006; 7:389. [PMID: 16930469 PMCID: PMC1635428 DOI: 10.1186/1471-2105-7-389] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2006] [Accepted: 08/24/2006] [Indexed: 11/10/2022] Open
Abstract
Background In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task. Results We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330. Conclusion Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@|m + m - 1, where m is the length of the PSSM and A
MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
Collapse
Affiliation(s)
- Michael Beckstette
- International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBITec), Bielefeld University, D-33594 Bielefeld, Germany
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
| | - Robert Homann
- International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBITec), Bielefeld University, D-33594 Bielefeld, Germany
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
| | - Robert Giegerich
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
| | - Stefan Kurtz
- Zentrum für Bioinformatik, Universität Hamburg, 20146 Hamburg, Germany
| |
Collapse
|
17
|
Su QJ, Lu L, Saxonov S, Brutlag DL. eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity. Nucleic Acids Res 2005; 33:D178-82. [PMID: 15608172 PMCID: PMC540014 DOI: 10.1093/nar/gki060] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open
Abstract
Classifying proteins into families and superfamilies allows identification of functionally important conserved domains. The motifs and scoring matrices derived from such conserved regions provide computational tools that recognize similar patterns in novel sequences, and thus enable the prediction of protein function for genomes. The eBLOCKs database enumerates a cascade of protein blocks with varied conservation levels for each functional domain. A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found in a larger group of more remotely related proteins with a reduced stringency. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members as in a superfamily. By applying PSI-BLAST and a modified K-means clustering algorithm, eBLOCKs automatically groups protein sequences according to different levels of similarity. Multiple sequence alignments are made and trimmed into a series of ungapped blocks. Motifs and position-specific scoring matrices were derived from eBLOCKs and made available for sequence search and annotation. The eBLOCKs database provides a tool for high-throughput genome annotation with maximal specificity and sensitivity. The eBLOCKs database is freely available on the World Wide Web at http://motif.stanford.edu/eblocks/ to all users for online usage. Academic and not-for-profit institutions wishing copies of the program may contact Douglas L. Brutlag (brutlag@stanford.edu). Commercial firms wishing copies of the program for internal installation may contact Jacqueline Tay at the Stanford Office of Technology Licensing (jacqueline.tay@stanford.edu; http://otl.stanford.edu/).
Collapse
|
18
|
Bejerano G, Friedman N, Tishby N. Efficient exact p-value computation for small sample, sparse, and surprising categorical data. J Comput Biol 2005; 11:867-86. [PMID: 15700407 DOI: 10.1089/cmb.2004.11.867] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
A major obstacle in applying various hypothesis testing procedures to datasets in bioinformatics is the computation of ensuing p-values. In this paper, we define a generic branch-and-bound approach to efficient exact p-value computation and enumerate the required conditions for successful application. Explicit procedures are developed for the entire Cressie-Read family of statistics, which includes the widely used Pearson and likelihood ratio statistics in a one-way frequency table goodness-of-fit test. This new formulation constitutes a first practical exact improvement over the exhaustive enumeration performed by existing statistical software. The general techniques we develop to exploit the convexity of many statistics are also shown to carry over to contingency table tests, suggesting that they are readily extendible to other tests and test statistics of interest. Our empirical results demonstrate a speed-up of orders of magnitude over the exhaustive computation, significantly extending the practical range for performing exact tests. We also show that the relative speed-up gain increases as the null hypothesis becomes sparser, that computation precision increases with increase in speed-up, and that computation time is very moderately affected by the magnitude of the computed p-value. These qualities make our algorithm especially appealing in the regimes of small samples, sparse null distributions, and rare events, compared to the alternative asymptotic approximations and Monte Carlo samplers. We discuss several established bioinformatics applications, where small sample size, small expected counts in one or more categories (sparseness), and very small p-values do occur. Our computational framework could be applied in these, and similar cases, to improve performance.
Collapse
Affiliation(s)
- Gill Bejerano
- Center for Biomolecular Science and Engineering, School of Engineering, University of California, Santa Cruz, CA 95064, USA.
| | | | | |
Collapse
|
19
|
Abstract
MOTIVATION Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) to match a sequence of N characters against a profile of length P << N. RESULTS In this work, we exploit string compression techniques to speedup brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.
Collapse
Affiliation(s)
- Valerio Freschi
- Information Science and Technology Institute, University of Urbino, 61029 Urbino, Italy
| | | |
Collapse
|
20
|
Barash Y, Elidan G, Kaplan T, Friedman N. CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics 2004; 21:596-600. [PMID: 15454407 DOI: 10.1093/bioinformatics/bti041] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION A key aspect of transcriptional regulation is the binding of transcription factors to sequence-specific binding sites that allow them to modulate the expression of nearby genes. Given models of such binding sites, one can scan regulatory regions for putative binding sites and construct a genome-wide regulatory network. In such genome-wide scans, it is crucial to control the amount of false positive predictions. Recently, several works demonstrated the benefits of modeling dependencies between positions within the binding site. Yet, computing the statistical significance of putative binding sites in this scenario remains a challenge. RESULTS We present a general, accurate and efficient method for computing p-values of putative binding sites that is applicable to a large class of probabilistic binding site and background models. We demonstrate the accuracy of the method on synthetic and real-life data. AVAILABILITY The procedure for scanning DNA sequences and computing the statistical significance of putative binding site scores is available upon request at http://compbio.cs.huji.ac.il/CIS/ CONTACT: nir@cs.huji.ac.il.
Collapse
Affiliation(s)
- Y Barash
- School of Computer Science & Engineering, The Hebrew University Jerusalem 91904, Israel
| | | | | | | |
Collapse
|
21
|
Qin Z, Shen M, Cohen SN. Identification and characterization of a pSLA2 plasmid locus required for linear DNA replication and circular plasmid stable inheritance in Streptomyces lividans. J Bacteriol 2003; 185:6575-82. [PMID: 14594830 PMCID: PMC262113 DOI: 10.1128/jb.185.22.6575-6582.2003] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open
Abstract
Streptomyces linear plasmids and linear chromosomes can replicate also in a circular form when their telomeres are deleted. The 17-kb linear plasmid pSLA2 has been a useful model in studies of such replicons. Here we report that the minimal origin initiating replication of pSLA2-derived plasmids as circular molecules cannot propagate these plasmids in a linear mode unless they also contain a novel plasmid-encoded locus, here named rlrA (required for linear replication). In contrast with the need for rlrA to accomplish replication of telomere-containing linear plasmids, expression of rlrA, which encodes two LuxR family regulatory domains, interferes with the establishment of pSLA2 in circular form in Streptomyces lividans transformants. The additional presence of an adjacent divergently transcribed locus, rorA (rlrA override), which strongly resembles the kor (kil override) transcription control genes identified previously on Streptomyces plasmids, reversed the detrimental effects of rlrA on plasmid establishment and additionally stabilized circular plasmid inheritance by spores during the S. lividans life cycle. While the effects of the rlrA/rorA locus of pSLA2 were seen also on linear plasmids derived from the unrelated SLP2 replicon, they did not extend to plasmids whose replication was initiated at a cloned chromosomal origin. Our results establish the existence of, and provide the initial description of, a novel plasmid-borne regulatory system that differentially affects the propagation of linear and circular plasmids in Streptomyces.
Collapse
Affiliation(s)
- Zhongjun Qin
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305-5120, USA
| | | | | |
Collapse
|
22
|
Bennett SP, Lu L, Brutlag DL. 3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence motifs. Nucleic Acids Res 2003; 31:3328-32. [PMID: 12824319 PMCID: PMC168971 DOI: 10.1093/nar/gkg564] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Computational methods such as sequence alignment and motif construction are useful in grouping related proteins into families, as well as helping to annotate new proteins of unknown function. These methods identify conserved amino acids in protein sequences, but cannot determine the specific functional or structural roles of conserved amino acids without additional study. In this work, we present 3MATRIX (http://3matrix.stanford.edu) and 3MOTIF (http://3motif.stanford.edu), a web-based sequence motif visualization system that displays sequence motif information in its appropriate three-dimensional (3D) context. This system is flexible in that users can enter sequences, keywords, structures or sequence motifs to generate visualizations. In 3MOTIF, users can search using discrete sequence motifs such as PROSITE patterns, eMOTIFs, or any other regular expression-like motif. Similarly, 3MATRIX accepts an eMATRIX position-specific scoring matrix, or will convert a multiple sequence alignment block into an eMATRIX for visualization. Each query motif is used to search the protein structure database for matches, in which the motif is then visually highlighted in three dimensions. Important properties of motifs such as sequence conservation and solvent accessible surface area are also displayed in the visualizations, using carefully chosen color shading schemes.
Collapse
Affiliation(s)
- Steven P Bennett
- Department of Biochemistry, Stanford University School of Medicine, Stanford, CA 94305-5307, USA
| | | | | |
Collapse
|
23
|
Pouliot Y, Gao J, Su QJ, Liu GG, Ling XB. DIAN: a novel algorithm for genome ontological classification. Genome Res 2001; 11:1766-79. [PMID: 11591654 PMCID: PMC311153 DOI: 10.1101/gr.183301] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2001] [Accepted: 08/14/2001] [Indexed: 11/24/2022]
Abstract
Faced with the determination of many completely sequenced genomes, computational biology is now faced with the challenge of interpreting the significance of these data sets. A multiplicity of data-related problems impedes this goal: Biological annotations associated with raw data are often not normalized, and the data themselves are often poorly interrelated and their interpretation unclear. All of these problems make interpretation of genomic databases increasingly difficult. With the current explosion of sequences now available from the human genome as well as from model organisms, the importance of sorting this vast amount of conceptually unstructured source data into a limited universe of genes, proteins, functions, structures, and pathways has become a bottleneck for the field. To address this problem, we have developed a method of interrelating data sources by applying a novel method of associating biological objects to ontologies. We have developed an intelligent knowledge-based algorithm, to support biological knowledge mapping, and, in particular, to facilitate the interpretation of genomic data. In this respect, the method makes it possible to inventory genomes by collapsing multiple types of annotations and normalizing them to various ontologies. By relying on a conceptual view of the genome, researchers can now easily navigate the human genome in a biologically intuitive, scientifically accurate manner.
Collapse
Affiliation(s)
- Y Pouliot
- DoubleTwist, Inc., Oakland, California 94612, USA
| | | | | | | | | |
Collapse
|
24
|
Abstract
In its early days, the entire field of computational biology revolved almost entirely around biological sequence analysis. Over the past few years, however, a number of new non-sequence-based areas of investigation have become mainstream, from the analysis of gene expression data from microarrays, to whole-genome association discovery, and to the reverse engineering of gene regulatory pathways. Nonetheless, with the completion of private and public efforts to map the human genome, as well as those of other organisms, sequence data continue to be a veritable mother lode of valuable biological information that can be mined in a variety of contexts. Furthermore, the integration of sequence data with a variety of alternative information is providing valuable and fundamentally new insight into biological processes, as well as an array of new computational methodologies for the analysis of biological data.
Collapse
Affiliation(s)
- A Califano
- First Genetic Trust Inc., 9 Polito Avenue, Suite 930, Lyndhurst, NJ 07071, USA.
| |
Collapse
|
25
|
Harrington JJ, Sherf B, Rundlett S, Jackson PD, Perry R, Cain S, Leventhal C, Thornton M, Ramachandran R, Whittington J, Lerner L, Costanzo D, McElligott K, Boozer S, Mays R, Smith E, Veloso N, Klika A, Hess J, Cothren K, Lo K, Offenbacher J, Danzig J, Ducar M. Creation of genome-wide protein expression libraries using random activation of gene expression. Nat Biotechnol 2001; 19:440-5. [PMID: 11329013 DOI: 10.1038/88107] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]
Abstract
Here we report the use of random activation of gene expression (RAGE) to create genome-wide protein expression libraries. RAGE libraries containing only 5 x 10(6) individual clones were found to express every gene tested, including genes that are normally silent in the parent cell line. Furthermore, endogenous genes were activated at similar frequencies and expressed at similar levels within RAGE libraries created from multiple human cell lines, demonstrating that RAGE libraries are inherently normalized. Pools of RAGE clones were used to isolate 19,547 human gene clusters, approximately 53% of which were novel when tested against public databases of expressed sequence tag (EST) and complementary DNA (cDNA). Isolation of individual clones confirmed that the activated endogenous genes can be expressed at high levels to produce biologically active proteins. The properties of RAGE libraries and RAGE expression clones are well suited for a number of biotechnological applications including gene discovery, protein characterization, drug development, and protein manufacturing.
Collapse
Affiliation(s)
- J J Harrington
- Athersys, Inc., 3201 Carnegie Ave., Cleveland, OH 44115, USA.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
26
|
Abstract
The EMOTIF database is a collection of more than 170 000 highly specific and sensitive protein sequence motifs representing conserved biochemical properties and biological functions. These protein motifs are derived from 7697 sequence alignments in the BLOCKS+ database (released on June 23, 2000) and all 8244 protein sequence alignments in the PRINTS database (version 27.0) using the emotif-maker algorithm developed by Nevill-Manning et al. (Nevill-Manning,C.G., Wu,T.D. and Brutlag,D.L. (1998) Proc. Natl Acad. Sci. USA, 95, 5865-5871; Nevill-Manning,C.G., Sethi,K.S., Wu,T. D. and Brutlag,D.L. (1997) ISMB-97, 5, 202-209). Since the amino acids and the groups of amino acids in these sequence motifs represent critical positions conserved in evolution, search algorithms employing the EMOTIF patterns can identify and classify more widely divergent sequences than methods based on global sequence similarity. The emotif protein pattern database is available at http://motif.stanford.edu/emotif/.
Collapse
Affiliation(s)
- J Y Huang
- Department of Biochemistry, Stanford University, Stanford, CA 94305-5307, USA.
| | | |
Collapse
|