Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wu TD, Nevill-Manning CG, Brutlag DL. Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics 2000;16:233-44. [PMID: 10869016 DOI: 10.1093/bioinformatics/16.3.233] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

For:	Wu TD, Nevill-Manning CG, Brutlag DL. Fast probabilistic analysis of sequence function using scoring matrices. Bioinformatics 2000;16:233-44. [PMID: 10869016 DOI: 10.1093/bioinformatics/16.3.233] [Citation(s) in RCA: 35] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Number

Cited by Other Article(s)

Fostier J. BLAMM: BLAS-based algorithm for finding position weight matrix occurrences in DNA sequences on CPUs and GPUs. BMC Bioinformatics 2020;21:81. [PMID: 32164557 PMCID: PMC7068855 DOI: 10.1186/s12859-020-3348-6] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open

Gao L, Bao W, Zhang H, Yuan CA, Huang DS. Fast sequence analysis based on diamond sampling. PLoS One 2018;13:e0198922. [PMID: 29953448 PMCID: PMC6023231 DOI: 10.1371/journal.pone.0198922] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/27/2018] [Accepted: 05/29/2018] [Indexed: 12/02/2022] Open

Korhonen JH, Palin K, Taipale J, Ukkonen E. Fast motif matching revisited: high-order PWMs, SNPs and indels. Bioinformatics 2017;33:514-521. [PMID: 28011774 DOI: 10.1093/bioinformatics/btw683] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/22/2016] [Accepted: 10/27/2016] [Indexed: 01/09/2023] Open

Jain S, Bader GD. Predicting physiologically relevant SH3 domain mediated protein-protein interactions in yeast. Bioinformatics 2016;32:1865-72. [PMID: 26861823 PMCID: PMC4908317 DOI: 10.1093/bioinformatics/btw045] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/29/2015] [Revised: 12/05/2015] [Accepted: 01/20/2016] [Indexed: 12/02/2022] Open

Abstract

MOTIVATION

Many intracellular signaling processes are mediated by interactions involving peptide recognition modules such as SH3 domains. These domains bind to small, linear protein sequence motifs which can be identified using high-throughput experimental screens such as phage display. Binding motif patterns can then be used to computationally predict protein interactions mediated by these domains. While many protein-protein interaction prediction methods exist, most do not work with peptide recognition module mediated interactions or do not consider many of the known constraints governing physiologically relevant interactions between two proteins.

RESULTS

A novel method for predicting physiologically relevant SH3 domain-peptide mediated protein-protein interactions in S. cerevisae using phage display data is presented. Like some previous similar methods, this method uses position weight matrix models of protein linear motif preference for individual SH3 domains to scan the proteome for potential hits and then filters these hits using a range of evidence sources related to sequence-based and cellular constraints on protein interactions. The novelty of this approach is the large number of evidence sources used and the method of combination of sequence based and protein pair based evidence sources. By combining different peptide and protein features using multiple Bayesian models we are able to predict high confidence interactions with an overall accuracy of 0.97.

AVAILABILITY AND IMPLEMENTATION

Domain-Motif Mediated Interaction Prediction (DoMo-Pred) command line tool and all relevant datasets are available under GNU LGPL license for download from http://www.baderlab.org/Software/DoMo-Pred The DoMo-Pred command line tool is implemented using Python 2.7 and C ++.

CONTACT

gary.bader@utoronto.ca

SUPPLEMENTARY INFORMATION

Supplementary data are available at Bioinformatics online.

Collapse

Pizzi C, Rastas P, Ukkonen E. Finding significant matches of position weight matrices in linear time. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2011;8:69-79. [PMID: 21071798 DOI: 10.1109/tcbb.2009.35] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/30/2023]

Cornman RS. The distribution of GYR- and YLP-like motifs in Drosophila suggests a general role in cuticle assembly and other protein-protein interactions. PLoS One 2010;5. [PMID: 20824096 PMCID: PMC2932725 DOI: 10.1371/journal.pone.0012536] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/31/2010] [Accepted: 08/02/2010] [Indexed: 01/26/2023] Open

Beckstette M, Homann R, Giegerich R, Kurtz S. Significant speedup of database searches with HMMs by search space reduction with PSSM family models. ACTA ACUST UNITED AC 2009;25:3251-8. [PMID: 19828575 PMCID: PMC2788931 DOI: 10.1093/bioinformatics/btp593] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]

Korhonen J, Martinmäki P, Pizzi C, Rastas P, Ukkonen E. MOODS: fast search for position weight matrix matches in DNA sequences. Bioinformatics 2009;25:3181-2. [PMID: 19773334 PMCID: PMC2778336 DOI: 10.1093/bioinformatics/btp554] [Citation(s) in RCA: 106] [Impact Index Per Article: 7.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Pape UJ, Rahmann S, Sun F, Vingron M. Compound poisson approximation of the number of occurrences of a position frequency matrix (PFM) on both strands. J Comput Biol 2008;15:547-64. [PMID: 18631020 DOI: 10.1089/cmb.2007.0084] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Lähdesmäki H, Rust AG, Shmulevich I. Probabilistic inference of transcription factor binding from multiple data sources. PLoS One 2008;3:e1820. [PMID: 18364997 PMCID: PMC2268002 DOI: 10.1371/journal.pone.0001820] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/17/2007] [Accepted: 02/04/2008] [Indexed: 11/21/2022] Open

Abstract

An important problem in molecular biology is to build a complete understanding of transcriptional regulatory processes in the cell. We have developed a flexible, probabilistic framework to predict TF binding from multiple data sources that differs from the standard hypothesis testing (scanning) methods in several ways. Our probabilistic modeling framework estimates the probability of binding and, thus, naturally reflects our degree of belief in binding. Probabilistic modeling also allows for easy and systematic integration of our binding predictions into other probabilistic modeling methods, such as expression-based gene network inference. The method answers the question of whether the whole analyzed promoter has a binding site, but can also be extended to estimate the binding probability at each nucleotide position. Further, we introduce an extension to model combinatorial regulation by several TFs. Most importantly, the proposed methods can make principled probabilistic inference from multiple evidence sources, such as, multiple statistical models (motifs) of the TFs, evolutionary conservation, regulatory potential, CpG islands, nucleosome positioning, DNase hypersensitive sites, ChIP-chip binding segments and other (prior) sequence-based biological knowledge. We developed both a likelihood and a Bayesian method, where the latter is implemented with a Markov chain Monte Carlo algorithm. Results on a carefully constructed test set from the mouse genome demonstrate that principled data fusion can significantly improve the performance of TF binding prediction methods. We also applied the probabilistic modeling framework to all promoters in the mouse genome and the results indicate a sparse connectivity between transcriptional regulators and their target promoters. To facilitate analysis of other sequences and additional data, we have developed an on-line web tool, ProbTF, which implements our probabilistic TF binding prediction method using multiple data sources. Test data set, a web tool, source codes and supplementary data are available at: http://www.probtf.org.

Collapse

Pape UJ, Rahmann S, Vingron M. Natural similarity measures between position frequency matrices with an application to clustering. ACTA ACUST UNITED AC 2008;24:350-7. [PMID: 18174183 DOI: 10.1093/bioinformatics/btm610] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]

Touzet H, Varré JS. Efficient and accurate P-value computation for Position Weight Matrices. Algorithms Mol Biol 2007;2:15. [PMID: 18072973 PMCID: PMC2238751 DOI: 10.1186/1748-7188-2-15] [Citation(s) in RCA: 86] [Impact Index Per Article: 5.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2007] [Accepted: 12/11/2007] [Indexed: 11/10/2022] Open

Koczyk G, Wyrwicz LS, Rychlewski L. LigProf: a simple tool for in silico prediction of ligand-binding sites. J Mol Model 2007;13:445-55. [PMID: 17200839 DOI: 10.1007/s00894-006-0165-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/27/2006] [Accepted: 10/25/2006] [Indexed: 10/23/2022]

Huang X, Brutlag DL. Dynamic use of multiple parameter sets in sequence alignment. Nucleic Acids Res 2006;35:678-86. [PMID: 17182633 PMCID: PMC1802605 DOI: 10.1093/nar/gkl1063] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022] Open

Zhang Y, Zaki MJ. SMOTIF: efficient structured pattern and profile motif search. Algorithms Mol Biol 2006;1:22. [PMID: 17118189 PMCID: PMC1679804 DOI: 10.1186/1748-7188-1-22] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2006] [Accepted: 11/21/2006] [Indexed: 11/29/2022] Open

Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006;7:389. [PMID: 16930469 PMCID: PMC1635428 DOI: 10.1186/1471-2105-7-389] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/20/2006] [Accepted: 08/24/2006] [Indexed: 11/10/2022] Open

Abstract

Background

In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task.

Results

We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.

Conclusion

Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@|^m+ m - 1, where m is the length of the PSSM and A MathType@MTEF@5@5@+=feaafiart1ev1aaatCvAUfKttLearuWrP9MDH5MBPbIqV92AaeXatLxBI9gBamrtHrhAL1wy0L2yHvtyaeHbnfgDOvwBHrxAJfwnaebbnrfifHhDYfgasaacH8akY=wiFfYdH8Gipec8Eeeu0xXdbba9frFj0=OqFfea0dXdd9vqai=hGuQ8kuc9pgc9s8qqaq=dirpe0xb9q8qiLsFr0=vr0=vr0dc8meaabaqaciaacaGaaeqabaWaaeGaeaaakeaaimaacqWFaeFqaaa@3821@ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.

Collapse

Su QJ, Lu L, Saxonov S, Brutlag DL. eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity. Nucleic Acids Res 2005;33:D178-82. [PMID: 15608172 PMCID: PMC540014 DOI: 10.1093/nar/gki060] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/17/2022] Open

Abstract

Classifying proteins into families and superfamilies allows identification of functionally important conserved domains. The motifs and scoring matrices derived from such conserved regions provide computational tools that recognize similar patterns in novel sequences, and thus enable the prediction of protein function for genomes. The eBLOCKs database enumerates a cascade of protein blocks with varied conservation levels for each functional domain. A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found in a larger group of more remotely related proteins with a reduced stringency. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members as in a superfamily. By applying PSI-BLAST and a modified K-means clustering algorithm, eBLOCKs automatically groups protein sequences according to different levels of similarity. Multiple sequence alignments are made and trimmed into a series of ungapped blocks. Motifs and position-specific scoring matrices were derived from eBLOCKs and made available for sequence search and annotation. The eBLOCKs database provides a tool for high-throughput genome annotation with maximal specificity and sensitivity. The eBLOCKs database is freely available on the World Wide Web at http://motif.stanford.edu/eblocks/ to all users for online usage. Academic and not-for-profit institutions wishing copies of the program may contact Douglas L. Brutlag (brutlag@stanford.edu). Commercial firms wishing copies of the program for internal installation may contact Jacqueline Tay at the Stanford Office of Technology Licensing (jacqueline.tay@stanford.edu; http://otl.stanford.edu/).

Collapse

Bejerano G, Friedman N, Tishby N. Efficient exact p-value computation for small sample, sparse, and surprising categorical data. J Comput Biol 2005;11:867-86. [PMID: 15700407 DOI: 10.1089/cmb.2004.11.867] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open

Freschi V, Bogliolo A. Using sequence compression to speedup probabilistic profile matching. Bioinformatics 2005;21:2225-9. [PMID: 15713733 DOI: 10.1093/bioinformatics/bti323] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Barash Y, Elidan G, Kaplan T, Friedman N. CIS: compound importance sampling method for protein-DNA binding site p-value estimation. Bioinformatics 2004;21:596-600. [PMID: 15454407 DOI: 10.1093/bioinformatics/bti041] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open

Qin Z, Shen M, Cohen SN. Identification and characterization of a pSLA2 plasmid locus required for linear DNA replication and circular plasmid stable inheritance in Streptomyces lividans. J Bacteriol 2003;185:6575-82. [PMID: 14594830 PMCID: PMC262113 DOI: 10.1128/jb.185.22.6575-6582.2003] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/20/2022] Open

Bennett SP, Lu L, Brutlag DL. 3MATRIX and 3MOTIF: a protein structure visualization system for conserved sequence motifs. Nucleic Acids Res 2003;31:3328-32. [PMID: 12824319 PMCID: PMC168971 DOI: 10.1093/nar/gkg564] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Pouliot Y, Gao J, Su QJ, Liu GG, Ling XB. DIAN: a novel algorithm for genome ontological classification. Genome Res 2001;11:1766-79. [PMID: 11591654 PMCID: PMC311153 DOI: 10.1101/gr.183301] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/07/2001] [Accepted: 08/14/2001] [Indexed: 11/24/2022]

Califano A. Advances in sequence analysis. Curr Opin Struct Biol 2001;11:330-3. [PMID: 11406383 DOI: 10.1016/s0959-440x(00)00210-4] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/21/2022]

Harrington JJ, Sherf B, Rundlett S, Jackson PD, Perry R, Cain S, Leventhal C, Thornton M, Ramachandran R, Whittington J, Lerner L, Costanzo D, McElligott K, Boozer S, Mays R, Smith E, Veloso N, Klika A, Hess J, Cothren K, Lo K, Offenbacher J, Danzig J, Ducar M. Creation of genome-wide protein expression libraries using random activation of gene expression. Nat Biotechnol 2001;19:440-5. [PMID: 11329013 DOI: 10.1038/88107] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022]

Huang JY, Brutlag DL. The EMOTIF database. Nucleic Acids Res 2001;29:202-4. [PMID: 11125091 PMCID: PMC29837 DOI: 10.1093/nar/29.1.202] [Citation(s) in RCA: 100] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open