Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Wu TD, Nevill-Manning CG, Brutlag DL. Minimal-risk scoring matrices for sequence analysis. J Comput Biol 1999;6:219-35. [PMID: 10421524 DOI: 10.1089/cmb.1999.6.219] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.4] [Indexed: 11/12/2022]

Cited by Other Article(s)

Huang X, Brutlag DL. Dynamic use of multiple parameter sets in sequence alignment. Nucleic Acids Res 2006;35:678-86. [PMID: 17182633 PMCID: PMC1802605 DOI: 10.1093/nar/gkl1063] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.1] [Indexed: 11/29/2022]

Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006;7:389. [PMID: 16930469 PMCID: PMC1635428 DOI: 10.1186/1471-2105-7-389] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Received: 04/20/2006] [Accepted: 08/24/2006] [Indexed: 11/10/2022]

Abstract

Background

In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task.
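To make the cost of this task concrete, here is a minimal sketch of the straightforward way to match a PSSM: slide a window over the sequence and sum the per-position scores. The function name `scan_pssm`, the toy matrix, and the threshold are assumptions chosen for illustration; this is not the algorithm described in the abstract, only the baseline whose cost grows with sequence length times motif length.

```python
from typing import Dict, List, Tuple

# Hypothetical toy representation: one dict of per-symbol integer scores
# for each motif position.
PSSM = List[Dict[str, int]]


def scan_pssm(pssm: PSSM, sequence: str, threshold: int) -> List[Tuple[int, int]]:
    """Return (start, score) for every window scoring at least `threshold`."""
    m = len(pssm)
    hits = []
    for start in range(len(sequence) - m + 1):
        # Score the window by summing one matrix entry per motif position.
        score = sum(pssm[i][sequence[start + i]] for i in range(m))
        if score >= threshold:
            hits.append((start, score))
    return hits


# Illustrative matrix that rewards the motif "ACG".
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
hits = scan_pssm(toy_pssm, "TTACGACGT", threshold=6)
print(hits)
```

Every window is scored in full, so the work is proportional to the sequence length times the motif length, which is what makes whole-genome searches expensive.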

Results

We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.
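The threshold computation mentioned above can be sketched with a plain dynamic program over partial scores. The sketch assumes integer scores and an independent background distribution, and it evaluates the whole score distribution eagerly; the lazy evaluation described in the abstract is not reproduced here. The names `score_distribution` and `threshold_for_pvalue` and the toy matrix are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Optional

PSSM = List[Dict[str, int]]


def score_distribution(pssm: PSSM, background: Dict[str, float]) -> Dict[int, float]:
    """Probability of each total score for a random window under `background`."""
    dist = {0: 1.0}  # before any column is read, the score is 0 with certainty
    for column in pssm:
        nxt = defaultdict(float)
        for partial, prob in dist.items():
            for symbol, s in column.items():
                nxt[partial + s] += prob * background[symbol]
        dist = dict(nxt)
    return dist


def threshold_for_pvalue(
    pssm: PSSM, background: Dict[str, float], p_value: float
) -> Optional[int]:
    """Smallest score t with P(score >= t) <= p_value, or None if none exists."""
    dist = score_distribution(pssm, background)
    tail, best = 0.0, None
    for score in sorted(dist, reverse=True):
        tail += dist[score]
        if tail > p_value:
            break
        best = score
    return best


# Illustrative matrix and uniform nucleotide background (assumptions).
toy_pssm = [
    {"A": 2, "C": -1, "G": -1, "T": -1},
    {"A": -1, "C": 2, "G": -1, "T": -1},
    {"A": -1, "C": -1, "G": 2, "T": -1},
]
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}
print(threshold_for_pvalue(toy_pssm, uniform, p_value=0.02))
```

Working with a probability threshold rather than a raw score is what makes matches from different matrices comparable, since each matrix has its own score range.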

Conclusion

Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than |A|^m + m - 1, where m is the length of the PSSM and A is a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.

Sharma A, Chavali S, Mahajan A, Tabassum R, Banerjee V, Tandon N, Bharadwaj D. Genetic Association, Post-translational Modification, and Protein-Protein Interactions in Type 2 Diabetes Mellitus. Mol Cell Proteomics 2005;4:1029-37. [PMID: 15886397 DOI: 10.1074/mcp.m500024-mcp200] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/06/2022]

Abstract

Type 2 diabetes mellitus is a complex disorder with a strong genetic component. Inherited complex disease susceptibility in humans is most commonly associated with single nucleotide polymorphisms. The mechanisms by which this occurs are still poorly understood. Here we focus on analyzing the effect of a set of disease-causing missense variations of the monogenetic form of Type 2 diabetes mellitus and a set of disease-associated nonsynonymous variations in comparison with that of nonsynonymous variations without any experimental evidence for association with any disease. Analysis of different properties such as evolutionary conservation status, solvent accessibility, secondary structure, etc. suggests that disease-causing variations are associated with extreme changes in the value of the parameters relating to evolutionary conservation and/or protein stability. Disease-associated variations are rather moderately conserved and have a milder effect on protein function and stability. The majority of the genes harboring these variations are clustered in or near the insulin signaling network. Most of these variations are identified as potential sites for post-translational modifications; certain predictions have already reported experimental evidence. Overall our results indicate that Type 2 diabetes mellitus may result from a large number of single nucleotide polymorphisms that impair modular domain function and post-translational modifications involved in signaling. Our emphasis is more on conserved corresponding residues than the variation alone. We believe that the approach of considering a stretch of peptide sequence involving a polymorphism would be a better method of defining the role of the polymorphism in the manifestation of this disease. 
Because most of the variations associated with the disease are rare, we hypothesize that this disease is a "mosaic model" of interaction between a large number of rare alleles and a small number of common alleles along with the environment, which is somewhat contrary to the existing common disease common variant model.

Su QJ, Lu L, Saxonov S, Brutlag DL. eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity. Nucleic Acids Res 2005;33:D178-82. [PMID: 15608172 PMCID: PMC540014 DOI: 10.1093/nar/gki060] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/17/2022]

Abstract

Classifying proteins into families and superfamilies allows identification of functionally important conserved domains. The motifs and scoring matrices derived from such conserved regions provide computational tools that recognize similar patterns in novel sequences, and thus enable the prediction of protein function for genomes. The eBLOCKs database enumerates a cascade of protein blocks with varied conservation levels for each functional domain. A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found in a larger group of more remotely related proteins with a reduced stringency. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members as in a superfamily. By applying PSI-BLAST and a modified K-means clustering algorithm, eBLOCKs automatically groups protein sequences according to different levels of similarity. Multiple sequence alignments are made and trimmed into a series of ungapped blocks. Motifs and position-specific scoring matrices were derived from eBLOCKs and made available for sequence search and annotation. The eBLOCKs database provides a tool for high-throughput genome annotation with maximal specificity and sensitivity. The eBLOCKs database is freely available on the World Wide Web at http://motif.stanford.edu/eblocks/ to all users for online usage. Academic and not-for-profit institutions wishing copies of the program may contact Douglas L. Brutlag (brutlag@stanford.edu). Commercial firms wishing copies of the program for internal installation may contact Jacqueline Tay at the Stanford Office of Technology Licensing (jacqueline.tay@stanford.edu; http://otl.stanford.edu/).

Freschi V, Bogliolo A. Using sequence compression to speedup probabilistic profile matching. Bioinformatics 2005;21:2225-9. [PMID: 15713733 DOI: 10.1093/bioinformatics/bti323] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]

Qin Z, Shen M, Cohen SN. Identification and characterization of a pSLA2 plasmid locus required for linear DNA replication and circular plasmid stable inheritance in Streptomyces lividans. J Bacteriol 2003;185:6575-82. [PMID: 14594830 PMCID: PMC262113 DOI: 10.1128/jb.185.22.6575-6582.2003] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]

Turchin A, Kohane IS. Gene homology resources on the World Wide Web. Physiol Genomics 2002;11:165-77. [PMID: 12464690 DOI: 10.1152/physiolgenomics.00112.2002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/22/2022]

Claverie JM, Abergel C, Audic S, Ogata H. Recent advances in computational genomics. Pharmacogenomics 2001;2:361-72. [PMID: 11722286 DOI: 10.1517/14622416.2.4.361] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.2] [Indexed: 11/05/2022]

Harrington JJ, Sherf B, Rundlett S, Jackson PD, Perry R, Cain S, Leventhal C, Thornton M, Ramachandran R, Whittington J, Lerner L, Costanzo D, McElligott K, Boozer S, Mays R, Smith E, Veloso N, Klika A, Hess J, Cothren K, Lo K, Offenbacher J, Danzig J, Ducar M. Creation of genome-wide protein expression libraries using random activation of gene expression. Nat Biotechnol 2001;19:440-5. [PMID: 11329013 DOI: 10.1038/88107] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]

von Ohsen N, Zimmer R. Improving Profile-Profile Alignments via Log Average Scoring. Lecture Notes in Computer Science 2001. [DOI: 10.1007/3-540-44696-6_2] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 01/26/2023]