1
|
Abstract
The level of conservation between two homologous sequences often varies among sequence regions; functionally important domains are more conserved than the remaining regions. Thus, multiple parameter sets should be used in alignment of homologous sequences with a stringent parameter set for highly conserved regions and a moderate parameter set for weakly conserved regions. We describe an alignment algorithm to allow dynamic use of multiple parameter sets with different levels of stringency in computation of an optimal alignment of two sequences. The algorithm dynamically considers various candidate alignments, partitions each candidate alignment into sections, and determines the most appropriate set of parameter values for each section of the alignment. The algorithm and its local alignment version are implemented in a computer program named GAP4. The local alignment algorithm in GAP4, that in its predecessor GAP3, and an ordinary local alignment program SIM were evaluated on 257 716 pairs of homologous sequences from 100 protein families. On 168 475 of the 257 716 pairs (a rate of 65.4%), alignments from GAP4 were more statistically significant than alignments from GAP3 and SIM.
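The effect of parameter stringency described in this abstract can be illustrated with an ordinary local alignment scorer. The sketch below is not GAP4 and does not partition an alignment into sections; it is a plain Smith-Waterman score with a linear gap penalty, run once with a stringent and once with a moderate parameter set. The function name and the parameter values are assumptions chosen for illustration.

```python
def local_alignment_score(a, b, match, mismatch, gap):
    """Best Smith-Waterman local alignment score with a linear gap penalty.

    Illustrative only: a single fixed parameter set is used for the whole
    alignment, unlike the multi-parameter-set approach the abstract describes.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, hence the floor at 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# Hypothetical parameter sets: (match, mismatch, gap).
stringent = (1, -3, -4)
moderate = (2, -1, -2)
print(local_alignment_score("ACGTACGT", "ACGTTCGT", *stringent))
print(local_alignment_score("ACGTACGT", "ACGTTCGT", *moderate))
```

Under the stringent set the best local alignment stops at the single mismatch, whereas the moderate set extends across it, which is the kind of trade-off that motivates choosing parameters per region.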
Affiliation(s)
- Xiaoqiu Huang
- Department of Computer Science, Iowa State University, Ames, IA 50011-1040, USA.
|
2
|
Beckstette M, Homann R, Giegerich R, Kurtz S. Fast index based algorithms and software for matching position specific scoring matrices. BMC Bioinformatics 2006; 7:389. [PMID: 16930469 PMCID: PMC1635428 DOI: 10.1186/1471-2105-7-389] [Citation(s) in RCA: 102] [Impact Index Per Article: 5.7] [Received: 04/20/2006] [Accepted: 08/24/2006] [Indexed: 11/10/2022]
Abstract
Background: In biological sequence analysis, position specific scoring matrices (PSSMs) are widely used to represent sequence motifs in nucleotide as well as amino acid sequences. Searching with PSSMs in complete genomes or large sequence databases is a common, but computationally expensive task.
Results: We present a new non-heuristic algorithm, called ESAsearch, to efficiently find matches of PSSMs in large databases. Our approach preprocesses the search space, e.g., a complete genome or a set of protein sequences, and builds an enhanced suffix array that is stored on file. This allows the searching of a database with a PSSM in sublinear expected time. Since ESAsearch benefits from small alphabets, we present a variant operating on sequences recoded according to a reduced alphabet. We also address the problem of non-comparable PSSM-scores by developing a method which allows the efficient computation of a matrix similarity threshold for a PSSM, given an E-value or a p-value. Our method is based on dynamic programming and, in contrast to other methods, it employs lazy evaluation of the dynamic programming matrix. We evaluated algorithm ESAsearch with nucleotide PSSMs and with amino acid PSSMs. Compared to the best previous methods, ESAsearch shows speedups of a factor between 17 and 275 for nucleotide PSSMs, and speedups up to factor 1.8 for amino acid PSSMs. Comparisons with the most widely used programs even show speedups by a factor of at least 3.8. Alphabet reduction yields an additional speedup factor of 2 on amino acid sequences compared to results achieved with the 20 symbol standard alphabet. The lazy evaluation method is also much faster than previous methods, with speedups of a factor between 3 and 330.
Conclusion: Our analysis of ESAsearch reveals sublinear runtime in the expected case, and linear runtime in the worst case for sequences not shorter than $|\mathcal{A}|^m + m - 1$, where $m$ is the length of the PSSM and $\mathcal{A}$ a finite alphabet. In practice, ESAsearch shows superior performance over the most widely used programs, especially for DNA sequences. The new algorithm for accurate on-the-fly calculations of thresholds has the potential to replace formerly used approximation approaches. Beyond the algorithmic contributions, we provide a robust, well documented, and easy to use software package, implementing the ideas and algorithms presented in this manuscript.
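To make the idea of deriving a score threshold from a p-value concrete, here is a minimal sketch. It assumes integer PSSM scores and an independent background distribution, and it uses a plain, fully evaluated dynamic program followed by a linear scan; it is not the lazy evaluation method or the enhanced suffix array search of ESAsearch. All function names are hypothetical.

```python
from collections import defaultdict


def score_distribution(pssm, background):
    """Exact distribution of the total PSSM score of a random word.

    pssm: list of dicts mapping symbol -> integer score (one dict per column).
    background: dict mapping symbol -> probability (assumed independent).
    """
    dist = {0: 1.0}
    for column in pssm:
        nxt = defaultdict(float)
        for score, prob in dist.items():
            for symbol, p in background.items():
                nxt[score + column[symbol]] += prob * p
        dist = dict(nxt)
    return dist


def threshold_for_pvalue(pssm, background, pvalue):
    """Smallest threshold t such that P(score >= t) <= pvalue."""
    dist = score_distribution(pssm, background)
    tail = 0.0
    threshold = max(dist) + 1  # unreachable score: no word matches
    for score in sorted(dist, reverse=True):
        if tail + dist[score] > pvalue:
            break
        tail += dist[score]
        threshold = score
    return threshold


def scan(sequence, pssm, threshold):
    """Brute-force scan: (position, score) for every window at or above threshold."""
    m = len(pssm)
    hits = []
    for i in range(len(sequence) - m + 1):
        s = sum(pssm[j][sequence[i + j]] for j in range(m))
        if s >= threshold:
            hits.append((i, s))
    return hits


# Toy two-column PSSM over a uniform DNA background (illustrative values).
toy_pssm = [{"A": 2, "C": 0, "G": 0, "T": -1}, {"A": 0, "C": 3, "G": 0, "T": 0}]
uniform = {s: 0.25 for s in "ACGT"}
t = threshold_for_pvalue(toy_pssm, uniform, 0.2)
print(t, scan("TACGA", toy_pssm, t))
```

The scan here takes time proportional to the sequence length times the PSSM length; the index-based search described in the abstract is what avoids visiting every window.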
Affiliation(s)
- Michael Beckstette
- International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBITec), Bielefeld University, D-33594 Bielefeld, Germany
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
- Robert Homann
- International NRW Graduate School in Bioinformatics and Genome Research, Center for Biotechnology (CeBITec), Bielefeld University, D-33594 Bielefeld, Germany
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
- Robert Giegerich
- Technische Fakultät, Universität Bielefeld, Postfach 100 131, D-33501 Bielefeld, Germany
- Stefan Kurtz
- Zentrum für Bioinformatik, Universität Hamburg, 20146 Hamburg, Germany
|
3
|
Sharma A, Chavali S, Mahajan A, Tabassum R, Banerjee V, Tandon N, Bharadwaj D. Genetic Association, Post-translational Modification, and Protein-Protein Interactions in Type 2 Diabetes Mellitus. Mol Cell Proteomics 2005; 4:1029-37. [PMID: 15886397 DOI: 10.1074/mcp.m500024-mcp200] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Indexed: 11/06/2022]
Abstract
Type 2 diabetes mellitus is a complex disorder with a strong genetic component. Inherited complex disease susceptibility in humans is most commonly associated with single nucleotide polymorphisms. The mechanisms by which this occurs are still poorly understood. Here we focus on analyzing the effect of a set of disease-causing missense variations of the monogenetic form of Type 2 diabetes mellitus and a set of disease-associated nonsynonymous variations in comparison with that of nonsynonymous variations without any experimental evidence for association with any disease. Analysis of different properties such as evolutionary conservation status, solvent accessibility, secondary structure, etc. suggests that disease-causing variations are associated with extreme changes in the value of the parameters relating to evolutionary conservation and/or protein stability. Disease-associated variations are rather moderately conserved and have a milder effect on protein function and stability. The majority of the genes harboring these variations are clustered in or near the insulin signaling network. Most of these variations are identified as potential sites for post-translational modifications; certain predictions have already reported experimental evidence. Overall our results indicate that Type 2 diabetes mellitus may result from a large number of single nucleotide polymorphisms that impair modular domain function and post-translational modifications involved in signaling. Our emphasis is more on conserved corresponding residues than the variation alone. We believe that the approach of considering a stretch of peptide sequence involving a polymorphism would be a better method of defining the role of the polymorphism in the manifestation of this disease. 
Because most of the variations associated with the disease are rare, we hypothesize that this disease follows a "mosaic model" of interaction between a large number of rare alleles and a small number of common alleles, along with the environment, which runs somewhat contrary to the existing common disease/common variant model.
Affiliation(s)
- Amitabh Sharma
- Functional Genomics Unit, Institute of Genomics and Integrative Biology, Council of Scientific and Industrial Research (CSIR), Delhi 110 007
|
4
|
Su QJ, Lu L, Saxonov S, Brutlag DL. eBLOCKs: enumerating conserved protein blocks to achieve maximal sensitivity and specificity. Nucleic Acids Res 2005; 33:D178-82. [PMID: 15608172 PMCID: PMC540014 DOI: 10.1093/nar/gki060] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.1] [Indexed: 11/17/2022]
Abstract
Classifying proteins into families and superfamilies allows identification of functionally important conserved domains. The motifs and scoring matrices derived from such conserved regions provide computational tools that recognize similar patterns in novel sequences, and thus enable the prediction of protein function for genomes. The eBLOCKs database enumerates a cascade of protein blocks with varied conservation levels for each functional domain. A biologically important region is most stringently conserved among a smaller family of highly similar proteins. The same region is often found in a larger group of more remotely related proteins with a reduced stringency. Through enumeration, highly specific signatures can be generated from blocks with more columns and fewer family members, while highly sensitive signatures can be derived from blocks with fewer columns and more members as in a superfamily. By applying PSI-BLAST and a modified K-means clustering algorithm, eBLOCKs automatically groups protein sequences according to different levels of similarity. Multiple sequence alignments are made and trimmed into a series of ungapped blocks. Motifs and position-specific scoring matrices were derived from eBLOCKs and made available for sequence search and annotation. The eBLOCKs database provides a tool for high-throughput genome annotation with maximal specificity and sensitivity. The eBLOCKs database is freely available on the World Wide Web at http://motif.stanford.edu/eblocks/ to all users for online usage. Academic and not-for-profit institutions wishing copies of the program may contact Douglas L. Brutlag (brutlag@stanford.edu). Commercial firms wishing copies of the program for internal installation may contact Jacqueline Tay at the Stanford Office of Technology Licensing (jacqueline.tay@stanford.edu; http://otl.stanford.edu/).
|
5
|
Abstract
MOTIVATION: Matching a biological sequence against a probabilistic pattern (or profile) is a common task in computational biology. A probabilistic profile, represented as a scoring matrix, is more suitable than a deterministic pattern to retain the peculiarities of a given segment of a family of biological sequences. Brute-force algorithms take O(NP) time to match a sequence of N characters against a profile of length P << N. RESULTS: In this work, we exploit string compression techniques to speed up brute-force profile matching. We present two algorithms, based on run-length and LZ78 encodings, that reduce computational complexity by the compression factor of the encoding.
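The following sketch contrasts O(NP) brute-force profile matching with one simple way that run-length structure can be exploited. It is an illustration under stated assumptions, not the algorithm from the paper: it combines per-symbol prefix sums over the profile columns with a run-length encoding of the sequence, so each window costs time proportional to the number of runs it overlaps rather than to P. The function names and the toy profile are hypothetical.

```python
from itertools import groupby


def brute_force_scores(seq, profile):
    """Score of every length-P window; O(N * P) time."""
    P = len(profile)
    return [sum(profile[j][seq[i + j]] for j in range(P))
            for i in range(len(seq) - P + 1)]


def run_length_scores(seq, profile, alphabet):
    """Same scores, computed run by run instead of character by character."""
    P = len(profile)
    # prefix[c][j] = sum of profile[k][c] for k < j
    prefix = {c: [0] * (P + 1) for c in alphabet}
    for c in alphabet:
        for j in range(P):
            prefix[c][j + 1] = prefix[c][j] + profile[j][c]
    # Run-length encode the sequence as (symbol, start, end), end exclusive.
    runs, pos = [], 0
    for symbol, group in groupby(seq):
        length = sum(1 for _ in group)
        runs.append((symbol, pos, pos + length))
        pos += length
    scores = []
    first = 0  # index of the first run overlapping the current window
    for i in range(len(seq) - P + 1):
        while runs[first][2] <= i:
            first += 1
        total, k = 0, first
        while k < len(runs) and runs[k][1] < i + P:
            symbol, start, end = runs[k]
            lo = max(start, i) - i        # first profile column covered by this run
            hi = min(end, i + P) - i      # one past the last covered column
            total += prefix[symbol][hi] - prefix[symbol][lo]
            k += 1
        scores.append(total)
    return scores


# Toy integer profile of length 3 (illustrative values).
toy_profile = [
    {"A": 1, "C": -1, "G": 0, "T": 2},
    {"A": 0, "C": 3, "G": -2, "T": 1},
    {"A": 2, "C": 0, "G": 1, "T": -1},
]
print(run_length_scores("AAAACCCCGGTT", toy_profile, "ACGT"))
```

The benefit grows with the compressibility of the sequence: on a sequence with long runs most windows overlap only one or two runs, while on a sequence with no repeated adjacent characters the run-based version does no better than brute force.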
Affiliation(s)
- Valerio Freschi
- Information Science and Technology Institute, University of Urbino, 61029 Urbino, Italy
|
6
|
Qin Z, Shen M, Cohen SN. Identification and characterization of a pSLA2 plasmid locus required for linear DNA replication and circular plasmid stable inheritance in Streptomyces lividans. J Bacteriol 2003; 185:6575-82. [PMID: 14594830 PMCID: PMC262113 DOI: 10.1128/jb.185.22.6575-6582.2003] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.3] [Indexed: 11/20/2022]
Abstract
Streptomyces linear plasmids and linear chromosomes can replicate also in a circular form when their telomeres are deleted. The 17-kb linear plasmid pSLA2 has been a useful model in studies of such replicons. Here we report that the minimal origin initiating replication of pSLA2-derived plasmids as circular molecules cannot propagate these plasmids in a linear mode unless they also contain a novel plasmid-encoded locus, here named rlrA (required for linear replication). In contrast with the need for rlrA to accomplish replication of telomere-containing linear plasmids, expression of rlrA, which encodes two LuxR family regulatory domains, interferes with the establishment of pSLA2 in circular form in Streptomyces lividans transformants. The additional presence of an adjacent divergently transcribed locus, rorA (rlrA override), which strongly resembles the kor (kil override) transcription control genes identified previously on Streptomyces plasmids, reversed the detrimental effects of rlrA on plasmid establishment and additionally stabilized circular plasmid inheritance by spores during the S. lividans life cycle. While the effects of the rlrA/rorA locus of pSLA2 were seen also on linear plasmids derived from the unrelated SLP2 replicon, they did not extend to plasmids whose replication was initiated at a cloned chromosomal origin. Our results establish the existence of, and provide the initial description of, a novel plasmid-borne regulatory system that differentially affects the propagation of linear and circular plasmids in Streptomyces.
Affiliation(s)
- Zhongjun Qin
- Department of Genetics, Stanford University School of Medicine, Stanford, California 94305-5120, USA
|
7
|
Abstract
As the amount of information available to biologists increases exponentially, data analysis becomes progressively more challenging. Sequence homology has been a traditional tool in the researchers' armamentarium; it is a very versatile instrument and can be employed to assist in numerous tasks, from establishing the function of a gene to determining the evolutionary development of an organism. Consequently, numerous specialized tools have been established in the public domain (most commonly, the World Wide Web) to help investigators use sequence homology in their research. These homology databases differ both in the techniques they use to compare sequences and in the size of the unit of analysis, which can be the whole gene, a domain, or a motif. In this paper, we aim to present a systematic review of the inner details of the most commonly used databases as well as to offer guidelines for their use.
Affiliation(s)
- Alexander Turchin
- Department of Medicine, New England Medical Center, Boston 02111, USA.
|
8
|
Abstract
In the post-genomic era, the new discipline of functional genomics is now facing the challenge of associating a function (as well as estimating its relevance to industrial applications) to about 100,000 microbial, plant or animal genes of known sequence but unknown function. Besides the design of databases, computational methods are increasingly becoming intimately linked with the various experimental approaches. Consequently, bioinformatics is rapidly evolving into independent fields addressing the specific problems of interpreting i) genomic sequences, ii) protein sequences and 3D-structures, as well as iii) transcriptome and macromolecular interaction data. It is thus increasingly difficult for the biologist to choose the computational approaches that perform best in these various areas. This paper attempts to review the most useful developments of the last 2 years.
Affiliation(s)
- J M Claverie
- Structural and Genetic Information Laboratory, UMR 1889 CNRS-AVENTIS, 31 Chemin Joseph Aiguier, 13402 Marseille Cedex 20, France.
|
9
|
Harrington JJ, Sherf B, Rundlett S, Jackson PD, Perry R, Cain S, Leventhal C, Thornton M, Ramachandran R, Whittington J, Lerner L, Costanzo D, McElligott K, Boozer S, Mays R, Smith E, Veloso N, Klika A, Hess J, Cothren K, Lo K, Offenbacher J, Danzig J, Ducar M. Creation of genome-wide protein expression libraries using random activation of gene expression. Nat Biotechnol 2001; 19:440-5. [PMID: 11329013 DOI: 10.1038/88107] [Citation(s) in RCA: 57] [Impact Index Per Article: 2.5] [Indexed: 11/08/2022]
Abstract
Here we report the use of random activation of gene expression (RAGE) to create genome-wide protein expression libraries. RAGE libraries containing only 5 × 10^6 individual clones were found to express every gene tested, including genes that are normally silent in the parent cell line. Furthermore, endogenous genes were activated at similar frequencies and expressed at similar levels within RAGE libraries created from multiple human cell lines, demonstrating that RAGE libraries are inherently normalized. Pools of RAGE clones were used to isolate 19,547 human gene clusters, approximately 53% of which were novel when tested against public databases of expressed sequence tag (EST) and complementary DNA (cDNA). Isolation of individual clones confirmed that the activated endogenous genes can be expressed at high levels to produce biologically active proteins. The properties of RAGE libraries and RAGE expression clones are well suited for a number of biotechnological applications including gene discovery, protein characterization, drug development, and protein manufacturing.
Affiliation(s)
- J J Harrington
- Athersys, Inc., 3201 Carnegie Ave., Cleveland, OH 44115, USA.
|
10
|
von Ohsen N, Zimmer R. Improving Profile-Profile Alignments via Log Average Scoring. Lecture Notes in Computer Science 2001. [DOI: 10.1007/3-540-44696-6_2] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.4] [Indexed: 01/26/2023]
|