1
Cunial F, Apostolico A. Phylogeny Construction with Rigid Gapped Motifs. J Comput Biol 2012; 19:911-27. [DOI: 10.1089/cmb.2012.0060] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Affiliation(s)
- Fabio Cunial
- School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
- Alberto Apostolico
- School of Computational Science and Engineering, College of Computing, Georgia Institute of Technology, Atlanta, Georgia
2
Jackups R, Liang J. Combinatorial analysis for sequence and spatial motif discovery in short sequence fragments. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010; 7:524-536. [PMID: 20671322 PMCID: PMC3417775 DOI: 10.1109/tcbb.2008.101] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Indexed: 05/29/2023]
Abstract
Motifs are overrepresented sequence or spatial patterns appearing in proteins. They often play important roles in maintaining protein stability and in facilitating protein function. When motifs are located in short sequence fragments, as in transmembrane domains that are only 6-20 residues in length, and when there is only very limited data, it is difficult to identify motifs. In this study, we introduce combinatorial models based on permutation for assessing statistically significant sequence and spatial patterns in short sequences. We show that our method can uncover previously unknown sequence and spatial motifs in beta-barrel membrane proteins and that our method outperforms existing methods in detecting statistically significant motifs in this data set. Last, we discuss implications of motif analysis for problems involving short sequences in other families of proteins.
Affiliation(s)
- Ronald Jackups
- Department of Pathology and Immunology, Washington University School of Medicine, St. Louis, MO 63110, USA.
3
Jackups R, Liang J. Combinatorial model for sequence and spatial motif discovery in short sequence fragments: examples from beta-barrel membrane proteins. Conference Proceedings: ... Annual International Conference of the IEEE Engineering in Medicine and Biology Society. IEEE Engineering in Medicine and Biology Society. Annual Conference 2008; 2006:3470-3. [PMID: 17947032 DOI: 10.1109/iembs.2006.259727] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 11/10/2022]
Abstract
Motifs are over-represented sequence or spatial patterns appearing in proteins. They often play important roles in maintaining protein stability and in facilitating protein functions. When motifs are located in short sequence fragments, as in transmembrane domains that are only 10-20 residues in length, and when there is only very limited data, it is difficult to identify motifs. In this study, we develop combinatorial models for assessing statistically significant sequence and spatial patterns. We show our method can uncover previously unknown sequence and spatial motifs in beta-barrel membrane proteins.
Affiliation(s)
- Ronald Jackups
- Dept. of Bioeng., Illinois Univ., Chicago, IL 60607-7052, USA
4
Wei W, Yu XD. Comparative analysis of regulatory motif discovery tools for transcription factor binding sites. Genomics Proteomics & Bioinformatics 2007; 5:131-42. [PMID: 17893078 PMCID: PMC5054109 DOI: 10.1016/s1672-0229(07)60023-0] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 01/19/2023]
Abstract
In the post-genomic era, identifying specific regulatory motifs or transcription factor binding sites (TFBSs) in non-coding DNA sequences, which is essential for elucidating transcriptional regulatory networks, has emerged as an obstacle that frustrates many researchers. Consequently, numerous motif discovery tools and related databases have been applied to this problem. However, these existing methods, based on different computational algorithms, show varying motif prediction efficiency in non-coding DNA sequences. Understanding the similarities and differences among the algorithms, and enriching the motif discovery literature, is therefore important for users who must choose the most appropriate of the tools available online. Moreover, credible criteria for assessing motif discovery tools, and guidance for researchers on choosing the best tool for their own projects, are still lacking. Integrating the related resources might thus be a good way to improve the accuracy of the application. Recent studies integrate regulatory motif discovery tools with experimental methods to offer a complementary approach for researchers, and also provide a much-needed model for current research on transcriptional regulatory networks. Here we present a comparative analysis of regulatory motif discovery tools for TFBSs.
5
Abstract
MOTIVATION: Profile HMMs are a powerful tool for modeling conserved motifs in proteins. These models are widely used by search tools to classify new protein sequences into families based on domain architecture. However, the proliferation of known motifs and new proteomic sequence data poses a computational challenge for search, requiring days of CPU time to annotate an organism's proteome. RESULTS: We use PROSITE-like patterns as a filter to speed up the comparison between protein sequence and profile HMM. A set of patterns is designed starting from the HMM, and only sequences matching one of these patterns are compared to the HMM by full dynamic programming. We give an algorithm to design patterns with maximal sensitivity subject to a bound on the false positive rate. Experiments show that our patterns typically retain at least 90% of the sensitivity of the source HMM while accelerating search by an order of magnitude. AVAILABILITY: Contact the first author at the address below.
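The filtering step described in this abstract can be illustrated with a minimal sketch. It covers only the idea of screening sequences with PROSITE-like patterns before a more expensive comparison; it does not reproduce the authors' algorithm for designing patterns from an HMM. The helper names `prosite_to_regex` and `prefilter`, the supported subset of pattern syntax, and the example pattern are assumptions made for illustration.

```python
import re

# One pattern element: x, a residue, [allowed set], or {forbidden set},
# optionally followed by a repeat count (n) or range (n,m).
_ELEMENT = re.compile(r"^(x|[A-Z]|\[[A-Z]+\]|\{[A-Z]+\})(?:\((\d+)(?:,(\d+))?\))?$")


def prosite_to_regex(pattern):
    """Translate a simple PROSITE-like pattern into a compiled regex.

    Anchors and other extensions of the syntax are not handled here.
    """
    parts = []
    for element in pattern.strip().rstrip(".").split("-"):
        match = _ELEMENT.match(element)
        if match is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = match.groups()
        if core == "x":
            core = "."                       # any residue
        elif core.startswith("{"):
            core = "[^" + core[1:-1] + "]"   # any residue except these
        if low is not None:
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return re.compile("".join(parts))


def prefilter(sequences, patterns):
    """Keep only sequences that match at least one pattern.

    Only the survivors would be passed on to the expensive comparison
    (full dynamic programming against the HMM in the abstract's setting).
    """
    compiled = [prosite_to_regex(p) for p in patterns]
    return [s for s in sequences if any(r.search(s) for r in compiled)]
```

For example, `prefilter(["AACAACAAAL", "AAAAAA"], ["C-x(2,4)-C-x(3)-[LIVM]"])` keeps only the first sequence, so only that one would go on to full scoring.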
Affiliation(s)
- Yanni Sun
- Department of Computer Science and Engineering, Washington University, St Louis, MO 63130, USA.
6
Via A, Gherardini PF, Ferraro E, Ausiello G, Scalia Tomba G, Helmer-Citterich M. False occurrences of functional motifs in protein sequences highlight evolutionary constraints. BMC Bioinformatics 2007; 8:68. [PMID: 17331242 PMCID: PMC1821045 DOI: 10.1186/1471-2105-8-68] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 10/23/2006] [Accepted: 03/01/2007] [Indexed: 01/28/2023] Open
Abstract
Background: False occurrences of functional motifs in protein sequences can be considered as random events due solely to the sequence composition of a proteome. Here we use a numerical approach to investigate the random appearance of functional motifs with the aim of addressing biological questions such as: How are organisms protected from undesirable occurrences of motifs otherwise selected for their functionality? Has the random appearance of functional motifs in protein sequences been affected during evolution? Results: Here we analyse the occurrence of functional motifs in random sequences and compare it to that observed in biological proteomes; the behaviour of random motifs is also studied. Most motifs exhibit a number of false positives significantly similar to the number of times they appear in randomized proteomes (= expected number of false positives). Interestingly, about 3% of the analysed motifs show a different kind of behaviour and appear in biological proteomes less often than they do in random sequences. In some of these cases, a mechanism of evolutionary negative selection is apparent; this helps to prevent unwanted functionalities which could interfere with cellular mechanisms. Conclusion: Our thorough statistical and biological analysis showed that several mechanisms and evolutionary constraints affect the appearance of functional motifs in protein sequences.
Affiliation(s)
- Allegra Via
- Centro di Bioinformatica Molecolare, Department of Biology, University of Rome Tor Vergata, Roma
- Pier Federico Gherardini
- Centro di Bioinformatica Molecolare, Department of Biology, University of Rome Tor Vergata, Roma
- Enrico Ferraro
- Centro di Bioinformatica Molecolare, Department of Biology, University of Rome Tor Vergata, Roma
- Gabriele Ausiello
- Centro di Bioinformatica Molecolare, Department of Biology, University of Rome Tor Vergata, Roma
- Manuela Helmer-Citterich
- Centro di Bioinformatica Molecolare, Department of Biology, University of Rome Tor Vergata, Roma
7
Shen S, Kai B, Ruan J, Torin Huzil J, Carpenter E, Tuszynski JA. Probabilistic analysis of the frequencies of amino acid pairs within characterized protein sequences. Physica A 2006; 370:651-662. [PMID: 32288076 PMCID: PMC7127678 DOI: 10.1016/j.physa.2006.03.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.7] [Received: 01/03/2006] [Revised: 02/22/2006] [Indexed: 06/07/2023]
Abstract
Here, we describe a unique probabilistic evaluation of the 20 naturally occurring amino acids and their distributions within the Swiss-Prot and Complete Human Genebank databases. We have developed a computational technique that imparts both directionality and length constraints into searches for unique combinations of amino acids within protein sequences. Using statistical approaches, we have carried out searches of all possible two- and three-residue motifs contained within these databases. This technique is based on the unusually high occurrence of a small number of these motifs when compared to the expected probability of finding a specific residue grouping within a given database. Subsequent filtering of this search to identify such unique combinations has provided several examples that can be used as markers to identify particular proteins within or across databases. We focus on three of these motifs, which were found to be of greatest interest to us. The CC and CM motifs, and their combination CCM, all occur either more or less frequently than would be predicted based on standard amino acid distributions within the entire human proteome.
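The comparison between observed motif counts and the counts expected from residue frequencies can be illustrated with a minimal sketch. It assumes adjacent ordered pairs and an independence null model, in which the expected count is the product of the two residue frequencies and the number of adjacent positions. The function name `pair_enrichment` and the toy input are hypothetical and are not taken from the paper, whose exact statistics are not given in the abstract.

```python
from collections import Counter


def pair_enrichment(sequences, pair):
    """Observed/expected ratio for an ordered, adjacent two-residue motif.

    Expected counts assume residues are drawn independently according to
    their overall frequencies (an assumption made for this sketch).
    """
    residue_counts = Counter()
    pair_counts = Counter()
    for seq in sequences:
        residue_counts.update(seq)
        pair_counts.update(seq[i:i + 2] for i in range(len(seq) - 1))

    total_residues = sum(residue_counts.values())
    total_pairs = sum(pair_counts.values())
    if total_pairs == 0:
        raise ValueError("no adjacent residue pairs in the input")

    p_first = residue_counts[pair[0]] / total_residues
    p_second = residue_counts[pair[1]] / total_residues
    expected = p_first * p_second * total_pairs
    if expected == 0:
        raise ValueError("a residue of the pair never occurs in the input")
    return pair_counts[pair] / expected
```

A ratio above 1 indicates that the pair occurs more often than the independence model predicts, and a ratio below 1 indicates that it occurs less often.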
Affiliation(s)
- Shiyi Shen
- College of Mathematical Science and LPMC, Nankai University, Tianjin 300071, PR China
- Bo Kai
- College of Mathematical Science and LPMC, Nankai University, Tianjin 300071, PR China
- Jishou Ruan
- College of Mathematical Science and LPMC, Nankai University, Tianjin 300071, PR China
- J. Torin Huzil
- Department of Oncology, Division of Experimental Oncology, Cross Cancer Institute, University of Alberta, 11560 University Avenue, Edmonton, Canada AB T6G 1Z2
- Eric Carpenter
- Department of Oncology, Division of Experimental Oncology, Cross Cancer Institute, University of Alberta, 11560 University Avenue, Edmonton, Canada AB T6G 1Z2
- Jack A. Tuszynski
- Department of Oncology, Division of Experimental Oncology, Cross Cancer Institute, University of Alberta, 11560 University Avenue, Edmonton, Canada AB T6G 1Z2
8
Jackups R, Cheng S, Liang J. Sequence motifs and antimotifs in beta-barrel membrane proteins from a genome-wide analysis: the Ala-Tyr dichotomy and chaperone binding motifs. J Mol Biol 2006; 363:611-23. [PMID: 16973175 DOI: 10.1016/j.jmb.2006.07.095] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Received: 05/09/2006] [Revised: 07/29/2006] [Accepted: 07/31/2006] [Indexed: 10/24/2022]
Abstract
Beta-barrel membrane proteins are found in the outer membrane of gram-negative bacteria, mitochondria, and chloroplasts. Although sequence motifs have been studied in alpha-helical membrane proteins and have been shown to play important roles in their assembly, it is not clear whether over-represented motifs and under-represented anti-motifs exist in beta-barrel membrane proteins. We have developed probabilistic models to identify sequence motifs of residue pairs on the same strand separated by an arbitrary number of residues. A rigorous statistical model is essential for this study because of the difficulty associated with the short length of the strands and the small amount of structural data. By comparing to the null model of exhaustive permutation of residues within the same beta-strand, propensity values of sequence patterns of two residues and p-values measuring statistical significance are calculated exactly by several analytical formulae we have developed or by enumeration. We find that there are characteristic sequence motifs and antimotifs in transmembrane (TM) beta-strands. The amino acid Tyr plays an important role in several such motifs. We find a general dichotomy consisting of favorable Aliphatic-Tyr sequence motifs and unfavorable Tyr-Aliphatic antimotifs. Tyr is also part of a terminal motif, YxF, which is likely to be important for chaperone binding. Our results also suggest several experiments that can help to elucidate the mechanisms of in vitro and in vivo folding of beta-barrel membrane proteins.
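The permutation null model mentioned in this abstract can be sketched for a single ordered residue pair at a fixed separation. Under uniform random permutation of the residues within one strand, the expected number of occurrences follows from sampling without replacement. The formula in `expected_pairs` is a standard combinatorial identity and is not claimed to be one of the authors' analytical formulae; all function names are hypothetical, and the brute-force version is included only as a cross-check for short strands.

```python
from itertools import permutations


def observed_pairs(strand, x, y, k):
    """Count positions i where strand[i] == x and strand[i + k] == y."""
    return sum(1 for i in range(len(strand) - k)
               if strand[i] == x and strand[i + k] == y)


def expected_pairs(strand, x, y, k):
    """Expected count of the x..y pair at separation k when the residues
    of this strand are permuted uniformly at random."""
    n = len(strand)
    if n < 2 or k < 1 or k >= n:
        return 0.0
    n_x = strand.count(x)
    # If x == y, the second residue must be a different copy of x.
    n_y = strand.count(y) - (1 if x == y else 0)
    return (n - k) * n_x * n_y / (n * (n - 1))


def expected_pairs_by_enumeration(strand, x, y, k):
    """Same expectation by exhaustive permutation (short strands only)."""
    perms = list(permutations(strand))
    return sum(observed_pairs(p, x, y, k) for p in perms) / len(perms)


def propensity(strands, x, y, k):
    """Observed count divided by the permutation expectation, pooled
    over all strands; NaN if the pair cannot occur at all."""
    obs = sum(observed_pairs(s, x, y, k) for s in strands)
    exp = sum(expected_pairs(s, x, y, k) for s in strands)
    return obs / exp if exp > 0 else float("nan")
```

A propensity above 1 would correspond to an over-represented pair (a motif candidate) and a value below 1 to an under-represented pair (an antimotif candidate). Assessing significance would additionally require p-values, which this sketch does not compute.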
Affiliation(s)
- Ronald Jackups
- Department of Bioengineering,SEO, MC-063, University of Illinois at Chicago, 851 S. Morgan Street, Room 218, Chicago, IL 60607-7052, USA
9
Sandve GK, Drabløs F. A survey of motif discovery methods in an integrated framework. Biol Direct 2006; 1:11. [PMID: 16600018 PMCID: PMC1479319 DOI: 10.1186/1745-6150-1-11] [Citation(s) in RCA: 109] [Impact Index Per Article: 6.1] [Received: 03/29/2006] [Accepted: 04/06/2006] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: There has been a growing interest in computational discovery of regulatory elements, and a multitude of motif discovery methods have been proposed. Computational motif discovery has been used with some success in simple organisms like yeast. However, as we move to higher organisms with more complex genomes, more sensitive methods are needed. Several recent methods try to integrate additional sources of information, including microarray experiments (gene expression and ChIP-chip). There is also a growing awareness that regulatory elements work in combination, and that this combinatorial behavior must be modeled for successful motif discovery. However, the multitude of methods and approaches makes it difficult to get a good understanding of the current status of the field. RESULTS: This paper presents a survey of methods for motif discovery in DNA, based on a structured and well defined framework that integrates all relevant elements. Existing methods are discussed according to this framework. CONCLUSION: The survey shows that although no single method takes all relevant elements into consideration, a very large number of different models treating the various elements separately have been tried. Very often the choices that have been made are not explicitly stated, making it difficult to compare different implementations. Also, the tests that have been used are often not comparable. Therefore, a stringent framework and improved test methods are needed to evaluate the different approaches in order to conclude which ones are most promising. REVIEWERS: This article was reviewed by Eugene V. Koonin, Philipp Bucher (nominated by Mikhail Gelfand) and Frank Eisenhaber.
Affiliation(s)
- Geir Kjetil Sandve
- Department of Computer and Information Science, NTNU – Norwegian University of Science and Technology, N-7052, Trondheim, Norway
- Finn Drabløs
- Department of Cancer Research and Molecular Medicine, NTNU – Norwegian University of Science and Technology, N-7006, Trondheim, Norway
10
Tao T, Zhai CX, Lu X, Fang H. A study of statistical methods for function prediction of protein motifs. ACTA ACUST UNITED AC 2005; 3:115-24. [PMID: 15693737 DOI: 10.2165/00822942-200403020-00006] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Indexed: 11/02/2022]
Abstract
Automatic discovery of new protein motifs (i.e. amino acid patterns) is one of the major challenges in bioinformatics. Several algorithms have been proposed that can extract statistically significant motif patterns from any set of protein sequences. With these methods, one can generate a large set of candidate motifs that may be biologically meaningful. This article examines methods to predict the functions of these candidate motifs. We use several statistical methods: a popularity method, a mutual information method and probabilistic translation models. These methods capture, from different perspectives, the correlations between the matched motifs of a protein and its assigned Gene Ontology terms that characterise the function of the protein. We evaluate these different methods using the known motifs in the InterPro database. Each method is used to rank candidate terms for each motif. We then use the expected mean reciprocal rank to evaluate the performance. The results show that, in general, all these methods perform well, suggesting that they can all be useful for predicting the function of an unknown motif. Among the methods tested, a probabilistic translation model with a popularity prior performs the best.
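The evaluation measure named in this abstract builds on the reciprocal rank of the first correct term in each ranked list. The sketch below computes the plain mean reciprocal rank. The abstract does not specify how the "expected" variant is defined, so that part is not reproduced; the function name, the data layout, and the term identifiers in the example are hypothetical.

```python
def mean_reciprocal_rank(rankings, relevant):
    """Plain mean reciprocal rank over a set of motifs.

    rankings: dict mapping motif -> list of candidate terms, best first.
    relevant: dict mapping motif -> set of terms considered correct.
    A motif with no correct term in its list contributes 0.
    """
    if not rankings:
        return 0.0
    total = 0.0
    for motif, ranked_terms in rankings.items():
        correct = relevant.get(motif, set())
        for rank, term in enumerate(ranked_terms, start=1):
            if term in correct:
                total += 1.0 / rank   # only the first correct hit counts
                break
    return total / len(rankings)
```

For instance, with `{"m1": ["term_a", "term_b"], "m2": ["term_c", "term_d"]}` as rankings and `{"m1": {"term_b"}, "m2": {"term_c"}}` as the correct terms, the reciprocal ranks are 1/2 and 1, giving a mean of 0.75.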
Affiliation(s)
- Tao Tao
- Department of Computer Science, University of Illinois at Urbana-Champaign, 506 S. Matthews Avenue, Urbana, IL 61801, USA.
11
Abstract
MOTIVATION: Multiple sequence alignment at the level of whole proteomes requires a high degree of automation, precluding the use of traditional validation methods such as manual curation. Since evolutionary models are too general to describe the history of each residue in a protein family, there is no single algorithm/model combination that can yield a biologically or evolutionarily optimal alignment. We propose a 'shotgun' strategy where many different algorithms are used to align the same family, and the best of these alignments is then chosen with a reliable objective function. We present WOOF, a novel 'word-oriented' objective function that relies on the identification and scoring of conserved amino acid patterns (words) between pairs of sequences. RESULTS: Tests on a subset of reference protein alignments from BAliBASE showed that WOOF tended to rank the (manually curated) reference alignment highest among 1060 alternative (automatically generated) alignments for a majority of protein families. Among the automated alignments, there was a strong positive relationship between the WOOF score and similarity to the reference alignment. The speed of WOOF and its independence from explicit considerations of three-dimensional structure make it an excellent tool for analyzing large numbers of protein families. AVAILABILITY: On request from the authors.
Affiliation(s)
- Robert G Beiko
- ARC Centre in Bioinformatics and Institute for Molecular Bioscience, The University of Queensland Brisbane, Qld 4072, Australia
12
Liu AH, Califano A. CASTOR: clustering algorithm for sequence taxonomical organization and relationships. J Comput Biol 2003; 10:21-45. [PMID: 12676049 DOI: 10.1089/106652703763255651] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.1] [Indexed: 11/12/2022] Open
Abstract
Given a set of related proteins, two important problems in biology are the inference of protein subsets such that members of one subset share a common function and the identification of protein regions that possess functional significance. The former is typically approached by hierarchical bottom-up clustering based on pairwise sequence similarity and various linkage rules. The latter is typically approached in a supervised manner, based on global multiple sequence alignment. However, the two problems are inextricably linked, since functional subsets are usually characterized by distinctive functional regions. This paper introduces CASTOR, an automatic and unsupervised system that addresses both problems simultaneously and efficiently. It identifies protein regions that are likely to have functional significance by discovering and refining statistically significant motifs. It infers likely functional protein subsets and their relationships based on the presence of the discovered motifs in a top-down and recursive manner, allowing the identification of both hierarchical and nonhierarchical subset relationships. This is, to our knowledge, the first system that approaches both problems simultaneously in a top-down, systematic manner. CASTOR's performance is evaluated against the G-protein coupled receptor superfamily. The identified protein regions lead to a taxonomical organization of this superfamily that is in remarkable agreement with a biologically motivated one and which outperforms those produced by bottom-up clustering methods. We also find that conventional hierarchical representations may fail to accurately describe the complexity of evolutionary development responsible for the final organization of a complex protein family. In particular, many functional relationships governing distant subfamilies of such a protein family may not be represented hierarchically.
Affiliation(s)
- Agatha H Liu
- Computational Biology Center, TJ Watson IBM Research, Yorktown Heights, NY 10598, USA
13
Liu AH, Zhang X, Stolovitzky GA, Califano A, Firestein SJ. Motif-based construction of a functional map for mammalian olfactory receptors. Genomics 2003; 81:443-56. [PMID: 12706103 DOI: 10.1016/s0888-7543(03)00022-3] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.0] [Indexed: 11/19/2022]
Abstract
We applied an automatic and unsupervised system to a nearly complete database of mammalian odor receptor genes. The generated motifs and gene classification were subjected to extensive and systematic downstream analysis to obtain biological insights. Two major results from this analysis were: (1) a map of sequence motifs that may correlate with function and (2) the corresponding receptor classes in which members of each class are likely to share specific functions. We have discovered motifs that have been implicated in structural integrity and posttranslational modification, as well as motifs very likely to be directly involved in ligand binding. We further propose a combinatorial molecular hypothesis, based on unique combinations of the observed motifs, that provides a foundation for understanding the generation of a large number of ligand binding sites.
Affiliation(s)
- Agatha H Liu
- Computational Biology Center, T. J. Watson IBM Research, P.O. Box 218, Yorktown Heights, NY 10598, USA
14
Abstract
In its early days, the entire field of computational biology revolved almost entirely around biological sequence analysis. Over the past few years, however, a number of new non-sequence-based areas of investigation have become mainstream, from the analysis of gene expression data from microarrays, to whole-genome association discovery, and to the reverse engineering of gene regulatory pathways. Nonetheless, with the completion of private and public efforts to map the human genome, as well as those of other organisms, sequence data continue to be a veritable mother lode of valuable biological information that can be mined in a variety of contexts. Furthermore, the integration of sequence data with a variety of alternative information is providing valuable and fundamentally new insight into biological processes, as well as an array of new computational methodologies for the analysis of biological data.
Affiliation(s)
- A Califano
- First Genetic Trust Inc., 9 Polito Avenue, Suite 930, Lyndhurst, NJ 07071, USA.
15
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447210 DOI: 10.1002/cfg.57] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/06/2022] Open