1
Castillo S, Barth D, Arvas M, Pakula TM, Pitkänen E, Blomberg P, Seppanen-Laakso T, Nygren H, Sivasiddarthan D, Penttilä M, Oja M. Whole-genome metabolic model of Trichoderma reesei built by comparative reconstruction. Biotechnology for Biofuels 2016; 9:252. [PMID: 27895706 PMCID: PMC5117618 DOI: 10.1186/s13068-016-0665-0] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.9] [Received: 06/09/2016] [Accepted: 11/10/2016] [Indexed: 05/02/2023]
Abstract
BACKGROUND: Trichoderma reesei is one of the main sources of biomass-hydrolyzing enzymes for the biotechnology industry. There is a need for improving its enzyme production efficiency. The use of metabolic modeling for the simulation and prediction of this organism's metabolism is potentially a valuable tool for improving its capabilities. An accurate metabolic model is needed to perform metabolic modeling analysis. RESULTS: A whole-genome metabolic model of T. reesei has been reconstructed together with metabolic models of 55 related species using the metabolic model reconstruction algorithm CoReCo. The previously published CoReCo method has been improved to obtain better quality models. The main improvements are the creation of a unified database of reactions and compounds and the use of reaction directions as constraints in the gap-filling step of the algorithm. In addition, the biomass composition of T. reesei has been measured experimentally to build and include a specific biomass equation in the model. CONCLUSIONS: The improvements presented in this work on the CoReCo pipeline for metabolic model reconstruction resulted in higher-quality metabolic models compared with previous versions. A metabolic model of T. reesei has been created and is publicly available in the BIOMODELS database. The model contains a biomass equation, reaction boundaries and uptake/export reactions which make it ready for simulation. To validate the model, we demonstrate that the model is able to predict biomass production accurately and no stoichiometrically infeasible yields are detected. The new T. reesei model is ready to be used for simulations of protein production processes.
Affiliation(s)
- Sandra Castillo
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Dorothee Barth
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Mikko Arvas
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Tiina M. Pakula
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Esa Pitkänen
  - Department of Computer Science, University of Helsinki, P.O. 68 (Gustaf Hällströmin katu 2b), 00014 Helsinki, Finland
- Peter Blomberg
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Heli Nygren
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Merja Penttilä
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
- Merja Oja
  - VTT Technical Research Centre of Finland, Tietotie 2, P.O. Box FI-1000, 02044 Espoo, Finland
2
Kludas J, Arvas M, Castillo S, Pakula T, Oja M, Brouard C, Jäntti J, Penttilä M, Rousu J. Machine Learning of Protein Interactions in Fungal Secretory Pathways. PLoS One 2016; 11:e0159302. [PMID: 27441920 PMCID: PMC4956264 DOI: 10.1371/journal.pone.0159302] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 02/15/2016] [Accepted: 06/30/2016] [Indexed: 12/18/2022]
Abstract
In this paper we apply machine learning methods for predicting protein interactions in fungal secretion pathways. We assume an inter-species transfer setting, where training data is obtained from a single species and the objective is to predict protein interactions in other, related species. In our methodology, we combine several state-of-the-art machine learning approaches, namely multiple kernel learning (MKL), pairwise kernels and kernelized structured output prediction in the supervised graph inference framework. For MKL, we apply recently proposed centered kernel alignment and p-norm path following approaches to integrate several feature sets describing the proteins, demonstrating improved performance. For graph inference, we apply input-output kernel regression (IOKR) in supervised and semi-supervised modes as well as output kernel trees (OK3). In our experiments simulating increasing genetic distance, IOKR proved to be the most robust prediction approach. We also show that the MKL approaches improve the predictions compared to a uniform combination of the kernels. We evaluate the methods on the task of predicting protein-protein interactions in the secretion pathways of fungi, with S. cerevisiae (baker's yeast) as the source and T. reesei as the target of the inter-species transfer learning. We identify completely novel candidate secretion proteins conserved in filamentous fungi. These proteins could contribute to their unique secretion capabilities.
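The contrast drawn above between a uniform kernel combination and alignment-based MKL can be illustrated with a minimal sketch. It assumes the standard definition of centered kernel alignment (the normalized Frobenius inner product of two centered kernel matrices) and a simple heuristic that weights each kernel in proportion to its alignment with a target kernel. The function names, the toy matrices and the weighting heuristic are illustrative assumptions, not the procedure used in the paper.

```python
import math

def center(K):
    """Center a kernel matrix in feature space: Kc = H K H with H = I - (1/n) 11^T."""
    n = len(K)
    row = [sum(r) / n for r in K]
    col = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total = sum(row) / n
    return [[K[i][j] - row[i] - col[j] + total for j in range(n)] for i in range(n)]

def frobenius(A, B):
    """Frobenius inner product <A, B> of two equally sized matrices."""
    return sum(a * b for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def centered_alignment(K1, K2):
    """Centered alignment in [-1, 1]; assumes neither centered matrix is all zeros."""
    A, B = center(K1), center(K2)
    return frobenius(A, B) / math.sqrt(frobenius(A, A) * frobenius(B, B))

def alignment_weights(kernels, target):
    """Hypothetical heuristic: weight each kernel by its (non-negative) alignment."""
    scores = [max(centered_alignment(K, target), 0.0) for K in kernels]
    s = sum(scores)
    if s == 0:  # fall back to the uniform combination
        return [1.0 / len(kernels)] * len(kernels)
    return [x / s for x in scores]

# Toy data (assumed): four proteins in two classes, target kernel y y^T.
y = [1, 1, -1, -1]
target = [[a * b for b in y] for a in y]
informative = [row[:] for row in target]  # kernel that matches the labels
identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

weights = alignment_weights([informative, identity], target)
```

In this toy case the informative kernel receives a larger weight than the identity kernel, whereas a uniform combination would weight both at 0.5.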
Affiliation(s)
- Jana Kludas
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
- Mikko Arvas
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Tiina Pakula
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Merja Oja
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Céline Brouard
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
- Jussi Jäntti
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Merja Penttilä
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Juho Rousu
  - Helsinki Institute for Information Technology HIIT, Department of Computer Science, Aalto University, Espoo, Finland
3
Ma J, Wang S, Wang Z, Xu J. Protein contact prediction by integrating joint evolutionary coupling analysis and supervised learning. Bioinformatics 2015; 31:3506-13. [PMID: 26275894 DOI: 10.1093/bioinformatics/btv472] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Received: 12/18/2014] [Accepted: 08/08/2015] [Indexed: 02/07/2023]
Abstract
MOTIVATION: Protein contact prediction is important for protein structure and functional study. Both evolutionary coupling (EC) analysis and supervised machine learning methods have been developed, making use of different information sources. However, contact prediction is still challenging especially for proteins without a large number of sequence homologs. RESULTS: This article presents a group graphical lasso (GGL) method for contact prediction that integrates joint multi-family EC analysis and supervised learning to improve accuracy on proteins without many sequence homologs. Different from existing single-family EC analysis that uses residue coevolution information in only the target protein family, our joint EC analysis uses residue coevolution in both the target family and its related families, which may have divergent sequences but similar folds. To implement this, we model a set of related protein families using Gaussian graphical models and then coestimate their parameters by maximum likelihood, subject to the constraint that these parameters shall be similar to some degree. Our GGL method can also integrate supervised learning methods to further improve accuracy. Experiments show that our method outperforms existing methods on proteins without thousands of sequence homologs, and that our method performs better on both conserved and family-specific contacts. AVAILABILITY AND IMPLEMENTATION: See http://raptorx.uchicago.edu/ContactMap/ for a web server implementing the method. CONTACT: j3xu@ttic.edu SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Jianzhu Ma
  - Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave. Chicago, Illinois 60637 USA
- Sheng Wang
  - Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave. Chicago, Illinois 60637 USA
- Zhiyong Wang
  - Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave. Chicago, Illinois 60637 USA
- Jinbo Xu
  - Toyota Technological Institute at Chicago, 6045 S. Kenwood Ave. Chicago, Illinois 60637 USA
4
Roston RL, Wang K, Kuhn LA, Benning C. Structural determinants allowing transferase activity in SENSITIVE TO FREEZING 2, classified as a family I glycosyl hydrolase. J Biol Chem 2014; 289:26089-26106. [PMID: 25100720 DOI: 10.1074/jbc.m114.576694] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.8] [Indexed: 12/28/2022]
Abstract
SENSITIVE TO FREEZING 2 (SFR2) is classified as a family I glycosyl hydrolase but has recently been shown to have galactosyltransferase activity in Arabidopsis thaliana. Natural occurrences of apparent glycosyl hydrolases acting as transferases are interesting from a biocatalysis standpoint, and knowledge about the interconversion can assist in engineering SFR2 in crop plants to resist freezing. To understand how SFR2 evolved into a transferase, the relationship between its structure and function is investigated by activity assay, molecular modeling, and site-directed mutagenesis. SFR2 has no detectable hydrolase activity, although its catalytic site is highly conserved with that of family 1 glycosyl hydrolases. Three regions disparate from glycosyl hydrolases are identified as required for transferase activity: a loop insertion, the C-terminal peptide, and a hydrophobic patch adjacent to the catalytic site. Rationales for the effects of these regions on the SFR2 mechanism are discussed.
Affiliation(s)
- Rebecca L Roston
  - Departments of Biochemistry and Molecular Biology and Michigan State University, East Lansing, Michigan 48824
- Kun Wang
  - Departments of Biochemistry and Molecular Biology and Michigan State University, East Lansing, Michigan 48824
- Leslie A Kuhn
  - Departments of Biochemistry and Molecular Biology and Michigan State University, East Lansing, Michigan 48824; Departments of Computer Science and Engineering, Michigan State University, East Lansing, Michigan 48824
- Christoph Benning
  - Departments of Biochemistry and Molecular Biology and Michigan State University, East Lansing, Michigan 48824
5
Pitkänen E, Jouhten P, Hou J, Syed MF, Blomberg P, Kludas J, Oja M, Holm L, Penttilä M, Rousu J, Arvas M. Comparative genome-scale reconstruction of gapless metabolic networks for present and ancestral species. PLoS Comput Biol 2014; 10:e1003465. [PMID: 24516375 PMCID: PMC3916221 DOI: 10.1371/journal.pcbi.1003465] [Citation(s) in RCA: 66] [Impact Index Per Article: 6.0] [Received: 07/07/2013] [Accepted: 12/18/2013] [Indexed: 12/12/2022]
Abstract
We introduce a novel computational approach, CoReCo, for comparative metabolic reconstruction and provide genome-scale metabolic network models for 49 important fungal species. Leveraging the exponential growth in sequenced genome availability, our method reconstructs genome-scale gapless metabolic networks simultaneously for a large number of species by integrating sequence data in a probabilistic framework. High reconstruction accuracy is demonstrated by comparisons to the well-curated Saccharomyces cerevisiae consensus model and large-scale knock-out experiments. Our comparative approach is particularly useful in scenarios where the quality of available sequence data is lacking, and when reconstructing evolutionarily distant species. Moreover, the reconstructed networks are fully carbon mapped, allowing their use in 13C flux analysis. We demonstrate the functionality and usability of the reconstructed fungal models with a computational steady-state biomass production experiment, as these fungi include some of the most important production organisms in industrial biotechnology. In contrast to many existing reconstruction techniques, only minimal manual effort is required before the reconstructed models are usable in flux balance experiments. CoReCo is available at http://esaskar.github.io/CoReCo/.
Advances in next-generation sequencing technologies are revolutionizing molecular biology. Sequencing-enabled cost-effective characterization of microbial genomes is a particularly exciting development in metabolic engineering. There, considerable effort has been put into reconstructing genome-scale metabolic networks that describe the collection of hundreds to thousands of biochemical reactions available for a microbial cell. These network models are instrumental in understanding microbial metabolism and guiding metabolic engineering efforts to improve biochemical yields.
We have developed a novel computational method, CoReCo, which bridges the growing gap between the availability of sequenced genomes and respective reconstructed metabolic networks. The method reconstructs genome-scale metabolic networks simultaneously for related microbial species. It utilizes the available sequencing data from these species to correct for incomplete and missing data. We used the method to reconstruct metabolic networks for a set of 49 fungal species, providing the method with protein sequence data and a phylogenetic tree describing the evolutionary relationships between the species. We demonstrate the applicability of the method by comparing a metabolic reconstruction of Saccharomyces cerevisiae to the manually curated, high-quality consensus network. We also provide an easy-to-use implementation of the method, usable both in single computer and distributed computing environments.
Affiliation(s)
- Esa Pitkänen
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Department of Medical Genetics, Genome-Scale Biology Research Program, University of Helsinki, Helsinki, Finland
- Paula Jouhten
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Jian Hou
  - Department of Computer Science, University of Helsinki, Helsinki, Finland
  - Department of Information and Computer Science, Aalto University, Espoo, Finland
- Peter Blomberg
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Jana Kludas
  - Department of Information and Computer Science, Aalto University, Espoo, Finland
- Merja Oja
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Liisa Holm
  - Institute of Biotechnology & Department of Biosciences, University of Helsinki, Helsinki, Finland
- Merja Penttilä
  - VTT Technical Research Centre of Finland, Espoo, Finland
- Juho Rousu
  - Department of Information and Computer Science, Aalto University, Espoo, Finland
- Mikko Arvas
  - VTT Technical Research Centre of Finland, Espoo, Finland
6
Plyusnin I, Holm L. Comprehensive comparison of graph based multiple protein sequence alignment strategies. BMC Bioinformatics 2012; 13:64. [PMID: 22540977 PMCID: PMC3375188 DOI: 10.1186/1471-2105-13-64] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Received: 11/24/2011] [Accepted: 04/29/2012] [Indexed: 12/03/2022]
Abstract
Background: Alignment of protein sequences (MPSA) is the starting point for a multitude of applications in molecular biology. Here, we present a novel MPSA program based on the SeqAn sequence alignment library. Our implementation has a strict modular structure, which allows swapping different components of the alignment process and, thus, investigating their contribution to the alignment quality and computation time. We systematically varied information sources, guiding trees, score transformations and iterative refinement options, and evaluated the resulting alignments on BAliBASE and SABmark. Results: Our results indicate the optimal alignment strategy based on the choices compared. First, we show that pairwise global and local alignments contain sufficient information to construct a high quality multiple alignment. Second, single linkage clustering is almost invariably the best algorithm to build a guiding tree for progressive alignment. Third, triplet library extension, with introduction of new edges, is the most efficient consistency transformation of those compared. Alternatively, one can apply tree dependent partitioning as a post processing step, which was shown to be comparable with the best consistency transformation in both time and accuracy. Finally, propagating information beyond four transitive links introduces more noise than signal. Conclusions: This is the first time multiple protein alignment strategies are comprehensively and clearly compared using a single implementation platform. In particular, we showed which of the existing consistency transformations and iterative refinement techniques are the most valid. Our implementation is freely available at http://ekhidna.biocenter.helsinki.fi/MMSA and as a supplementary file attached to this article (see Additional file 1).
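The guide-tree step mentioned in the abstract can be sketched as follows. This is a naive single linkage clustering over a pairwise distance table, returning the tree as nested tuples; it is an illustrative sketch, not the SeqAn-based implementation described in the article. The sequence names and distances are made up for the example.

```python
from itertools import combinations

def single_linkage_tree(names, dist):
    """Build a guide tree (nested tuples) by single linkage clustering.

    dist maps frozenset({a, b}) to the distance between sequences a and b.
    The distance between two clusters is the minimum over all member pairs.
    """
    # Each cluster is (subtree, tuple of member names).
    clusters = [(name, (name,)) for name in names]
    while len(clusters) > 1:
        best = None
        for i, j in combinations(range(len(clusters)), 2):
            d = min(dist[frozenset((a, b))]
                    for a in clusters[i][1] for b in clusters[j][1])
            if best is None or d < best[0]:
                best = (d, i, j)
        _, i, j = best
        merged = ((clusters[i][0], clusters[j][0]),
                  clusters[i][1] + clusters[j][1])
        # Drop the two merged clusters and append their union.
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)]
        clusters.append(merged)
    return clusters[0][0]

# Hypothetical pairwise distances between four sequences.
pairs = {("A", "B"): 0.1, ("C", "D"): 0.2, ("A", "C"): 0.5,
         ("A", "D"): 0.9, ("B", "C"): 0.6, ("B", "D"): 0.8}
dist = {frozenset(p): d for p, d in pairs.items()}

tree = single_linkage_tree(["A", "B", "C", "D"], dist)
# tree == (("A", "B"), ("C", "D"))
```

A progressive aligner would then align sequences in the order given by the tree, joining the closest pairs first. The quadratic rescan per merge keeps the sketch short; production code would typically use a faster scheme.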
Affiliation(s)
- Ilya Plyusnin
  - Institute of Biotechnology, University of Helsinki, P.O. Box 56, Viikinkaari 5, Helsinki, Finland
7
Söding J, Remmert M. Protein sequence comparison and fold recognition: progress and good-practice benchmarking. Curr Opin Struct Biol 2011; 21:404-11. [PMID: 21458982 DOI: 10.1016/j.sbi.2011.03.005] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Received: 01/31/2011] [Revised: 03/01/2011] [Accepted: 03/09/2011] [Indexed: 11/26/2022]
Abstract
Protein sequence comparison methods have grown increasingly sensitive during the last decade and can often identify distantly related proteins sharing a common ancestor some 3 billion years ago. Although cellular function is not conserved so long, molecular functions and structures of protein domains often are. In combination with a domain-centered approach to function and structure prediction, modern remote homology detection methods have a great and largely underexploited potential for elucidating protein functions and evolution. Advances during the last few years include nonlinear scoring functions combining various sequence features, the use of sequence context information, and powerful new software packages. Since progress depends on realistically assessing new and existing methods and published benchmarks are often hard to compare, we propose 10 rules of good-practice benchmarking.
Affiliation(s)
- Johannes Söding
  - Gene Center and Center for Integrated Protein Science, Ludwig-Maximilians-Universität München, Feodor-Lynen-Strasse 25, Munich, Germany
8
Malysiak-Mrozek B, Mrozek D. An Improved Method for Protein Similarity Searching by Alignment of Fuzzy Energy Signatures. Int J Comput Int Sys 2011. [DOI: 10.1080/18756891.2011.9727765] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 10/26/2022]
9
Melvin I, Weston J, Noble WS, Leslie C. Detecting remote evolutionary relationships among proteins by large-scale semantic embedding. PLoS Comput Biol 2011; 7:e1001047. [PMID: 21298082 PMCID: PMC3029239 DOI: 10.1371/journal.pcbi.1001047] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Received: 07/24/2010] [Accepted: 12/02/2010] [Indexed: 12/23/2022]
Abstract
Virtually every molecular biologist has searched a protein or DNA sequence database to find sequences that are evolutionarily related to a given query. Pairwise sequence comparison methods (i.e., measures of similarity between query and target sequences) provide the engine for sequence database search and have been the subject of 30 years of computational research. For the difficult problem of detecting remote evolutionary relationships between protein sequences, the most successful pairwise comparison methods involve building local models (e.g., profile hidden Markov models) of protein sequences. However, recent work in massive data domains like web search and natural language processing demonstrates the advantage of exploiting the global structure of the data space. Motivated by this work, we present a large-scale algorithm called ProtEmbed, which learns an embedding of protein sequences into a low-dimensional "semantic space." Evolutionarily related proteins are embedded in close proximity, and additional pieces of evidence, such as 3D structural similarity or class labels, can be incorporated into the learning process. We find that ProtEmbed achieves superior accuracy to widely used pairwise sequence methods like PSI-BLAST and HHSearch for remote homology detection; it also outperforms our previous RankProp algorithm, which incorporates global structure in the form of a protein similarity network. Finally, the ProtEmbed embedding space can be visualized, both at the global level and local to a given query, yielding intuition about the structure of protein sequence space.
Affiliation(s)
- Iain Melvin
  - NEC Laboratories America, Princeton, New Jersey, United States of America
- Jason Weston
  - Google, New York, New York, United States of America
- William Stafford Noble
  - Department of Genome Sciences, University of Washington, Seattle, Washington, United States of America
- Christina Leslie
  - Computational Biology Program, Memorial Sloan-Kettering Cancer Center, New York, New York, United States of America
10
Hasegawa H, Holm L. Advances and pitfalls of protein structural alignment. Curr Opin Struct Biol 2009; 19:341-8. [PMID: 19481444 DOI: 10.1016/j.sbi.2009.04.003] [Citation(s) in RCA: 275] [Impact Index Per Article: 17.2] [Received: 04/02/2009] [Accepted: 04/16/2009] [Indexed: 11/30/2022]
Abstract
Structure comparison opens a window into the distant past of protein evolution, which has been unreachable by sequence comparison alone. With 55,000 entries in the Protein Data Bank and about 500 new structures added each week, automated processing, comparison, and classification are necessary. A variety of methods use different representations, scoring functions, and optimization algorithms, and they generate contradictory results even for moderately distant structures. Sequence mutations, insertions, and deletions are accommodated by plastic deformations of the common core, retaining the precise geometry of the active site, and peripheral regions may refold completely. Therefore structure comparison methods that allow for flexibility and plasticity generate the most biologically meaningful alignments. Active research directions include both the search for fold invariant features and the modeling of structural transitions in evolution. Advances have been made in algorithmic robustness, multiple alignment, and speeding up database searches.
Affiliation(s)
- Hitomi Hasegawa
  - Institute of Biotechnology, University of Helsinki, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland
11
Abstract
Current protein classification methods treat high-resolution structures as static entities. However, experiments have well documented the dynamic nature of proteins. With knowledge that thermodynamic fluctuations around the high-resolution structure contribute to a more physically accurate and biologically meaningful picture of a protein, the concept of a protein's energetic profile is introduced. It is demonstrated on a large scale that energetic profiles are both diagnostic of a protein fold and evolutionarily relevant. Development of Structural Thermodynamic Ensemble-based Protein Homology (STEPH), an algorithm that searches for local similarities between energetic profiles, constitutes a first step towards a long-term goal of our laboratory to integrate thermodynamic information into protein-fold classification approaches.
Affiliation(s)
- Jason Vertrees
  - Department of Biochemistry and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas, USA; Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas, USA
- James O. Wrabl
  - Department of Biochemistry and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas, USA; Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas, USA
- Vincent J. Hilser
  - Department of Biochemistry and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas, USA; Sealy Center for Structural Biology and Molecular Biophysics, University of Texas Medical Branch, Galveston, Texas, USA
12
Astikainen K, Holm L, Pitkänen E, Szedmak S, Rousu J. Towards structured output prediction of enzyme function. BMC Proc 2008; 2 Suppl 4:S2. [PMID: 19091049 PMCID: PMC2654971 DOI: 10.1186/1753-6561-2-s4-s2] [Citation(s) in RCA: 26] [Impact Index Per Article: 1.5] [Indexed: 11/25/2022]
Abstract
BACKGROUND: In this paper we describe work in progress in developing kernel methods for enzyme function prediction. Our focus is on developing so-called structured output prediction methods, where the enzymatic reaction is the combinatorial target object for prediction. We compared two structured output prediction methods, the Hierarchical Max-Margin Markov algorithm (HM3) and the Maximum Margin Regression algorithm (MMR), in hierarchical classification of enzyme function. As sequence features we use various string kernels and the GTG feature set derived from the global alignment trace graph of protein sequences. RESULTS: In our experiments, in predicting enzyme EC classification we obtain over 85% accuracy (predicting the four-digit EC code) and over 91% microlabel F1 score (predicting individual EC digits). In predicting the Gold Standard enzyme families, we obtain over 79% accuracy (predicting family correctly) and over 89% microlabel F1 score (predicting superfamilies and families). In the latter case, structured output methods are significantly more accurate than a nearest neighbor classifier. A polynomial kernel over the GTG feature set turned out to be a prerequisite for accurate function prediction. Combining GTG with string kernels boosted accuracy slightly in the case of EC class prediction. CONCLUSION: Structured output prediction with GTG features is shown to be computationally feasible and to have accuracy on par with state-of-the-art approaches in enzyme function prediction.
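The two scores reported above measure different things, which a small sketch can make concrete. It assumes one plausible reading of the metrics: accuracy counts a prediction as correct only if the full four-digit EC code matches, while microlabel F1 pools true and false positives over the individual nodes of the EC hierarchy (each code prefix is one microlabel). The function names and the example codes are illustrative, and the paper's exact definition may differ.

```python
def ec_prefixes(ec):
    """Expand '1.1.1.1' into its hierarchy nodes {'1', '1.1', '1.1.1', '1.1.1.1'}."""
    parts = ec.split(".")
    return {".".join(parts[:k]) for k in range(1, len(parts) + 1)}

def exact_accuracy(true_codes, pred_codes):
    """Fraction of predictions whose full EC code matches the true code."""
    hits = sum(t == p for t, p in zip(true_codes, pred_codes))
    return hits / len(true_codes)

def microlabel_f1(true_codes, pred_codes):
    """Micro-averaged F1 over hierarchy nodes, pooled across all examples."""
    tp = fp = fn = 0
    for t, p in zip(true_codes, pred_codes):
        t_nodes, p_nodes = ec_prefixes(t), ec_prefixes(p)
        tp += len(t_nodes & p_nodes)
        fp += len(p_nodes - t_nodes)
        fn += len(t_nodes - p_nodes)
    return 2 * tp / (2 * tp + fp + fn)

# Made-up example: one exact hit, one miss at the last digit, one miss at the second digit.
true_codes = ["1.1.1.1", "2.7.1.1", "3.4.21.4"]
pred_codes = ["1.1.1.1", "2.7.1.2", "3.1.3.1"]

acc = exact_accuracy(true_codes, pred_codes)  # 1 of 3 codes fully correct
f1 = microlabel_f1(true_codes, pred_codes)    # 8 of 12 nodes correct on each side
```

Here accuracy is 1/3 but microlabel F1 is 2/3, because partially correct predictions still earn credit for the upper levels of the hierarchy. This is why the microlabel score in the abstract is higher than the exact-match accuracy.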
Affiliation(s)
- Katja Astikainen
  - Department of Computer Science, PO Box 68, FI-00014 University of Helsinki, Finland
- Liisa Holm
  - Institute of Biotechnology, P.O. Box 56, FI-00014 University of Helsinki, Finland
- Esa Pitkänen
  - Department of Computer Science, PO Box 68, FI-00014 University of Helsinki, Finland
- Sandor Szedmak
  - Electronics and Computer Science, University of Southampton, SO17 1BJ, UK
- Juho Rousu
  - Department of Computer Science, PO Box 68, FI-00014 University of Helsinki, Finland
13
Holm L, Kääriäinen S, Rosenström P, Schenkel A. Searching protein structure databases with DaliLite v.3. Bioinformatics 2008; 24:2780-1. [PMID: 18818215 PMCID: PMC2639270 DOI: 10.1093/bioinformatics/btn507] [Citation(s) in RCA: 872] [Impact Index Per Article: 51.3] [Indexed: 11/13/2022]
Abstract
The Red Queen said, ‘It takes all the running you can do, to keep in the same place.’ (Lewis Carroll)
Motivation: Newly solved protein structures are routinely scanned against structures already in the Protein Data Bank (PDB) using Internet servers. In favourable cases, comparing 3D structures may reveal biologically interesting similarities that are not detectable by comparing sequences. The number of known structures continues to grow exponentially. Sensitive (thorough but slow) search algorithms are challenged to deliver results in a reasonable time, as there are now more structures in the PDB than seconds in a day. The brute-force solution would be to distribute the individual comparisons on a massively parallel computer. A frugal solution, as implemented in the Dali server, is to reduce the total computational cost by pruning search space using prior knowledge about the distribution of structures in fold space. This note reports paradigm revisions that enable maintaining such a knowledge base up-to-date on a PC. Availability: The Dali server for protein structure database searching at http://ekhidna.biocenter.helsinki.fi/dali_server is running DaliLite v.3. The software can be downloaded for academic use from http://ekhidna.biocenter.helsinki.fi/dali_lite/downloads/v3. Contact: liisa.holm@helsinki.fi
Affiliation(s)
- L Holm
  - Department of Biological and Environmental Sciences, and Institute of Biotechnology, P.O. Box 56 (Viikinkaari 5), 00014 University of Helsinki, Finland
14
Pei J. Multiple protein sequence alignment. Curr Opin Struct Biol 2008; 18:382-6. [PMID: 18485694 DOI: 10.1016/j.sbi.2008.03.007] [Citation(s) in RCA: 38] [Impact Index Per Article: 2.2] [Received: 01/13/2008] [Accepted: 03/18/2008] [Indexed: 11/16/2022]
Abstract
Multiple sequence alignments are essential in computational analysis of protein sequences and structures, with applications in structure modeling, functional site prediction, phylogenetic analysis and sequence database searching. Constructing accurate multiple alignments for divergent protein sequences remains a difficult computational task, and alignment speed becomes an issue for large sequence datasets. Here, I review methodologies and recent advances in the multiple protein sequence alignment field, with emphasis on the use of additional sequence and structural information to improve alignment quality.
Affiliation(s)
- Jimin Pei
  - Howard Hughes Medical Institute, University of Texas Southwestern Medical Center at Dallas, 5323 Harry Hines Boulevard, Dallas, TX 75390, USA
15
Heger A, Korpelainen E, Hupponen T, Mattila K, Ollikainen V, Holm L. PairsDB atlas of protein sequence space. Nucleic Acids Res 2008; 36:D276-80. [PMID: 17986464 PMCID: PMC2238971 DOI: 10.1093/nar/gkm879] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Received: 08/17/2007] [Revised: 09/28/2007] [Accepted: 10/01/2007] [Indexed: 11/12/2022]
Abstract
Sequence similarity/database searching is a cornerstone of molecular biology. PairsDB is a database intended to make exploring protein sequences and their similarity relationships quick and easy. Behind PairsDB is a comprehensive collection of protein sequences and BLAST and PSI-BLAST alignments between them. Instead of running BLAST or PSI-BLAST individually on each request, results are retrieved instantaneously from a database of pre-computed alignments. Filtering options allow you to find a set of sequences satisfying a set of criteria, for example, all human proteins with solved structure and without transmembrane segments. PairsDB is continually updated and covers all sequences in UniProt. The data is stored in a MySQL relational database. Data files will be made available for download at ftp://nic.funet.fi/pub/sci/molbio. PairsDB can also be accessed interactively at http://pairsdb.csc.fi. PairsDB data is a valuable platform on which to build various downstream automated analysis pipelines. For example, the graph of all-against-all similarity relationships is the starting point for clustering protein families, delineating domains, improving alignment accuracy by consistency measures, and defining orthologous genes. Moreover, query-anchored stacked sequence alignments, profiles and consensus sequences are useful in studies of sequence conservation patterns for clues about possible functional sites.
Affiliation(s)
- Andreas Heger
  - MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Eija Korpelainen
  - MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Taavi Hupponen
  - MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Kimmo Mattila
  - MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Vesa Ollikainen
  - MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland
- Liisa Holm
  - MRC Functional Genetics Unit, University of Oxford, UK, Institute of Biotechnology, University of Helsinki, Center for Scientific Computing (CSC), Espoo and Department of Biological and Environmental Sciences, Division of Genetics, University of Helsinki, Finland