1
Jain S, Bakolitsa C, Brenner SE, Radivojac P, Moult J, Repo S, Hoskins RA, Andreoletti G, Barsky D, Chellapan A, Chu H, Dabbiru N, Kollipara NK, Ly M, Neumann AJ, Pal LR, Odell E, Pandey G, Peters-Petrulewicz RC, Srinivasan R, Yee SF, Yeleswarapu SJ, Zuhl M, Adebali O, Patra A, Beer MA, Hosur R, Peng J, Bernard BM, Berry M, Dong S, Boyle AP, Adhikari A, Chen J, Hu Z, Wang R, Wang Y, Miller M, Wang Y, Bromberg Y, Turina P, Capriotti E, Han JJ, Ozturk K, Carter H, Babbi G, Bovo S, Di Lena P, Martelli PL, Savojardo C, Casadio R, Cline MS, De Baets G, Bonache S, Díez O, Gutiérrez-Enríquez S, Fernández A, Montalban G, Ootes L, Özkan S, Padilla N, Riera C, De la Cruz X, Diekhans M, Huwe PJ, Wei Q, Xu Q, Dunbrack RL, Gotea V, Elnitski L, Margolin G, Fariselli P, Kulakovskiy IV, Makeev VJ, Penzar DD, Vorontsov IE, Favorov AV, Forman JR, Hasenahuer M, Fornasari MS, Parisi G, Avsec Z, Çelik MH, Nguyen TYD, Gagneur J, Shi FY, Edwards MD, Guo Y, Tian K, Zeng H, Gifford DK, Göke J, Zaucha J, Gough J, Ritchie GRS, Frankish A, Mudge JM, Harrow J, Young EL, Yu Y, Huff CD, Murakami K, Nagai Y, Imanishi T, Mungall CJ, Jacobsen JOB, Kim D, Jeong CS, Jones DT, Li MJ, Guthrie VB, Bhattacharya R, Chen YC, Douville C, Fan J, Kim D, Masica D, Niknafs N, Sengupta S, Tokheim C, Turner TN, Yeo HTG, Karchin R, Shin S, Welch R, Keles S, Li Y, Kellis M, Corbi-Verge C, Strokach AV, Kim PM, Klein TE, Mohan R, Sinnott-Armstrong NA, Wainberg M, Kundaje A, Gonzaludo N, Mak ACY, Chhibber A, Lam HYK, Dahary D, Fishilevich S, Lancet D, Lee I, Bachman B, Katsonis P, Lua RC, Wilson SJ, Lichtarge O, Bhat RR, Sundaram L, Viswanath V, Bellazzi R, Nicora G, Rizzo E, Limongelli I, Mezlini AM, Chang R, Kim S, Lai C, O’Connor R, Topper S, van den Akker J, Zhou AY, Zimmer AD, Mishne G, Bergquist TR, Breese MR, Guerrero RF, Jiang Y, Kiga N, Li B, Mort M, Pagel KA, Pejaver V, Stamboulian MH, Thusberg J, Mooney SD, Teerakulkittipong N, Cao C, Kundu K, Yin Y, Yu CH, Kleyman M, Lin CF, Stackpole M, Mount SM, Eraslan 
G, Mueller NS, Naito T, Rao AR, Azaria JR, Brodie A, Ofran Y, Garg A, Pal D, Hawkins-Hooker A, Kenlay H, Reid J, Mucaki EJ, Rogan PK, Schwarz JM, Searls DB, Lee GR, Seok C, Krämer A, Shah S, Huang CV, Kirsch JF, Shatsky M, Cao Y, Chen H, Karimi M, Moronfoye O, Sun Y, Shen Y, Shigeta R, Ford CT, Nodzak C, Uppal A, Shi X, Joseph T, Kotte S, Rana S, Rao A, Saipradeep VG, Sivadasan N, Sunderam U, Stanke M, Su A, Adzhubey I, Jordan DM, Sunyaev S, Rousseau F, Schymkowitz J, Van Durme J, Tavtigian SV, Carraro M, Giollo M, Tosatto SCE, Adato O, Carmel L, Cohen NE, Fenesh T, Holtzer T, Juven-Gershon T, Unger R, Niroula A, Olatubosun A, Väliaho J, Yang Y, Vihinen M, Wahl ME, Chang B, Chong KC, Hu I, Sun R, Wu WKK, Xia X, Zee BC, Wang MH, Wang M, Wu C, Lu Y, Chen K, Yang Y, Yates CM, Kreimer A, Yan Z, Yosef N, Zhao H, Wei Z, Yao Z, Zhou F, Folkman L, Zhou Y, Daneshjou R, Altman RB, Inoue F, Ahituv N, Arkin AP, Lovisa F, Bonvini P, Bowdin S, Gianni S, Mantuano E, Minicozzi V, Novak L, Pasquo A, Pastore A, Petrosino M, Puglisi R, Toto A, Veneziano L, Chiaraluce R, Ball MP, Bobe JR, Church GM, Consalvi V, Cooper DN, Buckley BA, Sheridan MB, Cutting GR, Scaini MC, Cygan KJ, Fredericks AM, Glidden DT, Neil C, Rhine CL, Fairbrother WG, Alontaga AY, Fenton AW, Matreyek KA, Starita LM, Fowler DM, Löscher BS, Franke A, Adamson SI, Graveley BR, Gray JW, Malloy MJ, Kane JP, Kousi M, Katsanis N, Schubach M, Kircher M, Mak ACY, Tang PLF, Kwok PY, Lathrop RH, Clark WT, Yu GK, LeBowitz JH, Benedicenti F, Bettella E, Bigoni S, Cesca F, Mammi I, Marino-Buslje C, Milani D, Peron A, Polli R, Sartori S, Stanzial F, Toldo I, Turolla L, Aspromonte MC, Bellini M, Leonardi E, Liu X, Marshall C, McCombie WR, Elefanti L, Menin C, Meyn MS, Murgia A, Nadeau KCY, Neuhausen SL, Nussbaum RL, Pirooznia M, Potash JB, Dimster-Denk DF, Rine JD, Sanford JR, Snyder M, Cote AG, Sun S, Verby MW, Weile J, Roth FP, Tewhey R, Sabeti PC, Campagna J, Refaat MM, Wojciak J, Grubb S, Schmitt N, Shendure J, Spurdle AB, 
Stavropoulos DJ, Walton NA, Zandi PP, Ziv E, Burke W, Chen F, Carr LR, Martinez S, Paik J, Harris-Wai J, Yarborough M, Fullerton SM, Koenig BA, McInnes G, Shigaki D, Chandonia JM, Furutsuki M, Kasak L, Yu C, Chen R, Friedberg I, Getz GA, Cong Q, Kinch LN, Zhang J, Grishin NV, Voskanian A, Kann MG, Tran E, Ioannidis NM, Hunter JM, Udani R, Cai B, Morgan AA, Sokolov A, Stuart JM, Minervini G, Monzon AM, Batzoglou S, Butte AJ, Greenblatt MS, Hart RK, Hernandez R, Hubbard TJP, Kahn S, O’Donnell-Luria A, Ng PC, Shon J, Veltman J, Zook JM. CAGI, the Critical Assessment of Genome Interpretation, establishes progress and prospects for computational genetic variant interpretation methods. Genome Biol 2024; 25:53. [PMID: 38389099 PMCID: PMC10882881 DOI: 10.1186/s13059-023-03113-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/21/2023] [Accepted: 11/17/2023] [Indexed: 02/24/2024] Open
Abstract
BACKGROUND: The Critical Assessment of Genome Interpretation (CAGI) aims to advance the state-of-the-art for computational prediction of genetic variant impact, particularly where relevant to disease. The five complete editions of the CAGI community experiment comprised 50 challenges, in which participants made blind predictions of phenotypes from genetic data, and these were evaluated by independent assessors. RESULTS: Performance was particularly strong for clinical pathogenic variants, including some difficult-to-diagnose cases, and extends to interpretation of cancer-related variants. Missense variant interpretation methods were able to estimate biochemical effects with increasing accuracy. Assessment of methods for regulatory variants and complex trait disease risk was less definitive and indicates performance potentially suitable for auxiliary use in the clinic. CONCLUSIONS: Results show that while current methods are imperfect, they have major utility for research and clinical applications. Emerging methods and increasingly large, robust datasets for training and assessment promise further progress ahead.
2
Dembech E, Malatesta M, De Rito C, Mori G, Cavazzini D, Secchi A, Morandin F, Percudani R. Identification of hidden associations among eukaryotic genes through statistical analysis of coevolutionary transitions. Proc Natl Acad Sci U S A 2023; 120:e2218329120. [PMID: 37043529 PMCID: PMC10120013 DOI: 10.1073/pnas.2218329120] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/28/2022] [Accepted: 03/10/2023] [Indexed: 04/13/2023] Open
Abstract
Coevolution at the gene level, as reflected by correlated events of gene loss or gain, can be revealed by phylogenetic profile analysis. The optimal method and metric for comparing phylogenetic profiles, especially in eukaryotic genomes, are not yet established. Here, we describe a procedure suitable for large-scale analysis, which can reveal coevolution based on the assessment of the statistical significance of correlated presence/absence transitions between gene pairs. This metric can identify coevolution in profiles with low overall similarities and is not affected by similarities lacking coevolutionary information. We applied the procedure to a large collection of 60,912 orthologous gene groups (orthogroups) in 1,264 eukaryotic genomes extracted from OrthoDB. We found significant cotransition scores for 7,825 orthogroups associated in 2,401 coevolving modules linking known and unknown genes in protein complexes and biological pathways. To demonstrate the ability of the method to predict hidden gene associations, we validated through experiments the involvement of vertebrate malate synthase-like genes in the conversion of (S)-ureidoglycolate into glyoxylate and urea, the last step of purine catabolism. This identification explains the presence of glyoxylate cycle genes in metazoa and suggests an anaplerotic role of purine degradation in early eukaryotes.
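The cotransition idea in this abstract can be illustrated with a toy sketch. It assumes that each presence/absence profile is a list of 0/1 values ordered along the species tree, counts gains and losses between adjacent genomes, and scores shared transitions with a hypergeometric tail probability. The function names, the example profiles, and the choice of test are illustrative assumptions, not the authors' published procedure.

```python
from math import comb


def transitions(profile):
    """Map each boundary index to +1 (gain) or -1 (loss) between adjacent genomes."""
    return {
        i: profile[i + 1] - profile[i]
        for i in range(len(profile) - 1)
        if profile[i + 1] != profile[i]
    }


def cotransitions(a, b):
    """Return (transitions in a, transitions in b, concordant, discordant)."""
    ta, tb = transitions(a), transitions(b)
    shared = ta.keys() & tb.keys()
    concordant = sum(1 for i in shared if ta[i] == tb[i])
    return len(ta), len(tb), concordant, len(shared) - concordant


def shared_tail_probability(n_boundaries, n_a, n_b, k):
    """P(at least k shared boundaries) if transitions were placed at random.

    Hypergeometric tail; an assumed stand-in for the paper's significance test.
    """
    total = comb(n_boundaries, n_b)
    return sum(
        comb(n_a, j) * comb(n_boundaries - n_a, n_b - j)
        for j in range(k, min(n_a, n_b) + 1)
    ) / total


# Hypothetical profiles for two orthogroups across eight genomes.
gene_a = [1, 1, 0, 0, 1, 1, 0, 0]
gene_b = [1, 1, 0, 0, 1, 1, 1, 0]
n_a, n_b, concordant, discordant = cotransitions(gene_a, gene_b)
p = shared_tail_probability(len(gene_a) - 1, n_a, n_b, concordant + discordant)
print(n_a, n_b, concordant, discordant, round(p, 3))
```

Scoring transitions rather than overall profile similarity is what lets two profiles with low overall similarity still register as coevolving, which is the property the abstract emphasizes.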
Affiliation(s)
- Elena Dembech
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Marco Malatesta
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Carlo De Rito
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Giulia Mori
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Davide Cavazzini
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Andrea Secchi
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
- Francesco Morandin
  - Department of Mathematical, Physical and Computer Sciences, University of Parma, Parma 43124, Italy
- Riccardo Percudani
  - Department of Chemistry, Life Sciences and Environmental Sustainability, University of Parma, Parma 43124, Italy
3
Oliveira LS, Reyes A, Dutilh BE, Gruber A. Rational Design of Profile HMMs for Sensitive and Specific Sequence Detection with Case Studies Applied to Viruses, Bacteriophages, and Casposons. Viruses 2023; 15:519. [PMID: 36851733 PMCID: PMC9966878 DOI: 10.3390/v15020519] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2022] [Revised: 02/01/2023] [Accepted: 02/09/2023] [Indexed: 02/15/2023] Open
Abstract
Profile hidden Markov models (HMMs) are a powerful way of modeling biological sequence diversity and constitute a very sensitive approach to detecting divergent sequences. Here, we report the development of protocols for the rational design of profile HMMs. These methods were implemented on TABAJARA, a program that can be used to either detect all biological sequences of a group or discriminate specific groups of sequences. By calculating position-specific information scores along a multiple sequence alignment, TABAJARA automatically identifies the most informative sequence motifs and uses them to construct profile HMMs. As a proof-of-principle, we applied TABAJARA to generate profile HMMs for the detection and classification of two viral groups presenting different evolutionary rates: bacteriophages of the Microviridae family and viruses of the Flavivirus genus. We obtained conserved models for the generic detection of any Microviridae or Flavivirus sequence, and profile HMMs that can specifically discriminate Microviridae subfamilies or Flavivirus species. In another application, we constructed Cas1 endonuclease-derived profile HMMs that can discriminate CRISPRs and casposons, two evolutionarily related transposable elements. We believe that the protocols described here, and implemented on TABAJARA, constitute a generic toolbox for generating profile HMMs for the highly sensitive and specific detection of sequence classes.
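As a rough illustration of scoring positions along a multiple sequence alignment, the sketch below computes a per-column Shannon information content and picks the most informative fixed-width window. The scoring formula, the gap handling, the function names, and the toy alignment are assumptions made for this example; TABAJARA's actual scoring and motif selection may differ.

```python
from collections import Counter
from math import log2


def column_information(column, alphabet_size=20):
    """Information content (bits) of one alignment column, ignoring gaps."""
    residues = [c for c in column if c != "-"]
    if not residues:
        return 0.0
    n = len(residues)
    entropy = -sum((k / n) * log2(k / n) for k in Counter(residues).values())
    return log2(alphabet_size) - entropy


def best_window(alignment, width):
    """Return (start, mean information) of the most informative window."""
    scores = [column_information(col) for col in zip(*alignment)]
    means = [
        sum(scores[i:i + width]) / width
        for i in range(len(scores) - width + 1)
    ]
    start = max(range(len(means)), key=means.__getitem__)
    return start, means[start]


# Hypothetical four-sequence protein alignment.
alignment = ["ACDEF", "ACDQW", "ACDKY", "GCDRH"]
start, mean_bits = best_window(alignment, width=2)
print(start, round(mean_bits, 3))
```

A window selected this way could then be cut out of the alignment and passed to a profile HMM builder.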
Affiliation(s)
- Liliane S. Oliveira
  - Department of Parasitology, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo 05508-000, SP, Brazil
- Alejandro Reyes
  - Max Planck Tandem Group in Computational Biology, Department of Biological Sciences, Universidad de los Andes, Bogotá 111711, Colombia
  - The Edison Family Center for Genome Sciences and Systems Biology, Washington University School of Medicine, Saint Louis, MO 63108, USA
- Bas E. Dutilh
  - Institute of Biodiversity, Faculty of Biological Sciences, Cluster of Excellence Balance of the Microverse, Friedrich-Schiller-University Jena, 07743 Jena, Germany
  - Theoretical Biology and Bioinformatics, Science for Life, Utrecht University, Padualaan 8, 3584 CH Utrecht, The Netherlands
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
- Arthur Gruber
  - Department of Parasitology, Instituto de Ciências Biomédicas, Universidade de São Paulo, São Paulo 05508-000, SP, Brazil
  - European Virus Bioinformatics Center, Leutragraben 1, 07743 Jena, Germany
4
Schütze K, Heinzinger M, Steinegger M, Rost B. Nearest neighbor search on embeddings rapidly identifies distant protein relations. FRONTIERS IN BIOINFORMATICS 2022; 2:1033775. [PMID: 36466147 PMCID: PMC9714024 DOI: 10.3389/fbinf.2022.1033775] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Received: 08/31/2022] [Accepted: 10/31/2022] [Indexed: 11/29/2023] Open
Abstract
Since 1992, all state-of-the-art methods for fast and sensitive identification of evolutionary, structural, and functional relations between proteins (also referred to as "homology detection") use sequences and sequence-profiles (PSSMs). Protein Language Models (pLMs) generalize sequences, possibly capturing the same constraints as PSSMs, e.g., through embeddings. Here, we explored how to use such embeddings for nearest neighbor searches to identify relations between protein pairs with diverged sequences (remote homology detection for levels of <20% pairwise sequence identity, PIDE). While this approach excelled for proteins with single domains, we demonstrated the current challenges in applying it to multi-domain proteins and presented some ideas on how to overcome existing limitations, in principle. We observed that sufficiently challenging data set separations were crucial to provide deeply relevant insights into the behavior of nearest neighbor search when applied to the protein embedding space, and made all our methods readily available for others.
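A minimal sketch of nearest neighbor search over embeddings is given below, assuming that each protein is already represented by one fixed-length vector and that cosine similarity is the comparison measure. The brute-force ranking, the function names, and the three-dimensional toy vectors are illustrative assumptions; the abstract does not specify the index or distance used by the authors.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = sqrt(sum(x * x for x in u)) * sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def nearest_neighbors(query, database, k=1):
    """Rank database entries by similarity to the query and return the top k."""
    scored = [(name, cosine(query, vec)) for name, vec in database.items()]
    scored.sort(key=lambda item: item[1], reverse=True)
    return scored[:k]


# Hypothetical per-protein embeddings (real pLM embeddings have far more dimensions).
database = {
    "protA": [1.0, 0.0, 0.0],
    "protB": [0.9, 0.1, 0.0],
    "protC": [0.0, 1.0, 0.0],
}
query = [1.0, 0.0, 0.1]
for name, score in nearest_neighbors(query, database, k=2):
    print(name, round(score, 3))
```

A single vector per protein is also one way to see why multi-domain proteins are harder: averaging over several domains can blur the signal of each one.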
Affiliation(s)
- Konstantin Schütze
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology—i12, Munich, Germany
- Michael Heinzinger
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology—i12, Munich, Germany
  - TUM Graduate School, Center of Doctoral Studies in Informatics and its Applications (CeDoSIA), Garching, Germany
- Martin Steinegger
  - School of Biological Sciences, Seoul National University, Seoul, South Korea
  - Artificial Intelligence Institute, Seoul National University, Seoul, South Korea
- Burkhard Rost
  - TUM (Technical University of Munich) Department of Informatics, Bioinformatics & Computational Biology—i12, Munich, Germany
  - Institute for Advanced Study (TUM-IAS), Germany & TUM School of Life Sciences Weihenstephan (WZW), Freising, Germany
5
Abrahim M, Machado E, Alvarez-Valín F, de Miranda AB, Catanho M. Uncovering Pseudogenes and Intergenic Protein-coding Sequences in TriTryps' Genomes. Genome Biol Evol 2022; 14:6754225. [PMID: 36208292 PMCID: PMC9576210 DOI: 10.1093/gbe/evac142] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/10/2022] [Revised: 09/14/2022] [Accepted: 09/20/2022] [Indexed: 01/24/2023] Open
Abstract
Trypanosomatids belong to a remarkable group of unicellular, parasitic organisms of the order Kinetoplastida, an early diverging branch of the phylogenetic tree of eukaryotes, exhibiting intriguing biological characteristics affecting gene expression (intronless polycistronic transcription, trans-splicing, and RNA editing), metabolism, surface molecules, and organelles (compartmentalization of glycolysis, variation of the surface molecules, and unique mitochondrial DNA), cell biology and life cycle (phagocytic vacuoles evasion and intricate patterns of cell morphogenesis). With numerous genomic-scale data of several trypanosomatids becoming available since 2005 (genomes, transcriptomes, and proteomes), the scientific community can further investigate the mechanisms underlying these unusual features and address other unexplored phenomena possibly revealing biological aspects of the early evolution of eukaryotes. One fundamental aspect comprises the processes and mechanisms involved in the acquisition and loss of genes throughout the evolutionary history of these primitive microorganisms. Here, we present a comprehensive in silico analysis of pseudogenes in three major representatives of this group: Leishmania major, Trypanosoma brucei, and Trypanosoma cruzi. Pseudogenes, DNA segments originating from altered genes that lost their original function, are genomic relics that can offer an essential record of the evolutionary history of functional genes, as well as clues about the dynamics and evolution of hosting genomes. 
Scanning these genomes with functional proteins as proxies to reveal intergenic regions with protein-coding features, relying on a customized threshold to distinguish statistically and biologically significant sequence similarities, and reassembling remnant sequences from their debris, we found thousands of pseudogenes and hundreds of open reading frames, with particular characteristics in each trypanosomatid: mutation profile, number, content, density, codon bias, average size, single- or multi-copy gene origin, number and type of mutations, putative primitive function, and transcriptional activity. These features suggest a common process of pseudogene formation, different patterns of pseudogene evolution and extant biological functions, and/or distinct genome organization undertaken by those parasites during evolution, as well as different evolutionary and/or selective pressures acting on distinct lineages.
Affiliation(s)
- Mayla Abrahim
  - Laboratório de Tecnologia Imunológica, Instituto de Tecnologia em Imunobiológicos, Vice-Diretoria de Desenvolvimento Tecnológico, Bio-Manguinhos, Fundação Oswaldo Cruz (FIOCRUZ), Rio de Janeiro, RJ, Brazil
- Edson Machado
  - Laboratório de Biologia Molecular Aplicada a Micobactérias, Instituto Oswaldo Cruz, Fiocruz, Brazil
- Fernando Alvarez-Valín
  - Unidad de Genómica Evolutiva, Sección Biomatemática, Universidad de la República del Uruguay, Montevideo, Uruguay
6
Karaoz U, Brodie EL. microTrait: A Toolset for a Trait-Based Representation of Microbial Genomes. FRONTIERS IN BIOINFORMATICS 2022; 2:918853. [PMID: 36304272 PMCID: PMC9580909 DOI: 10.3389/fbinf.2022.918853] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 04/12/2022] [Accepted: 06/20/2022] [Indexed: 11/29/2023] Open
Abstract
Remote sensing approaches have revolutionized the study of macroorganisms, allowing theories of population and community ecology to be tested across increasingly larger scales without much compromise in resolution of biological complexity. In microbial ecology, our remote window into the ecology of microorganisms is through the lens of genome sequencing. For microbial organisms, recent evidence from genomes recovered from metagenomic samples corroborates a highly complex view of their metabolic diversity and other associated traits which map into high physiological complexity. Regardless, during the first decades of this omics era, microbial ecological research has primarily focused on taxa and functional genes as ecological units, favoring breadth of coverage over resolution of biological complexity manifested as physiological diversity. Recently, the rate at which provisional draft genomes are generated has increased substantially, giving new insights into ecological processes and interactions. From a genotype perspective, the wide availability of genome-centric data requires new data synthesis approaches that place organismal genomes center stage in the study of environmental roles and functional performance. Extraction of ecologically relevant traits from microbial genomes will be essential to the future of microbial ecological research. Here, we present microTrait, a computational pipeline that infers and distills ecologically relevant traits from microbial genome sequences. microTrait maps a genome sequence into a trait space, including discrete and continuous traits, as well as simple and composite traits. Traits are inferred from genes and pathways representing energetic, resource acquisition, and stress tolerance mechanisms, while genome-wide signatures are used to infer composite, or life history, traits of microorganisms. This approach is extensible to any microbial habitat, although we provide initial examples of this approach with reference to soil microbiomes.
Affiliation(s)
- Ulas Karaoz
  - Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
- Eoin L. Brodie
  - Earth and Environmental Sciences, Lawrence Berkeley National Laboratory, Berkeley, CA, United States
  - Department of Environmental Science, Policy and Management, University of California, Berkeley, CA, United States
7
Rodriguez-Valera F, Pushkarev A, Rosselli R, Béjà O. Searching Metagenomes for New Rhodopsins. Methods Mol Biol 2022; 2501:101-108. [PMID: 35857224 DOI: 10.1007/978-1-0716-2329-9_4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
Most microbial groups have not been cultivated yet, and the only way to approach the enormous diversity of rhodopsins that they contain in a sensible timeframe is through the analysis of their genomes. High-throughput sequencing technologies have allowed the release of community genomic (metagenomic) data from many habitats in the photic zones of the ocean and lakes. The harvest is already impressive, ranging from the first bacterial rhodopsin (proteorhodopsin) to the recent discovery of heliorhodopsin by functional metagenomics. However, the search continues using bioinformatic or biochemical routes.
Affiliation(s)
- Francisco Rodriguez-Valera
  - Evolutionary Genomics Group, Departamento de Producción Vegetal y Microbiología, Universidad Miguel Hernández, San Juan de Alicante, Alicante, Spain
  - Research Center for Molecular Mechanisms of Aging and Age-Related Diseases, Moscow Institute of Physics and Technology (National Research University), Dolgoprudny, Russia
- Alina Pushkarev
  - Faculty of Biology, Technion - Israel Institute of Technology, Haifa, Israel
- Riccardo Rosselli
  - Departamento de Fisiología, Genética y Microbiología, Facultad de Ciencias, Universidad de Alicante, Alicante, Spain
- Oded Béjà
  - Faculty of Biology, Technion - Israel Institute of Technology, Haifa, Israel
8
Sensitive protein alignments at tree-of-life scale using DIAMOND. Nat Methods 2021; 18:366-368. [PMID: 33828273 PMCID: PMC8026399 DOI: 10.1038/s41592-021-01101-x] [Citation(s) in RCA: 883] [Impact Index Per Article: 294.3] [Received: 07/23/2020] [Accepted: 02/22/2021] [Indexed: 12/05/2022]
Abstract
We are at the beginning of a genomic revolution in which all known species are planned to be sequenced. Accessing such data for comparative analyses is crucial in this new age of data-driven biology. Here, we introduce an improved version of DIAMOND that greatly exceeds previous search performances and harnesses supercomputing to perform tree-of-life scale protein alignments in hours, while matching the sensitivity of the gold standard BLASTP. An updated version of DIAMOND uses improved algorithmic procedures and a customized high-performance computing framework to make seemingly prohibitive large-scale protein sequence alignments feasible.
9
Rational Design of Profile Hidden Markov Models for Viral Classification and Discovery. Bioinformatics 2021. [DOI: 10.36255/exonpublications.bioinformatics.2021.ch9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] Open
10
Characterization of a Novel Mitovirus of the Sand Fly Lutzomyia longipalpis Using Genomic and Virus-Host Interaction Signatures. Viruses 2020; 13:v13010009. [PMID: 33374584 PMCID: PMC7822452 DOI: 10.3390/v13010009] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 11/30/2020] [Revised: 12/17/2020] [Accepted: 12/21/2020] [Indexed: 02/06/2023] Open
Abstract
Hematophagous insects act as the major reservoirs of infectious agents due to their intimate contact with a large variety of vertebrate hosts. Lutzomyia longipalpis is the main vector of Leishmania chagasi in the New World, but its role as a host of viruses is poorly understood. In this work, Lu. longipalpis RNA libraries were subjected to progressive assembly using viral profile HMMs as seeds. A sequence phylogenetically related to fungal viruses of the genus Mitovirus was identified and this novel virus was named Lul-MV-1. The 2697-base genome presents a single gene coding for an RNA-directed RNA polymerase with an organellar genetic code. To determine the possible host of Lul-MV-1, we analyzed the molecular characteristics of the viral genome. Dinucleotide composition and codon usage showed profiles similar to mitochondrial DNA of invertebrate hosts. Also, the virus-derived small RNA profile was consistent with the activation of the siRNA pathway, with size distribution and 5′ base enrichment analogous to those observed in viruses of sand flies, reinforcing Lu. longipalpis as a putative host. Finally, RT-PCR of different insect pools and sequences of public Lu. longipalpis RNA libraries confirmed the high prevalence of Lul-MV-1. This is the first report of a mitovirus infecting an insect host.
11
Urban G, Torrisi M, Magnan CN, Pollastri G, Baldi P. Protein profiles: Biases and protocols. Comput Struct Biotechnol J 2020; 18:2281-2289. [PMID: 32994887 PMCID: PMC7486441 DOI: 10.1016/j.csbj.2020.08.015] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 05/14/2020] [Revised: 08/14/2020] [Accepted: 08/15/2020] [Indexed: 11/13/2022] Open
Abstract
The use of evolutionary profiles to predict protein secondary structure, as well as other protein structural features, has been standard practice since the 1990s. Using profiles in the input of such predictors, in place or in addition to the sequence itself, leads to significantly more accurate predictions. While profiles can enhance structural signals, their role remains somewhat surprising as proteins do not use profiles when folding in vivo. Furthermore, the same sequence-based redundancy reduction protocols initially derived to train and evaluate sequence-based predictors, have been applied to train and evaluate profile-based predictors. This can lead to unfair comparisons since profiles may facilitate the bleeding of information between training and test sets. Here we use the extensively studied problem of secondary structure prediction to better evaluate the role of profiles and show that: (1) high levels of profile similarity between training and test proteins are observed when using standard sequence-based redundancy protocols; (2) the gain in accuracy for profile-based predictors, over sequence-based predictors, strongly relies on these high levels of profile similarity between training and test proteins; and (3) the overall accuracy of a profile-based predictor on a given protein dataset provides a biased measure when trying to estimate the actual accuracy of the predictor, or when comparing it to other predictors. We show, however, that this bias can be mitigated by implementing a new protocol (EVALpro) which evaluates the accuracy of profile-based predictors as a function of the profile similarity between training and test proteins. Such a protocol not only allows for a fair comparison of the predictors on equally hard or easy examples, but also reduces the impact of choosing a given similarity cutoff when selecting test proteins. 
The EVALpro program is available in the SCRATCH suite (www.scratch.proteomics.ics.uci.edu) and can be downloaded at: www.download.igb.uci.edu/#evalpro.
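The idea of reporting accuracy as a function of profile similarity, rather than as one pooled number, can be sketched as follows. The record layout, the bin edges, and the function name are assumptions for illustration only and do not reproduce the EVALpro program.

```python
from bisect import bisect_right
from collections import defaultdict


def accuracy_by_similarity(records, edges=(0.25, 0.5, 0.75)):
    """Group per-protein results by similarity bin and return accuracy per bin.

    records: iterable of (similarity, n_correct, n_residues) tuples, where
    similarity is the test protein's profile similarity to the training set.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for similarity, n_correct, n_residues in records:
        b = bisect_right(edges, similarity)  # bin index from 0 to len(edges)
        correct[b] += n_correct
        total[b] += n_residues
    return {b: correct[b] / total[b] for b in sorted(total)}


# Hypothetical per-protein secondary structure prediction results.
records = [(0.1, 60, 100), (0.2, 70, 100), (0.6, 85, 100), (0.9, 95, 100)]
for bin_index, accuracy in accuracy_by_similarity(records).items():
    print(bin_index, round(accuracy, 2))
```

Comparing two predictors bin by bin keeps easy, highly similar test proteins from dominating the comparison.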
Affiliation(s)
- Gregor Urban
  - Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
- Mirko Torrisi
  - UCD Institute for Discovery, University College Dublin, Dublin 4, Ireland
- Christophe N Magnan
  - Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
- Gianluca Pollastri
  - UCD Institute for Discovery, University College Dublin, Dublin 4, Ireland
- Pierre Baldi
  - Department of Computer Science & Institute for Genomics and Bioinformatics, University of California, Irvine, CA 92697, USA
12
Prediction of Protein Tertiary Structure via Regularized Template Classification Techniques. Molecules 2020; 25:molecules25112467. [PMID: 32466409 PMCID: PMC7321371 DOI: 10.3390/molecules25112467] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 04/23/2020] [Revised: 05/21/2020] [Accepted: 05/22/2020] [Indexed: 11/24/2022] Open
Abstract
We discuss the use of regularized linear discriminant analysis (LDA) as a model reduction technique combined with particle swarm optimization (PSO) in protein tertiary structure prediction, followed by structure refinement based on singular value decomposition (SVD) and PSO. The algorithm presented in this paper belongs to the category of template-based modeling. The algorithm performs a preselection of protein templates before constructing a lower dimensional subspace via a regularized LDA. The protein coordinates in the reduced space are sampled using a highly explorative optimization algorithm, regressive–regressive PSO (RR-PSO). The obtained structure is then projected onto a reduced space via singular value decomposition and further optimized via RR-PSO to carry out a structure refinement. The final structures are similar to those predicted by the best structure prediction tools, such as the Rosetta and Zhang servers. The main advantage of our methodology is that it alleviates the ill-posed character of protein structure prediction problems related to high dimensional optimization. It is also capable of sampling a wide range of conformational space due to the application of a regularized linear discriminant analysis, which allows us to expand the differences over a reduced basis set.
13
Jin X, Liao Q, Liu B. PL-search: a profile-link-based search method for protein remote homology detection. Brief Bioinform 2020; 22:5840006. [PMID: 32427287 DOI: 10.1093/bib/bbaa051] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 01/16/2020] [Revised: 03/11/2020] [Accepted: 03/12/2020] [Indexed: 12/26/2022] Open
Abstract
Protein remote homology detection is a fundamental and important task for protein structure and function analysis. Several search methods have been proposed to improve the detection performance of the remote homologues and the accuracy of ranking lists. The position-specific scoring matrix (PSSM) profile and hidden Markov model (HMM) profile can contribute to improving the performance of the state-of-the-art search methods. In this paper, we improved the profile-link (PL) information for constructing PSSM or HMM profiles, and proposed a PL-based search method (PL-search). In PL-search, more robust PLs are constructed through the double-link and iterative extending strategies, and an accurate similarity score of sequence pairs is calculated from the two-level Jaccard distance for remote homologues. We tested our method on two widely used benchmark datasets. Our results show that whether HHblits, JackHMMER or position-specific iterated-BLAST is used, PL-search obviously improves the search performance in terms of ranking quality as well as the number of detected remote homologues. For ease of use of PL-search, both its stand-alone tool and the web server are constructed, which can be accessed at http://bliulab.net/PL-search/.
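The abstract does not define the two-level Jaccard distance, so the sketch below shows only the plain Jaccard distance between the sets of sequences that two proteins link to, which is the basic building block such a score would rest on. Representing profile links as Python sets, and the names used here, are assumptions for illustration.

```python
def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B|; defined as 0.0 when both sets are empty."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0
    return 1.0 - len(a & b) / len(union)


def rank_by_shared_links(query_links, candidates):
    """Sort candidate proteins by Jaccard distance of their link sets to the query."""
    return sorted(
        candidates,
        key=lambda name: jaccard_distance(query_links, candidates[name]),
    )


# Hypothetical link sets, e.g. hits collected from a profile-based search.
query_links = {"p1", "p2", "p3"}
candidates = {
    "candX": {"p2", "p3", "p4"},
    "candY": {"p7", "p8"},
    "candZ": {"p1", "p2", "p3"},
}
print(rank_by_shared_links(query_links, candidates))
```

Two sequences with no detectable direct similarity can still score as close if they link to many of the same intermediate sequences.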
|
14
|
Rao R, Bhattacharya N, Thomas N, Duan Y, Chen X, Canny J, Abbeel P, Song YS. Evaluating Protein Transfer Learning with TAPE. ADVANCES IN NEURAL INFORMATION PROCESSING SYSTEMS 2019; 32:9689-9701. [PMID: 33390682 PMCID: PMC7774645] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/12/2023]
Abstract
Machine learning applied to protein sequences is an increasingly popular area of research. Semi-supervised learning for proteins has emerged as an important paradigm due to the high cost of acquiring supervised protein labels, but the current literature is fragmented when it comes to datasets and standardized evaluation techniques. To facilitate progress in this field, we introduce the Tasks Assessing Protein Embeddings (TAPE), a set of five biologically relevant semi-supervised learning tasks spread across different domains of protein biology. We curate tasks into specific training, validation, and test splits to ensure that each task tests biologically relevant generalization that transfers to real-life scenarios. We benchmark a range of approaches to semi-supervised protein representation learning, which span recent work as well as canonical sequence learning techniques. We find that self-supervised pretraining is helpful for almost all models on all tasks, more than doubling performance in some cases. Despite this increase, in several cases features learned by self-supervised pretraining still lag behind features extracted by state-of-the-art non-neural techniques. This gap in performance suggests a huge opportunity for innovative architecture design and improved modeling paradigms that better capture the signal in biological sequences. TAPE will help the machine learning community focus effort on scientifically relevant problems. Toward this end, all data and code used to run these experiments are available at https://github.com/songlab-cal/tape.
|
15
|
Trivedi R, Nagarajaram HA. Amino acid substitution scoring matrices specific to intrinsically disordered regions in proteins. Sci Rep 2019; 9:16380. [PMID: 31704957 PMCID: PMC6841959 DOI: 10.1038/s41598-019-52532-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.0] [Received: 07/18/2019] [Accepted: 10/15/2019] [Indexed: 01/09/2023]
Abstract
An amino acid substitution scoring matrix encapsulates the rates at which various amino acid residues in proteins are substituted by other amino acid residues over time. Database search methods make use of substitution scoring matrices to identify sequences with homologous relationships. However, widely used substitution scoring matrices, such as the BLOSUM series, have been developed using aligned blocks that are mostly devoid of disordered regions in proteins. Hence, these substitution scoring matrices are mostly inappropriate for homology searches involving proteins enriched with disordered regions, as the disordered regions have a distinct amino acid compositional bias and are therefore expected to have undergone amino acid substitutions that are distinct from those in the ordered regions. We therefore developed a novel series of substitution scoring matrices, referred to as EDSSMat, by exclusively considering the substitution frequencies of amino acids in the disordered regions of eukaryotic proteins. The newly developed matrices were tested for their ability to detect homologs of proteins enriched with disordered regions by means of the SSEARCH tool. The results unequivocally demonstrate that EDSSMat matrices detect more homologs than the widely used BLOSUM, PAM and other standard matrices, indicating their utility for homology searches of intrinsically disordered proteins.
Affiliation(s)
- Rakesh Trivedi
- Laboratory of Computational Biology, Centre for DNA Fingerprinting and Diagnostics, Uppal, Hyderabad, Telangana, 500039, India; Graduate School, Manipal Academy of Higher Education, Manipal, Karnataka, 576104, India
- Hampapathalu Adimurthy Nagarajaram
- Department of Systems and Computational Biology, School of Life Sciences, University of Hyderabad, Hyderabad, Telangana, 500 046, India; Centre for Modelling, Simulation and Design, University of Hyderabad, Hyderabad, Telangana, 500 046, India.
|
16
|
van Weezep E, Kooi EA, van Rijn PA. PCR diagnostics: In silico validation by an automated tool using freely available software programs. J Virol Methods 2019; 270:106-112. [PMID: 31095975 PMCID: PMC7113775 DOI: 10.1016/j.jviromet.2019.05.002] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 03/19/2019] [Revised: 04/18/2019] [Accepted: 05/11/2019] [Indexed: 11/15/2022]
Abstract
In silico validation of PCR tests using exponentially expanding databases. The need for regular in silico validation of PCR tests by expanding databases. Fulfilling quality standards of in silico validation of molecular diagnostics.
PCR diagnostics are often the first line of laboratory diagnostics and are regularly designed to either differentiate between or detect all pathogen variants of a family, genus or species. The ideal PCR test detects all variants of the target pathogen, including newly discovered and emerging variants, while closely related pathogens and their variants should not be detected. This is challenging as pathogens show a high degree of genetic variation due to genetic drift, adaptation and evolution. Therefore, frequent re-evaluation of PCR diagnostics is needed to monitor their usefulness. Validation of PCR diagnostics comprises three stages: in silico, in vitro and in vivo validation. In vitro and in vivo testing are usually costly and labour intensive, and carry a risk of handling dangerous pathogens. In silico validation reduces this burden. In silico validation checks primers and probes by comparing their sequences with available nucleotide sequences. In recent years the number of available sequences has dramatically increased through high throughput and deep sequencing projects. This makes in silico validation more informative, but also more computing intensive. To facilitate validation of PCR tests, a software tool named PCRv was developed. PCRv consists of a user friendly graphical user interface and coordinates the use of the software programs ClustalW and SSEARCH in order to perform in silico validation of PCR tests of different formats. Use of internal control sequences makes the analysis compliant with laboratory quality control systems. Finally, PCRv generates a validation report that includes an overview as well as a list of detailed results. In-house developed, published and OIE-recommended PCR tests were easily (re-)evaluated by use of PCRv. To demonstrate the power of PCRv, in silico validation of several PCR tests is shown and discussed.
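The core check described above, comparing a primer with an available nucleotide sequence, can be illustrated with a minimal sketch. It slides the primer along a target without gaps and reports the placement with the fewest mismatches. This is an assumption-laden toy, not how PCRv works (the abstract states that PCRv delegates alignment to ClustalW and SSEARCH); the function name and both sequences are hypothetical.

```python
def best_primer_match(primer, target):
    """Return (position, mismatches) of the best ungapped primer placement."""
    if len(primer) > len(target):
        raise ValueError("primer is longer than target")
    best_pos, best_mm = 0, len(primer) + 1
    for pos in range(len(target) - len(primer) + 1):
        window = target[pos:pos + len(primer)]
        # Hamming distance between the primer and this window of the target.
        mm = sum(1 for x, y in zip(primer, window) if x != y)
        if mm < best_mm:
            best_pos, best_mm = pos, mm
    return best_pos, best_mm


# Hypothetical primer and target sequence.
primer = "ACGTTG"
target = "TTACGATGCC"
position, mismatches = best_primer_match(primer, target)
```

Running such a check against every sequence of the target pathogen and of its close relatives gives a rough picture of which variants a primer would be expected to bind.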
Affiliation(s)
- Erik van Weezep
- Department of Virology, Wageningen Bioveterinary Research (WBVR), Lelystad, the Netherlands.
- Engbert A Kooi
- Department of Virology, Wageningen Bioveterinary Research (WBVR), Lelystad, the Netherlands.
- Piet A van Rijn
- Department of Virology, Wageningen Bioveterinary Research (WBVR), Lelystad, the Netherlands; Department of Biochemistry, North West University, Potchefstroom, South Africa.
|
17
|
Kirsip H, Abroi A. Protein Structure-Guided Hidden Markov Models (HMMs) as A Powerful Method in the Detection of Ancestral Endogenous Viral Elements. Viruses 2019; 11:v11040320. [PMID: 30986983 PMCID: PMC6520822 DOI: 10.3390/v11040320] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.8] [Received: 01/29/2019] [Revised: 03/23/2019] [Accepted: 03/27/2019] [Indexed: 12/19/2022]
Abstract
It has long been believed that the transfer and fixation of genetic material from RNA viruses to eukaryote genomes is very unlikely. However, during the last decade, several cases of “virus-to-host” gene transfer from various viral families into various eukaryotic phyla have been described. These transfers have been identified by sequence similarity, which may disappear very quickly, especially in the case of RNA viruses. However, compared to sequences, protein structure is known to be more conserved. Applying protein structure-guided protein domain-specific Hidden Markov Models, we detected homologues of the Virgaviridae capsid protein in Schizophora flies. Further data analysis supported “virus-to-host” transfer into Schizophora ancestors as a single transfer event. This transfer was not identifiable by BLAST or by other methods we applied. Our data show that structure-guided Hidden Markov Models should be used to detect ancestral virus-to-host transfers.
Affiliation(s)
- Heleri Kirsip
- Department of Bioinformatics, University of Tartu, Tartu, 51010, Riia 23, Estonia.
- Aare Abroi
- Institute of Technology, University of Tartu, Tartu, 50411, Nooruse 1, Estonia.
|
18
|
Streptococcus mitis Expressing Pneumococcal Serotype 1 Capsule. Sci Rep 2018; 8:17959. [PMID: 30568178 PMCID: PMC6299277 DOI: 10.1038/s41598-018-35921-3] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 08/13/2018] [Accepted: 11/08/2018] [Indexed: 01/22/2023]
Abstract
Streptococcus pneumoniae's polysaccharide capsule is an important virulence factor; vaccine-induced immunity to specific capsular polysaccharide effectively prevents disease. Serotype 1 S. pneumoniae is rarely found in healthy persons, but is highly invasive and a common cause of meningitis outbreaks and invasive disease outside of the United States. Here we show that genes for polysaccharide capsule similar to those expressed by pneumococci were commonly detected by polymerase chain reaction among upper respiratory tract samples from older US adults not carrying pneumococci. Serotype 1-specific genes were predominantly detected. In five oropharyngeal samples tested, serotype 1 gene belonging to S. mitis expressed capsules immunologically indistinct from pneumococcal capsules. Whole genome sequencing revealed three distinct S. mitis clones, each representing a cps1 operon highly similar to the pneumococcal cps1 reference operon. These findings raise important questions about the contribution of commensal streptococci to natural immunity against pneumococci, a leading cause of mortality worldwide.
|
19
|
Ahuja AK, Cheema RS. Homology between cattle bull sperm and bacterial antigenic proteins viz a viz possible role in immunological infertility. Reprod Domest Anim 2018; 53:1530-1538. [DOI: 10.1111/rda.13292] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/20/2018] [Accepted: 07/23/2018] [Indexed: 12/17/2022]
Affiliation(s)
- Ankit Kumar Ahuja
- Department of Veterinary Gynaecology and Obstetrics, GADVASU, Ludhiana, Punjab, India
- Ranjna S. Cheema
- Department of Veterinary Gynaecology and Obstetrics, GADVASU, Ludhiana, Punjab, India
|
20
|
Analysis of sequencing strategies and tools for taxonomic annotation: Defining standards for progressive metagenomics. Sci Rep 2018; 8:12034. [PMID: 30104688 PMCID: PMC6089906 DOI: 10.1038/s41598-018-30515-5] [Citation(s) in RCA: 55] [Impact Index Per Article: 9.2] [Received: 01/22/2018] [Accepted: 07/24/2018] [Indexed: 12/30/2022]
Abstract
Metagenomics research has recently thrived due to improvements in DNA sequencing technologies, driving the emergence of new analysis tools and the growth of taxonomic databases. However, there is no all-purpose strategy that can guarantee the best result for a given project, and there are several combinations of software, parameters and databases that can be tested. Therefore, we performed an impartial comparison, using statistical measures of classification for eight bioinformatic tools and four taxonomic databases, defining a benchmark framework to evaluate each tool in a standardized context. Using in silico simulated data for 16S rRNA amplicons and whole metagenome shotgun data, we compared the results from different software and database combinations to detect biases related to algorithms or database annotation. Using our benchmark framework, researchers can define cut-off values to evaluate the expected error rate and coverage for their results, regardless of the score used by each software. A quick guide to select the best tool, together with all datasets and scripts to reproduce our results and benchmark any new method, is available at https://github.com/Ales-ibt/Metagenomic-benchmark. Finally, we stress the importance of gold standards, database curation and manual inspection of taxonomic profiling results for a better and more accurate description of microbial diversity.
|
21
|
Govindarajan R, Leela BC, Nair AS. RBLOSUM performs better than CorBLOSUM with lesser error per query. BMC Res Notes 2018; 11:328. [PMID: 29784028 PMCID: PMC5963171 DOI: 10.1186/s13104-018-3415-5] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/06/2017] [Accepted: 05/07/2018] [Indexed: 11/18/2022]
Abstract
Objective: BLOSUM matrices serve as standard matrices for many protein sequence alignment programs. BLOSUM matrices have been constructed using BLOCKS version 5.0 with 27,102 BLOCKS, whereas the latest updated version 14.3 has 6,739,916 BLOCKS. We read with interest the research article by Hess et al. (BMC Bioinform 17:189, 2016) on CorBLOSUM, wherein it is argued that an inaccuracy in the BLOSUM code affects the cluster memberships of sequences. They show that replacing the integer-based clustering threshold with a floating-point one arguably improves the performance of CorBLOSUM over BLOSUM and RBLOSUM matrices. They compare BLOSUM62 (BLOCKS 14.3) against RBLOSUM69, with relative entropies of 0.2685 and 0.2662 respectively. The present work attempts to repeat the computation to verify the respective analog matrices. Results: In our attempt to repeat the computation, we observed that the relative entropy of BLOSUM62 (BLOCKS 14.3) is 0.2360 and that of BLOSUM50 (BLOCKS 14.3) is 0.1198. As only matrices of similar entropies can be compared, BLOSUM62 can be compared only with RBLOSUM66 and BLOSUM50 can be compared only with RBLOSUM56. We conducted experiments with Astral data sets, and demonstrated the improved accuracy in the coverage. Our results imply that RBLOSUM performs statistically better than CorBLOSUM and BLOSUM matrices. Electronic supplementary material: The online version of this article (10.1186/s13104-018-3415-5) contains supplementary material, which is available to authorized users.
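The comparison above hinges on the relative entropy of a substitution matrix. As a minimal sketch, assuming the standard definition H = Σ q_ab · log2(q_ab / (p_a · p_b)) over ordered residue pairs, the code below computes it for a toy two-letter alphabet. The function name and the toy frequencies are hypothetical; real BLOSUM values come from BLOCKS-derived pair counts over the 20 amino acids.

```python
import math


def relative_entropy(q):
    """Relative entropy (bits) of a substitution matrix from pair frequencies.

    q maps ordered residue pairs (a, b) to target frequencies that sum to 1
    and are symmetric, i.e. q[(a, b)] == q[(b, a)].
    """
    # Background frequency of each residue is the marginal of q.
    p = {}
    for (a, _b), freq in q.items():
        p[a] = p.get(a, 0.0) + freq
    # Pairs with zero target frequency contribute nothing to the sum.
    return sum(
        f * math.log2(f / (p[a] * p[b])) for (a, b), f in q.items() if f > 0
    )


# Toy target frequencies: identical pairs are favoured over mismatches.
toy_q = {("A", "A"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1, ("B", "B"): 0.4}
h_bits = relative_entropy(toy_q)
```

A matrix built from more closely related sequences concentrates frequency on identical pairs and so has a higher relative entropy, which is why only matrices of similar entropy are compared with one another.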
Affiliation(s)
- Renganayaki Govindarajan
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India.
- Biji Christopher Leela
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
- Achuthsankar S Nair
- Department of Computational Biology and Bioinformatics, University of Kerala, Thiruvananthapuram, Kerala, India
|
22
|
Kihara D, Yang YD, Hawkins T. Bioinformatics Resources for Cancer Research with an Emphasis on Gene Function and Structure Prediction Tools. Cancer Inform 2017. [DOI: 10.1177/117693510600200020] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 01/19/2023]
Abstract
The immensely popular fields of cancer research and bioinformatics overlap in many different areas, e.g. large data repositories that allow users to analyze data from many experiments (data handling, databases), pattern mining, microarray data analysis, and interpretation of proteomics data. There are many newly available resources in these areas that may be unfamiliar to most cancer researchers wanting to incorporate bioinformatics tools and analyses into their work, and also to bioinformaticians looking for real data to develop and test algorithms. This review reveals the interdependence of cancer research and bioinformatics, and highlights the most appropriate and useful resources available to cancer researchers. These include not only public databases, but also general and specific bioinformatics tools which can be useful to the cancer researcher. The primary foci are function and structure prediction tools of protein genes. The result is a useful reference for cancer researchers and for bioinformaticians studying cancer.
Affiliation(s)
- Daisuke Kihara
- Department of Biological Sciences; College of Science, Purdue University, West Lafayette, IN, 47907, USA
- Department of Computer Science; College of Science, Purdue University, West Lafayette, IN, 47907, USA
- Markey Center for Structural Biology; College of Science, Purdue University, West Lafayette, IN, 47907, USA
- The Bindley Bioscience Center, College of Science, Purdue University, West Lafayette, IN, 47907, USA
- Yifeng David Yang
- Department of Biological Sciences; College of Science, Purdue University, West Lafayette, IN, 47907, USA
- Troy Hawkins
- Department of Biological Sciences; College of Science, Purdue University, West Lafayette, IN, 47907, USA
|
23
|
Keul F, Hess M, Goesele M, Hamacher K. PFASUM: a substitution matrix from Pfam structural alignments. BMC Bioinformatics 2017; 18:293. [PMID: 28583067 PMCID: PMC5460430 DOI: 10.1186/s12859-017-1703-z] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 03/01/2017] [Accepted: 05/22/2017] [Indexed: 11/10/2022]
Abstract
Background: Detecting homologous protein sequences and computing multiple sequence alignments (MSA) are fundamental tasks in molecular bioinformatics. These tasks usually require a substitution matrix for modeling evolutionary substitution events derived from a set of aligned sequences. Over the last years, the known sequence space increased drastically and several publications demonstrated that this can lead to significantly better performing matrices. Interestingly, matrices based on dated sequence datasets are still the de facto standard for both tasks even though their data basis may limit their capabilities. We address these aspects by presenting a new substitution matrix series called PFASUM. These matrices are derived from Pfam seed MSAs using a novel algorithm and thus build upon expert ground truth data covering a large and diverse sequence space. Results: We show results for two use cases: First, we tested the homology search performance of PFASUM matrices on up-to-date ASTRAL databases with varying sequence similarity. Our study shows that the usage of PFASUM matrices can lead to significantly better homology search results when compared to conventional matrices. PFASUM matrices with comparable relative entropies to the commonly used substitution matrices BLOSUM50, BLOSUM62, PAM250, VTML160 and VTML200 outperformed their corresponding counterparts in 93% of all test cases. A general assessment also comparing matrices with different relative entropies showed that PFASUM matrices delivered the best homology search performance in the test set. Second, our results demonstrate that the usage of PFASUM matrices for MSA construction improves their quality when compared to conventional matrices. On up-to-date MSA benchmarks, at least 60% of all MSAs were reconstructed in an equal or higher quality when using MUSCLE with PFASUM31, PFASUM43 and PFASUM60 matrices instead of conventional matrices. This rate even increases to at least 76% for MSAs containing similar sequences. Conclusions: We present the novel PFASUM substitution matrices derived from manually curated MSA ground truth data covering the currently known sequence space. Our results imply that PFASUM matrices improve homology search performance as well as MSA quality in many cases when compared to conventional substitution matrices. Hence, we encourage the usage of PFASUM matrices and especially PFASUM60 for these specific tasks. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1703-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Frank Keul
- Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
- Martin Hess
- Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany.
- Michael Goesele
- Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
- Kay Hamacher
- Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
|
24
|
Abstract
Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. © 2016 by John Wiley & Sons, Inc.
Affiliation(s)
- Benjamin Webb
- University of California at San Francisco, San Francisco, California
- Andrej Sali
- University of California at San Francisco, San Francisco, California
|
25
|
Fidler DR, Murphy SE, Courtis K, Antonoudiou P, El-Tohamy R, Ient J, Levine TP. Using HHsearch to tackle proteins of unknown function: A pilot study with PH domains. Traffic 2016; 17:1214-1226. [PMID: 27601190 PMCID: PMC5091641 DOI: 10.1111/tra.12432] [Citation(s) in RCA: 38] [Impact Index Per Article: 4.8] [Received: 04/27/2015] [Revised: 08/30/2016] [Accepted: 08/30/2016] [Indexed: 01/08/2023]
Abstract
Advances in membrane cell biology are hampered by the relatively high proportion of proteins with no known function. Such proteins are largely or entirely devoid of structurally significant domain annotations. Structural bioinformaticians have developed profile‐profile tools such as HHsearch (online version called HHpred), which can detect remote homologies that are missed by tools used to annotate databases. Here we have applied HHsearch to study a single structural fold in a single model organism as proof of principle. In the entire clan of protein domains sharing the pleckstrin homology domain fold in yeast, systematic application of HHsearch accurately identified known PH‐like domains. It also predicted 16 new domains in 13 yeast proteins many of which are implicated in intracellular traffic. One of these was Vps13p, where we confirmed the functional importance of the predicted PH‐like domain. Even though such predictions require considerable work to be corroborated, they are useful first steps. HHsearch should be applied more widely, particularly across entire proteomes of model organisms, to significantly improve database annotations.
Affiliation(s)
- David R Fidler
- Department of Cell Biology, UCL Institute of Ophthalmology, London, UK
- Sarah E Murphy
- Department of Cell Biology, UCL Institute of Ophthalmology, London, UK
- Katherine Courtis
- Department of Cell Biology, UCL Institute of Ophthalmology, London, UK
- Rana El-Tohamy
- Department of Cell Biology, UCL Institute of Ophthalmology, London, UK
- Jonathan Ient
- Department of Cell Biology, UCL Institute of Ophthalmology, London, UK
- Timothy P Levine
- Department of Cell Biology, UCL Institute of Ophthalmology, London, UK.
|
26
|
Bastos VA, Gomes-Neto F, Perales J, Neves-Ferreira AGC, Valente RH. Natural Inhibitors of Snake Venom Metalloendopeptidases: History and Current Challenges. Toxins (Basel) 2016; 8:toxins8090250. [PMID: 27571103 PMCID: PMC5037476 DOI: 10.3390/toxins8090250] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 06/18/2016] [Revised: 08/11/2016] [Accepted: 08/15/2016] [Indexed: 01/13/2023]
Abstract
The research on natural snake venom metalloendopeptidase inhibitors (SVMPIs) began in the 18th century with the pioneering work of Fontana on the resistance that vipers exhibited to their own venom. During the past 40 years, SVMPIs have been isolated mainly from the sera of resistant animals, and characterized to different extents. They are acidic oligomeric glycoproteins that remain biologically active over a wide range of pH and temperature values. Based on primary structure determination, mammalian plasmatic SVMPIs are classified as members of the immunoglobulin (Ig) supergene protein family, while the one isolated from muscle belongs to the ficolin/opsonin P35 family. On the other hand, SVMPIs from snake plasma have been placed in the cystatin superfamily. These natural antitoxins constitute the first line of defense against snake venoms, inhibiting the catalytic activities of snake venom metalloendopeptidases through the establishment of high-affinity, non-covalent interactions. This review presents a historical account of the field of natural resistance, summarizing its main discoveries and current challenges, which are mostly related to the limitations that preclude three-dimensional structural determinations of these inhibitors using “gold-standard” methods; perspectives on how to circumvent such limitations are presented. Potential applications of these SVMPIs in medicine are also highlighted.
Affiliation(s)
- Viviane A Bastos
- Laboratory of Toxinology, Oswaldo Cruz Foundation (FIOCRUZ), Rio de Janeiro 21040-900, Brazil.
- National Institute of Science and Technology on Toxins (INCTTOX), CNPq, Brasilia 71605-001, Brazil.
- Francisco Gomes-Neto
- Laboratory of Toxinology, Oswaldo Cruz Foundation (FIOCRUZ), Rio de Janeiro 21040-900, Brazil.
- National Institute of Science and Technology on Toxins (INCTTOX), CNPq, Brasilia 71605-001, Brazil.
- Jonas Perales
- Laboratory of Toxinology, Oswaldo Cruz Foundation (FIOCRUZ), Rio de Janeiro 21040-900, Brazil.
- National Institute of Science and Technology on Toxins (INCTTOX), CNPq, Brasilia 71605-001, Brazil.
- Ana Gisele C Neves-Ferreira
- Laboratory of Toxinology, Oswaldo Cruz Foundation (FIOCRUZ), Rio de Janeiro 21040-900, Brazil.
- National Institute of Science and Technology on Toxins (INCTTOX), CNPq, Brasilia 71605-001, Brazil.
- Richard H Valente
- Laboratory of Toxinology, Oswaldo Cruz Foundation (FIOCRUZ), Rio de Janeiro 21040-900, Brazil.
- National Institute of Science and Technology on Toxins (INCTTOX), CNPq, Brasilia 71605-001, Brazil.
|
27
|
Webb B, Sali A. Comparative Protein Structure Modeling Using MODELLER. CURRENT PROTOCOLS IN BIOINFORMATICS 2016; 54:5.6.1-5.6.37. [PMID: 27322406 PMCID: PMC5031415 DOI: 10.1002/cpbi.3] [Citation(s) in RCA: 1813] [Impact Index Per Article: 226.6] [Indexed: 12/22/2022]
Abstract
Comparative protein structure modeling predicts the three-dimensional structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and how to use the ModBase database of such models, and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software is also described. © 2016 by John Wiley & Sons, Inc.
Affiliation(s)
- Benjamin Webb
- University of California at San Francisco, San Francisco, California
- Andrej Sali
- University of California at San Francisco, San Francisco, California
|
28
|
Hess M, Keul F, Goesele M, Hamacher K. Addressing inaccuracies in BLOSUM computation improves homology search performance. BMC Bioinformatics 2016; 17:189. [PMID: 27122148 PMCID: PMC4849092 DOI: 10.1186/s12859-016-1060-3] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.6] [Received: 11/26/2015] [Accepted: 04/21/2016] [Indexed: 11/10/2022]
Abstract
BACKGROUND BLOSUM matrices belong to the most commonly used substitution matrix series for protein homology search and sequence alignments since their publication in 1992. In 2008, Styczynski et al. discovered miscalculations in the clustering step of the matrix computation. Still, the RBLOSUM64 matrix based on the corrected BLOSUM code was reported to perform worse at a statistically significant level than the BLOSUM62. Here, we present a further correction of the (R)BLOSUM code and provide a thorough performance analysis of BLOSUM-, RBLOSUM- and the newly derived CorBLOSUM-type matrices. Thereby, we assess homology search performance of these matrix-types derived from three different BLOCKS databases on all versions of the ASTRAL20, ASTRAL40 and ASTRAL70 subsets resulting in 51 different benchmarks in total. Our analysis is focused on two of the most popular BLOSUM matrices - BLOSUM50 and BLOSUM62. RESULTS Our study shows that fixing small errors in the BLOSUM code results in substantially different substitution matrices with a beneficial influence on homology search performance when compared to the original matrices. The CorBLOSUM matrices introduced here performed at least as good as their BLOSUM counterparts in ∼75 % of all test cases. On up-to-date ASTRAL databases BLOSUM matrices were even outperformed by CorBLOSUM matrices in more than 86 % of the times. In contrast to the study by Styczynski et al., the tested RBLOSUM matrices also outperformed the corresponding BLOSUM matrices in most of the cases. Comparing the CorBLOSUM with the RBLOSUM matrices revealed no general performance advantages for either on older ASTRAL releases. On up-to-date ASTRAL databases however CorBLOSUM matrices performed better than their RBLOSUM counterparts in ∼74 % of the test cases. 
CONCLUSIONS Our results imply that CorBLOSUM type matrices outperform the BLOSUM matrices on a statistically significant level in most of the cases, especially on up-to-date databases such as ASTRAL ≥2.01. Additionally, CorBLOSUM matrices are closer to those originally intended by Henikoff and Henikoff on a conceptual level. Hence, we encourage the usage of CorBLOSUM over (R)BLOSUM matrices for the task of homology search.
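As a rough illustration of the log-odds calculation that underlies BLOSUM-type matrices, the following Python sketch derives half-bit scores from pair counts in aligned columns. It deliberately omits the sequence clustering and weighting step in which the miscalculations discussed above were found, so it is not the BLOSUM, RBLOSUM or CorBLOSUM code. The function name `log_odds_scores` and the toy columns are assumptions made for this example.

```python
import math
from collections import Counter
from itertools import combinations


def log_odds_scores(columns):
    """Return half-bit log-odds scores for residue pairs seen in aligned columns.

    Toy version: every sequence counts equally (no clustering, no weighting).
    """
    # Count unordered residue pairs within each alignment column.
    pair_counts = Counter()
    for column in columns:
        for a, b in combinations(column, 2):
            pair_counts[tuple(sorted((a, b)))] += 1

    total = sum(pair_counts.values())
    # q[(a, b)]: observed frequency of the unordered pair (a, b).
    q = {pair: n / total for pair, n in pair_counts.items()}

    # p[a]: background frequency of residue a among all pair members.
    p = Counter()
    for (a, b), freq in q.items():
        if a == b:
            p[a] += freq
        else:
            p[a] += freq / 2
            p[b] += freq / 2

    scores = {}
    for (a, b), freq in q.items():
        # Expected frequency of the pair if residues were paired at random.
        expected = p[a] * p[a] if a == b else 2 * p[a] * p[b]
        # Log-odds in bits, scaled to half-bit units and rounded.
        scores[(a, b)] = round(2 * math.log2(freq / expected))
    return scores


toy_scores = log_odds_scores(["AAAA", "BBBB", "AB"])
```

For these toy columns the identical pairs score 2 and the mixed pair scores -5, which shows how rarely observed substitutions receive negative scores. Changing how sequences are clustered changes the pair counts and therefore every score derived from them.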
Affiliation(s)
- Martin Hess
  - Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
  - Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
- Frank Keul
  - Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
- Michael Goesele
  - Graphics, Capture and Massively Parallel Computing, Department of Computer Science, Technische Universität Darmstadt, Rundeturmstraße 12, Darmstadt, 64283, Germany
- Kay Hamacher
  - Computational Biology and Simulation, Department of Biology, Technische Universität Darmstadt, Schnittspahnstraße 2, Darmstadt, 64287, Germany
|
29
|
Alves JMP, de Oliveira AL, Sandberg TOM, Moreno-Gallego JL, de Toledo MAF, de Moura EMM, Oliveira LS, Durham AM, Mehnert DU, Zanotto PMDA, Reyes A, Gruber A. GenSeed-HMM: A Tool for Progressive Assembly Using Profile HMMs as Seeds and its Application in Alpavirinae Viral Discovery from Metagenomic Data. Front Microbiol 2016; 7:269. [PMID: 26973638 PMCID: PMC4777721 DOI: 10.3389/fmicb.2016.00269] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.6] [Received: 10/29/2015] [Accepted: 02/19/2016] [Indexed: 01/01/2023] Open
Abstract
This work reports the development of GenSeed-HMM, a program that implements seed-driven progressive assembly, an approach to reconstruct specific sequences from unassembled data, starting from short nucleotide or protein seed sequences or profile Hidden Markov Models (HMM). The program can use any one of a number of sequence assemblers. Assembly is performed in multiple steps and relatively few reads are used in each cycle, consequently the program demands low computational resources. As a proof-of-concept and to demonstrate the power of HMM-driven progressive assemblies, GenSeed-HMM was applied to metagenomic datasets in the search for diverse ssDNA bacteriophages from the recently described Alpavirinae subfamily. Profile HMMs were built using Alpavirinae-specific regions from multiple sequence alignments (MSA) using either the viral protein 1 (VP1; major capsid protein) or VP4 (genome replication initiation protein). These profile HMMs were used by GenSeed-HMM (running Newbler assembler) as seeds to reconstruct viral genomes from sequencing datasets of human fecal samples. All contigs obtained were annotated and taxonomically classified using similarity searches and phylogenetic analyses. The most specific profile HMM seed enabled the reconstruction of 45 partial or complete Alpavirinae genomic sequences. A comparison with conventional (global) assembly of the same original dataset, using Newbler in a standalone execution, revealed that GenSeed-HMM outperformed global genomic assembly in several metrics employed. This approach is capable of detecting organisms that have not been used in the construction of the profile HMM, which opens up the possibility of diagnosing novel viruses, without previous specific information, constituting a de novo diagnosis. Additional applications include, but are not limited to, the specific assembly of extrachromosomal elements such as plastid and mitochondrial genomes from metagenomic data. 
Profile HMM seeds can also be used to reconstruct specific protein coding genes for gene diversity studies, and to determine all possible gene variants present in a metagenomic sample. Such surveys could be useful to detect the emergence of drug-resistance variants in sensitive environments such as hospitals and animal production facilities, where antibiotics are regularly used. Finally, GenSeed-HMM can be used as an adjunct for gap closure on assembly finishing projects, by using multiple contig ends as anchored seeds.
Affiliation(s)
- João M P Alves
  - Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- André L de Oliveira
  - Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Tatiana O M Sandberg
  - Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Marcelo A F de Toledo
  - Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Elisabeth M M de Moura
  - Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Liliane S Oliveira
  - Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
  - Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
- Alan M Durham
  - Department of Computer Science, Institute of Mathematics and Statistics, University of São Paulo, São Paulo, Brazil
- Dolores U Mehnert
  - Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Paolo M de A Zanotto
  - Department of Microbiology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
- Alejandro Reyes
  - Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia
  - Center for Genome Sciences and Systems Biology, Department of Pathology and Immunology, Washington University in Saint Louis, MO, USA
- Arthur Gruber
  - Department of Parasitology, Institute of Biomedical Sciences, University of São Paulo, São Paulo, Brazil
|
30
|
|
31
|
Ghouzam Y, Postic G, de Brevern AG, Gelly JC. Improving protein fold recognition with hybrid profiles combining sequence and structure evolution. Bioinformatics 2015; 31:3782-9. [PMID: 26254434 DOI: 10.1093/bioinformatics/btv462] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.1] [Received: 05/04/2015] [Accepted: 08/02/2015] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Template-based modeling, the most successful approach for predicting protein 3D structure, often requires detecting distant evolutionary relationships between the target sequence and proteins of known structure. Developed for this purpose, fold recognition methods use elaborate strategies to exploit evolutionary information, mainly by encoding amino acid sequence into profiles. Since protein structure is more conserved than sequence, the inclusion of structural information can improve the detection of remote homology.
RESULTS Here, we present ORION, a new fold recognition method based on the pairwise comparison of hybrid profiles that contain evolutionary information from both protein sequence and structure. Our method uses the 16-state structural alphabet Protein Blocks, which provides an accurate 1D description of protein structure local conformations. ORION systematically outperforms PSI-BLAST and HHsearch on several benchmarks, including target sequences from the modeling competitions CASP8, 9 and 10, and detects ∼10% more templates at fold and superfamily SCOP levels.
AVAILABILITY Software freely available for download at http://www.dsimb.inserm.fr/orion/.
CONTACT jean-christophe.gelly@univ-paris-diderot.fr.
SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yassine Ghouzam
  - Inserm U1134, Paris, France, Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France, Institut National de la Transfusion Sanguine, Paris, France and Laboratory of Excellence GR-Ex, Paris, France
- Guillaume Postic
  - Inserm U1134, Paris, France, Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France, Institut National de la Transfusion Sanguine, Paris, France and Laboratory of Excellence GR-Ex, Paris, France
- Alexandre G de Brevern
  - Inserm U1134, Paris, France, Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France, Institut National de la Transfusion Sanguine, Paris, France and Laboratory of Excellence GR-Ex, Paris, France
- Jean-Christophe Gelly
  - Inserm U1134, Paris, France, Université Paris Diderot, Sorbonne Paris Cité, UMR_S 1134, Paris, France, Institut National de la Transfusion Sanguine, Paris, France and Laboratory of Excellence GR-Ex, Paris, France
|
32
|
Abstract
Functional characterization of a protein sequence is one of the most frequent problems in biology. This task is usually facilitated by an accurate three-dimensional (3-D) structure of the studied protein. In the absence of an experimentally determined structure, comparative or homology modeling can sometimes provide a useful 3-D model for a protein that is related to at least one known protein structure. Comparative modeling predicts the 3-D structure of a given protein sequence (target) based primarily on its alignment to one or more proteins of known structure (templates). The prediction process consists of fold assignment, target-template alignment, model building, and model evaluation. This unit describes how to calculate comparative models using the program MODELLER and discusses all four steps of comparative modeling, frequently observed errors, and some applications. Modeling lactate dehydrogenase from Trichomonas vaginalis (TvLDH) is described as an example. The download and installation of the MODELLER software are also described.
Affiliation(s)
- Benjamin Webb
  - University of California at San Francisco, San Francisco, California
|
33
|
Assessing the applicability of template-based protein docking in the twilight zone. Structure 2014; 22:1356-1362. [PMID: 25156427 DOI: 10.1016/j.str.2014.07.009] [Citation(s) in RCA: 32] [Impact Index Per Article: 3.2] [Received: 05/22/2014] [Revised: 07/24/2014] [Accepted: 07/31/2014] [Indexed: 11/20/2022]
Abstract
The structural modeling of protein interactions in the absence of close homologous templates is a challenging task. Recently, template-based docking methods have emerged to exploit local structural similarities to help ab initio protocols provide reliable 3D models for protein interactions. In this work, we critically assess the performance of template-based docking in the twilight zone. Our results show that, while it is possible to find templates for nearly all known interactions, the quality of the obtained models is rather limited. We can increase the precision of the models at the expense of coverage, but this drastically reduces the potential applicability of the method, as illustrated by the whole-interactome modeling of nine organisms. Template-based docking is likely to play an important role in the structural characterization of the interaction space, but we still need to improve the repertoire of structural templates onto which we can reliably model protein complexes.
|
34
|
Skewes-Cox P, Sharpton TJ, Pollard KS, DeRisi JL. Profile hidden Markov models for the detection of viruses within metagenomic sequence data. PLoS One 2014; 9:e105067. [PMID: 25140992 PMCID: PMC4139300 DOI: 10.1371/journal.pone.0105067] [Citation(s) in RCA: 119] [Impact Index Per Article: 11.9] [Received: 02/16/2014] [Accepted: 07/20/2014] [Indexed: 01/01/2023] Open
Abstract
Rapid, sensitive, and specific virus detection is an important component of clinical diagnostics. Massively parallel sequencing enables new diagnostic opportunities that complement traditional serological and PCR based techniques. While massively parallel sequencing promises the benefits of being more comprehensive and less biased than traditional approaches, it presents new analytical challenges, especially with respect to detection of pathogen sequences in metagenomic contexts. To a first approximation, the initial detection of viruses can be achieved simply through alignment of sequence reads or assembled contigs to a reference database of pathogen genomes with tools such as BLAST. However, recognition of highly divergent viral sequences is problematic, and may be further complicated by the inherently high mutation rates of some viral types, especially RNA viruses. In these cases, increased sensitivity may be achieved by leveraging position-specific information during the alignment process. Here, we constructed HMMER3-compatible profile hidden Markov models (profile HMMs) from all the virally annotated proteins in RefSeq in an automated fashion using a custom-built bioinformatic pipeline. We then tested the ability of these viral profile HMMs ("vFams") to accurately classify sequences as viral or non-viral. Cross-validation experiments with full-length gene sequences showed that the vFams were able to recall 91% of left-out viral test sequences without erroneously classifying any non-viral sequences into viral protein clusters. Thorough reanalysis of previously published metagenomic datasets with a set of the best-performing vFams showed that they were more sensitive than BLAST for detecting sequences originating from more distant relatives of known viruses. To facilitate the use of the vFams for rapid detection of remote viral homologs in metagenomic data, we provide two sets of vFams, comprising more than 4,000 vFams each, in the HMMER3 format. 
We also provide the software necessary to build custom profile HMMs or update the vFams as more viruses are discovered (http://derisilab.ucsf.edu/software/vFam).
Affiliation(s)
- Peter Skewes-Cox
  - Biological and Medical Informatics Graduate Program, University of California San Francisco, San Francisco, California, United States of America
  - Departments of Medicine, Biochemistry and Biophysics, and Microbiology, University of California San Francisco, San Francisco, California, United States of America
  - Howard Hughes Medical Institute, Bethesda, Maryland, United States of America
- Thomas J. Sharpton
  - The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America
- Katherine S. Pollard
  - The J. David Gladstone Institutes, University of California San Francisco, San Francisco, California, United States of America
  - Institute for Human Genetics & Division of Biostatistics, University of California San Francisco, San Francisco, California, United States of America
- Joseph L. DeRisi
  - Departments of Medicine, Biochemistry and Biophysics, and Microbiology, University of California San Francisco, San Francisco, California, United States of America
  - Howard Hughes Medical Institute, Bethesda, Maryland, United States of America
|
35
|
Jabeen R, Mustafa G, Ul Abdin Z, Iqbal MJ, Jamil A. Expression profiling of bioactive genes from Moringa oleifera. Appl Biochem Biotechnol 2014; 174:657-66. [PMID: 25086925 DOI: 10.1007/s12010-014-1122-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/10/2014] [Accepted: 07/23/2014] [Indexed: 01/10/2023]
Abstract
Plants are under constant assault by biotic and abiotic agents. When an elicitor is introduced, an immense reprogramming of plant gene expression and defense responses is initiated, which could be a natural source for potential drug development and insertional mutagenesis. In this regard, differential expression analysis of the medicinal plant Moringa oleifera was performed for bioactive genes at the seedling stage, using the differential display RT-PCR technique. Seedlings infected with the fungus Fusarium solani, collected at different time intervals, showed a massive change in their gene expression profile. The data analysis revealed that at least 150 pathogen-induced and about 60 suppressed genes were differentially expressed at 8 h postinoculation of the biotic stress. Fifty-five selective genes were disunited and reamplified. Sequence analysis of these potential genes illustrated that some had properties of induced peroxidase mRNA and cell proliferation, others were mitogen-activated protein kinases, ribosomal protein genes and defense-regulating genes, and a few also had structural properties. Further studies of the utility of these genes in plant metabolism could help develop improved transgenic breeds with enhanced infection tolerance, not only of M. oleifera but of other cultivars also.
Affiliation(s)
- Raheela Jabeen
  - Molecular Biochemistry Lab, Department of Chemistry and Biochemistry, University of Agriculture, Faisalabad, Pakistan
|
36
|
Identification of genetic bases of Vibrio fluvialis species-specific biochemical pathways and potential virulence factors by comparative genomic analysis. Appl Environ Microbiol 2014; 80:2029-37. [PMID: 24441165 DOI: 10.1128/aem.03588-13] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Indexed: 12/19/2022] Open
Abstract
Vibrio fluvialis is an important food-borne pathogen that causes diarrheal illness and sometimes extraintestinal infections in humans. In this study, we sequenced the genome of a clinical V. fluvialis strain and determined its phylogenetic relationships with other Vibrio species by comparative genomic analysis. We found that the closest relationship was between V. fluvialis and V. furnissii, followed by those with V. cholerae and V. mimicus. Moreover, based on genome comparisons and gene complementation experiments, we revealed genetic mechanisms of the biochemical tests that differentiate V. fluvialis from closely related species. Importantly, we identified a variety of genes encoding potential virulence factors, including multiple hemolysins, transcriptional regulators, and environmental survival and adaptation apparatuses, and the type VI secretion system, which is indicative of complex regulatory pathways modulating pathogenesis in this organism. The availability of V. fluvialis genome sequences may promote our understanding of pathogenic mechanisms for this emerging pathogen.
|
37
|
Abstract
Structural proteomics aims to understand the structural basis of protein interactions and functions. A prerequisite for this is the availability of 3D protein structures that mediate the biochemical interactions. The explosion in the number of available gene sequences set the stage for the next step in genome-scale projects -- to obtain 3D structures for each protein. To achieve this ambitious goal, the slow and costly structure determination experiments are supplemented with theoretical approaches. The current state and recent advances in structure modeling approaches are reviewed here, with special emphasis on comparative protein structure modeling techniques.
Affiliation(s)
- András Fiser
  - Department of Biochemistry, Seaver Foundation Center for Bioinformatics, Albert Einstein College of Medicine, 1300 Morris Park Ave., Bronx, NY 10461, USA
|
38
|
Webb B, Eswar N, Fan H, Khuri N, Pieper U, Dong G, Sali A. Comparative Modeling of Drug Target Proteins. Reference Module in Chemistry, Molecular Sciences and Chemical Engineering 2014. [PMCID: PMC7157477 DOI: 10.1016/b978-0-12-409547-2.11133-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/26/2022]
Abstract
In this perspective, we begin by describing the comparative protein structure modeling technique and the accuracy of the corresponding models. We then discuss the significant role that comparative prediction plays in drug discovery. We focus on virtual ligand screening against comparative models and illustrate the state-of-the-art by a number of specific examples.
|
39
|
Walker SD, McEldowney S. Molecular docking: a potential tool to aid ecotoxicity testing in environmental risk assessment of pharmaceuticals. Chemosphere 2013; 93:2568-2577. [PMID: 24344392 DOI: 10.1016/j.chemosphere.2013.09.074] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.0] [Indexed: 06/03/2023]
Abstract
A cocktail of human pharmaceuticals pollutes aquatic environments, and there is considerable scientific uncertainty about the effects that this may have on aquatic organisms. Human drug target proteins can be highly conserved in non-target species, suggesting that similar modes of action (MoA) may occur. The aim of this work was to explore whether molecular docking offers a potential tool to predict the effects of pharmaceutical compounds on non-target organisms. Three highly prescribed drugs that regularly pollute freshwater environments, diclofenac, ibuprofen and levonorgestrel, were used as examples. Their primary drug targets are cyclooxygenase 2 (COX2) and progesterone receptor (PR). Molecular docking experiments were performed using these drugs and their primary drug target homologues for Danio rerio, Salmo salar, Oncorhynchus mykiss, Xenopus tropicalis, Xenopus laevis and Daphnia pulex. The results show that fish and frog COX2 enzymes are likely to bind diclofenac and ibuprofen in the same way as the human enzyme but that the D. pulex enzyme would not. Binding will probably lead to inhibition of COX function and reduced prostaglandin production. Levonorgestrel was found to bind in the same binding pocket of the progesterone receptor in frogs and fish as in the human form. This suggests implications for the fecundity of fish and frogs that are exposed to levonorgestrel. Chronic ecotoxicological effects of these drugs reported in the literature support these findings. Molecular docking may provide a valuable tool for ecotoxicity tests by guiding selection of test species and incorporating the MoA of drugs for relevant chronic test end points in environmental risk assessments.
|
40
|
Mary Rajathei D, Selvaraj S. Analysis of sequence repeats of proteins in the PDB. Comput Biol Chem 2013; 47:156-66. [PMID: 24121644 DOI: 10.1016/j.compbiolchem.2013.09.001] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Received: 06/26/2013] [Revised: 08/27/2013] [Accepted: 09/05/2013] [Indexed: 10/26/2022]
Abstract
Internal repeats in protein sequences play a significant role in the evolution of protein structure and function. Applications of different bioinformatics tools help in the identification and characterization of these repeats. In the present study, we analyzed sequence repeats in a non-redundant set of proteins available in the Protein Data Bank (PDB). We used RADAR for detecting internal repeats in a protein, PDBeFOLD for assessing structural similarity, PDBsum for finding functional involvement and Pfam for domain assignment of the repeats in a protein. Through the analysis of sequence repeats, we found that the identity of the sequence repeats falls in the range of 20-40%, and that the superimposed structures of most of the sequence repeats maintain a similar overall fold. Analysis of sequence repeats at the functional level reveals that most of the sequence repeats are involved in the function of the protein through functionally involved residues in the repeat regions. We also found that sequence repeats in single- and two-domain proteins often contained conserved sequence motifs for the function of the domain.
Affiliation(s)
- David Mary Rajathei
  - Department of Bioinformatics, School of Life Sciences, Bharathidasan University, Tiruchirappalli 620024, Tamilnadu, India
|
41
|
Abstract
Motivation: Sequence similarity searches performed with BLAST, SSEARCH and FASTA achieve high sensitivity by using scoring matrices (e.g. BLOSUM62) that target low identity (<33%) alignments. Although such scoring matrices can effectively identify distant homologs, they can also produce local alignments that extend beyond the homologous regions.
Results: We measured local alignment start/stop boundary accuracy using a set of queries where the correct alignment boundaries were known, and found that 7% of BLASTP and 8% of SSEARCH alignment boundaries were overextended. Overextended alignments include non-homologous sequences; they occur most frequently between sequences that are more closely related (>33% identity). Adjusting the scoring matrix to reflect the identity of the homologous sequence can correct higher identity overextended alignment boundaries. In addition, the scoring matrix that produced a correct alignment could be reliably predicted based on the sequence identity seen in the original BLOSUM62 alignment. Realigning with the predicted scoring matrix corrected 37% of all overextended alignments, resulting in more correct alignments than using BLOSUM62 alone.
Availability: RefProtDom2 (RPD2) sequences and the FASTA software are available from http://faculty.virginia.edu/wrpearson/fasta.
Contact: wrp@virginia.edu
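The idea of predicting a realignment matrix from the identity seen in a first-pass alignment can be sketched in Python as follows. This is only an illustration under stated assumptions: the cut-offs and the matrix names in `HYPOTHETICAL_CHOICES` are placeholders invented for this example (only the 33% boundary and BLOSUM62 come from the abstract), and the functions `percent_identity` and `suggest_matrix` are not part of the FASTA package.

```python
# Hypothetical (minimum identity %, matrix name) pairs, highest cut-off first.
# Only the 33% boundary and BLOSUM62 are taken from the abstract; the other
# cut-off and matrix names are illustrative assumptions.
HYPOTHETICAL_CHOICES = [
    (60.0, "BLOSUM90"),
    (33.0, "BLOSUM80"),
    (0.0, "BLOSUM62"),
]


def percent_identity(aligned_a, aligned_b):
    """Percent identity over aligned (non-gap) columns of two gapped strings."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Keep only columns where neither sequence has a gap character.
    columns = [(x, y) for x, y in zip(aligned_a, aligned_b)
               if x != "-" and y != "-"]
    if not columns:
        return 0.0
    matches = sum(1 for x, y in columns if x == y)
    return 100.0 * matches / len(columns)


def suggest_matrix(identity, choices=HYPOTHETICAL_CHOICES):
    """Pick the first matrix whose minimum identity is met."""
    for minimum, name in choices:
        if identity >= minimum:
            return name
    return choices[-1][1]


first_pass_identity = percent_identity("ACD-F", "ACE-F")
realign_with = suggest_matrix(first_pass_identity)
```

In a real workflow the first-pass alignment would come from a BLOSUM62 search, and the sequences would be realigned with the suggested matrix before the alignment boundaries are compared.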
Affiliation(s)
- Lauren J Mills
  - Department of Molecular, Cell and Developmental Biology and Department of Biochemistry and Molecular Genetics, University of Virginia, Charlottesville, VA 22908, USA
|
42
|
Hähnke V, Rupp M, Hartmann AK, Schneider G. Pharmacophore Alignment Search Tool (PhAST): Significance Assessment of Chemical Similarity. Mol Inform 2013; 32:625-46. [PMID: 27481770 DOI: 10.1002/minf.201300021] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 02/12/2013] [Accepted: 04/19/2013] [Indexed: 11/06/2022]
Abstract
Previously, we proposed a ligand-based virtual screening technique (PhAST) based on global alignment of linearized interaction patterns. Here, we applied techniques developed for similarity assessment in local sequence alignments to our method resulting in p-values for chemical similarity. We compared two sampling strategies, a simple sampling strategy and a Markov Chain Monte Carlo (MCMC) method, and investigated the similarity of sampled distributions to Gaussian, Gumbel, modified Gumbel, and Gamma distributions. The Gumbel distribution with a Gaussian correction term was identified as the most similar to the observed empirical distributions. These techniques were applied in retrospective screenings on a drug-like dataset. Obtained p-values were adjusted to the size of the screening library with four different methods. Evaluation of E-value thresholds corroborated the Bonferroni correction as a preferred means to identify significant chemical similarity with PhAST. An online version of PhAST with significance estimation is available at http://modlab-cadd.ethz.ch/.
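The significance calculations mentioned in this abstract can be illustrated with a short Python sketch: a tail probability from a plain Gumbel (maximum) distribution, a Bonferroni adjustment for library size, and the corresponding E-value. It assumes a plain Gumbel fit with location `mu` and scale `beta`; the Gaussian correction term that the abstract identifies as the best fit is not modeled, and all function names are hypothetical rather than taken from PhAST.

```python
import math


def gumbel_pvalue(score, mu, beta):
    """P(S >= score) under a Gumbel (maximum) distribution.

    mu is the location and beta the scale of the fitted distribution.
    """
    # 1 - exp(-x) computed via expm1 to stay accurate for very small p-values.
    return -math.expm1(-math.exp(-(score - mu) / beta))


def bonferroni(p_value, library_size):
    """Bonferroni-adjusted p-value for a screening library of given size."""
    return min(1.0, p_value * library_size)


def e_value(p_value, library_size):
    """Expected number of library compounds scoring at least this well by chance."""
    return p_value * library_size


p = gumbel_pvalue(score=12.0, mu=5.0, beta=1.0)
adjusted = bonferroni(p, library_size=10_000)
```

The Bonferroni-adjusted value is the E-value capped at 1, which is why thresholds on either quantity behave similarly for small p-values.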
Affiliation(s)
- Volker Hähnke
  - Eidgenössische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland phone: +1 (202)436-5989
- Matthias Rupp
  - Eidgenössische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland phone: +1 (202)436-5989
- Alexander K Hartmann
  - Universität Oldenburg, Computational Theoretical Physics, Institut für Physik, Carl-von-Ossietzky Strasse 9-11, 26111 Oldenburg, Germany
- Gisbert Schneider
  - Eidgenössische Technische Hochschule (ETH), Department of Chemistry and Applied Biosciences, Institute of Pharmaceutical Sciences, Wolfgang-Pauli-Str. 10, 8093 Zürich, Switzerland phone: +1 (202)436-5989
|
43
|
Falak S, Jamil A. Expression profiling of bioactive genes from a medicinal plant Nigella sativa L. Appl Biochem Biotechnol 2013; 170:1472-81. [PMID: 23686472 DOI: 10.1007/s12010-013-0281-4] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 11/30/2012] [Accepted: 05/01/2013] [Indexed: 01/31/2023]
Abstract
Plants respond to stress in part by modulating gene expression, either constitutively or in an inducible manner, which ultimately leads to the restoration of cellular homeostasis, detoxification of toxins, and recovery of growth. Upon exposure to various elicitors such as pathogen-associated molecular patterns, a massive reprogramming of plant gene expression is initiated. Differential display PCR offers rapid and multiple comparisons of gene expression across various stress durations and intensities. Nigella sativa is acclaimed for many medicinal properties in traditional medicine. To explore the underlying molecular mechanisms of the stress response in this plant, stress from the fungus Fusarium solani was induced at different time intervals ranging from 0 to 48 h. RNA was subjected to complementary DNA (cDNA) synthesis followed by PCR using different sets of anchored primers and arbitrary primers. The expression was visualized after silver staining on urea-PAGE. Out of the 23 upregulated re-amplified cDNA products, ten differential fragments showed significant homologies with domains related to cellular metabolism, signal transduction, and disease resistance. Such genes could be an informative source for developing genetically improved breeds under infectious stress.
Affiliation(s)
- Sadia Falak
  - University of Agriculture Faisalabad, Faisalabad, Pakistan
|
44
|
Kaznadzey A, Alexandrova N, Novichkov V, Kaznadzey D. PSimScan: algorithm and utility for fast protein similarity search. PLoS One 2013; 8:e58505. [PMID: 23505522 PMCID: PMC3591303 DOI: 10.1371/journal.pone.0058505] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 01/27/2012] [Accepted: 02/07/2013] [Indexed: 01/19/2023] Open
Abstract
In the era of metagenomics and diagnostics sequencing, the importance of protein comparison methods of boosted performance cannot be overstated. Here we present PSimScan (Protein Similarity Scanner), a flexible open source protein similarity search tool which provides a significant gain in speed compared to BLASTP at the price of controlled sensitivity loss. The PSimScan algorithm introduces a number of novel performance optimization methods that can be further used by the community to improve the speed and lower hardware requirements of bioinformatics software. The optimization starts at the lookup table construction, then the initial lookup table–based hits are passed through a pipeline of filtering and aggregation routines of increasing computational complexity. The first step in this pipeline is a novel algorithm that builds and selects ‘similarity zones’ aggregated from neighboring matches on small arrays of adjacent diagonals. PSimScan performs 5 to 100 times faster than the standard NCBI BLASTP, depending on chosen parameters, and runs on commodity hardware. Its sensitivity and selectivity at the slowest settings are comparable to the NCBI BLASTP’s and decrease with the increase of speed, yet stay at the levels reasonable for many tasks. PSimScan is most advantageous when used on large collections of query sequences. Comparing the entire proteome of Streptococcus pneumoniae (2,042 proteins) to the NCBI’s non-redundant protein database of 16,971,855 records takes 6.5 hours on a moderately powerful PC, while the same task with the NCBI BLASTP takes over 66 hours. We describe innovations in the PSimScan algorithm in considerable detail to encourage bioinformaticians to improve on the tool and to use the innovations in their own software development.
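The general idea of lookup-table hits grouped by diagonal can be sketched in Python as below. This is a simplified illustration and not the PSimScan algorithm: it uses exact k-mer matches, a plain dictionary as the lookup table, and a naive sum over adjacent diagonals as a stand-in for the 'similarity zones' described above. The function names and the parameters `k` and `width` are assumptions made for this example.

```python
from collections import defaultdict


def kmer_index(sequence, k):
    """Lookup table mapping each k-mer to its start positions in the sequence."""
    index = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        index[sequence[i:i + k]].append(i)
    return index


def hits_per_diagonal(query, subject, k=3):
    """Count exact k-mer hits on each diagonal (query offset minus subject offset)."""
    index = kmer_index(subject, k)
    diagonals = defaultdict(int)
    for i in range(len(query) - k + 1):
        for j in index.get(query[i:i + k], ()):
            diagonals[i - j] += 1
    return dict(diagonals)


def best_zone(diagonals, width=2):
    """Return (first diagonal, hit total) for the densest band of adjacent diagonals."""
    if not diagonals:
        return None
    best = None
    for d in range(min(diagonals), max(diagonals) + 1):
        total = sum(diagonals.get(d + offset, 0) for offset in range(width))
        if best is None or total > best[1]:
            best = (d, total)
    return best


zone = best_zone(hits_per_diagonal("MKVLAAGIT", "XXMKVLAAGIT"))
```

Summing over a small band of neighboring diagonals tolerates short insertions and deletions, which shift hits from one diagonal to an adjacent one. Only the densest bands would then be passed to more expensive alignment steps.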
Affiliation(s)
- Anna Kaznadzey
  - Institute for Information Transmission Problems, Russian Academy of Sciences, Moscow, Russia
- Natalia Alexandrova
  - Genome Designs, Inc., Walnut Creek, California, United States of America
- Denis Kaznadzey
  - DOE Joint Genome Institute, Walnut Creek, California, United States of America
|
45
|
Melo MCR, Bernardi RC, Fernandes TVA, Pascutti PG. GSAFold: a new application of GSA to protein structure prediction. Proteins 2012; 80:2305-10. [PMID: 22622959 DOI: 10.1002/prot.24120] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.7] [Received: 02/15/2012] [Revised: 05/08/2012] [Accepted: 05/21/2012] [Indexed: 11/07/2022]
Abstract
The folding process defines three-dimensional protein structures from their amino acid chains. A protein's structure determines its activity and properties; thus knowing such conformation on an atomic level is essential for both basic and applied studies of protein function and dynamics. However, the acquisition of such structures by experimental methods is slow and expensive, and current computational methods mostly depend on previously known structures to determine new ones. Here we present new software called GSAFold that applies the generalized simulated annealing (GSA) algorithm to ab initio protein structure prediction. GSA is a stochastic search algorithm employed in energy minimization and used in global optimization problems, especially those that depend on long-range interactions, such as gravity models and conformation optimization of small molecules. This new implementation applies, for the first time in ab initio protein structure prediction, an analytical inverse for the Visitation function of GSA. It also employs the widely used NAMD Molecular Dynamics package to carry out energy calculations, allowing the user to select different force fields and parameterizations. Moreover, the software allows several simulations to be executed simultaneously. Applications that depend on protein structures include rational drug design and structure-based protein function prediction. Applying GSAFold to a test peptide, it was possible to predict the structure of mastoparan-X to a root mean square deviation of 3.00 Å.
Affiliation(s)
- Marcelo C R Melo
  - Laboratório de Biotecnologia/DIPRO, Instituto Nacional de Metrologia, Qualidade e Tecnologia, Rio de Janeiro, Brasil
|
46
|
Krissinel E. Enhanced fold recognition using efficient short fragment clustering. Journal of Molecular Biochemistry 2012; 1:76-85. [PMID: 27882309 PMCID: PMC5117261] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/06/2023]
Abstract
The main structure aligner in the CCP4 Software Suite, SSM (Secondary Structure Matching), has limited applicability at the intermediate stages of the structure solution process, when the secondary structure cannot be reliably computed due to structural incompleteness or a fragmented mainchain. In this study, we describe a new algorithm for the alignment and comparison of protein structures in CCP4, which was designed to overcome SSM's limitations but retain its quality and speed. The new algorithm, named GESAMT (General Efficient Structural Alignment of Macromolecular Targets), employs the old idea of deriving the global structure similarity from a promising set of locally similar short fragments, but uses a few technical solutions that make it considerably faster. A comparative sensitivity and selectivity analysis revealed an unexpected significant improvement in the fold recognition properties of the new algorithm, which also makes it useful for applications in the structural bioinformatics domain. The new tool is included in the CCP4 Software Suite starting from version 6.3.
Affiliation(s)
- Evgeny Krissinel
  - CCP4, Research Complex at Harwell, Rutherford Appleton Laboratory, Harwell Science and Innovation Campus, Didcot, Oxon, OX11 0FA, United Kingdom
|
47
|
Peris G, Marzal A. Normalized global alignment for protein sequences. J Theor Biol 2011; 291:22-8. [PMID: 21945336 DOI: 10.1016/j.jtbi.2011.09.017] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 05/26/2011] [Revised: 07/19/2011] [Accepted: 09/08/2011] [Indexed: 10/17/2022]
Abstract
Global alignment is used to compare proteins in different fields, for example in phylogenetic research. In order to reduce the length and composition dependence of global alignment scores, a Z-score is computed with a Monte-Carlo algorithm. This technique requires a great number of sequence alignments on shuffled sequences, leading to a high computational cost. In this work, a normalized global alignment score is introduced in order to correct the length dependence of global alignments. This score is defined as the best ratio between the score of an alignment and its length, and an algorithm to compute it based on fractional programming is implemented. The properties and effectiveness of normalized global alignment applied to protein comparison are analyzed. Experiments with proteins selected from the SCOP ASTRAL database were run to study the relationship of normalized global alignment with Z-score and its performance in homology detection. Results show that normalized global alignment has a computational cost equivalent to 2.5 Needleman-Wunsch runs and a linear relationship with Z-score. This linearity allows us to use normalized global alignment as a cheap substitute for a computationally expensive Z-score. Experiments show that normalized global alignment improves the ability to identify homologous proteins. Software used to compute normalized global alignments is available from http://www3.uji.es/~peris/nga.
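To make the "best ratio between the score of an alignment and its length" concrete, here is a minimal Python sketch that maximizes score/length with a Dinkelbach-style fractional programming iteration: it repeatedly solves a Needleman-Wunsch problem in which a penalty `lam` is subtracted from every alignment column, then updates `lam` to the ratio of the alignment found. The choice of Dinkelbach's method, the function names, and the toy match/mismatch/linear-gap scoring are assumptions for illustration; the paper's own algorithm and its protein substitution scoring may differ.

```python
def parametric_alignment(a, b, lam, match=1.0, mismatch=-1.0, gap=-2.0):
    """Global alignment maximising (score - lam * length).

    Returns the raw score and the number of columns of that alignment.
    Each DP cell holds (adjusted score, raw score, length).
    """
    m = len(b)
    prev = [(j * (gap - lam), j * gap, j) for j in range(m + 1)]
    for i in range(1, len(a) + 1):
        cur = [(i * (gap - lam), i * gap, i)]
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cur.append(max(
                (cell[0] + step - lam, cell[1] + step, cell[2] + 1)
                for cell, step in ((prev[j - 1], sub), (prev[j], gap), (cur[j - 1], gap))
            ))
        prev = cur
    _, raw, length = prev[m]
    return raw, length


def normalized_global_score(a, b, tol=1e-9, max_iter=50):
    """Dinkelbach-style iteration for max over alignments of score / length."""
    if not a and not b:
        raise ValueError("at least one sequence must be non-empty")
    lam = 0.0
    for _ in range(max_iter):
        raw, length = parametric_alignment(a, b, lam)
        new_lam = raw / length
        if abs(new_lam - lam) <= tol:
            break
        lam = new_lam
    return new_lam
```

With this toy scoring, two identical sequences give a normalized score equal to the match score, regardless of their length, which is the length independence the abstract is after.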
Affiliation(s)
- Guillermo Peris
  - Department de Llenguatges i Sistemes Informátics, Universitat Jaume I, 12071 Castelló, Spain.
|
48
|
Hawkins T, Kihara D. Function prediction of uncharacterized proteins. J Bioinform Comput Biol 2011; 5:1-30. [PMID: 17477489 DOI: 10.1142/s0219720007002503] [Citation(s) in RCA: 75] [Impact Index Per Article: 5.8] [Received: 06/20/2006] [Revised: 09/23/2006] [Accepted: 10/10/2006] [Indexed: 11/18/2022]
Abstract
Function prediction of uncharacterized protein sequences generated by genome projects has emerged as an important focus for computational biology. We have categorized several approaches beyond traditional sequence similarity that utilize the overwhelmingly large amounts of available data for computational function prediction, including structure-, association (genomic context)-, interaction (cellular context)-, process (metabolic context)-, and proteomics-experiment-based methods. Because they incorporate structural and experimental data that are not used in sequence-based methods, they can provide additional accuracy and reliability to protein function prediction. Here, we first review the definition of protein function. Then the recent developments of these methods are introduced, with special focus on the types of predictions that can be made. The need for further development of comprehensive systems biology techniques that can utilize the ever-increasing data presented by the genomics and proteomics communities is emphasized. For the readers' convenience, tables of useful online resources in each category are included. The role of computational scientists in the near future of biological research and the interplay between computational and experimental biology are also addressed.
Affiliation(s)
- Troy Hawkins
  - Department of Biological Sciences, Purdue University, West Lafayette, IN, USA.
|
49
|
Chua GH, Krishnan A, Li KB, Tomita M. Multiresolution analysis uncovers hidden conservation of properties in structurally and functionally similar proteins. J Bioinform Comput Biol 2011; 4:1245-67. [PMID: 17245813 DOI: 10.1142/s0219720006002442] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/29/2006] [Revised: 09/13/2006] [Accepted: 09/13/2006] [Indexed: 11/18/2022]
Abstract
Physicochemical properties of amino acids are important factors in determining protein structure and function. Most approaches make use of averaged properties over entire domains or even proteins to analyze their structure or function. This level of coarseness tends to hide the richness of the variability in the different properties across functional domains. This paper studies the conservation of physicochemical properties in a functionally similar family of proteins using a novel wavelet-based technique known as multiresolution analysis. Such an analysis can help uncover characteristics that can otherwise remain hidden. We have studied the protein kinase family of sequences and our findings are as follows: (a) a number of different properties are conserved over the functional catalytic domain irrespective of the sequence identities; (b) conservation of properties can be observed at different frequency levels and they agree well with the known structural/functional properties of the subdomains for the protein kinase family; (c) structural differences between the different kinase family members are reflected in the waveforms; and (d) functionally important mutations show distortions in the waveforms of conserved properties. The potential usefulness of the above findings in identifying functionally similar sequences in the twilight and midnight zones is demonstrated through a simple prediction model for the protein kinase family which achieved a recall of 93.7% and a precision of 96.75% in cross-validation tests.
Affiliation(s)
- Gek-Huey Chua
  - Bioinformatics Institute, 30, Biopolis Street, #07-01, Matrix, Singapore
|
50
|
Yelton AP, Thomas BC, Simmons SL, Wilmes P, Zemla A, Thelen MP, Justice N, Banfield JF. A semi-quantitative, synteny-based method to improve functional predictions for hypothetical and poorly annotated bacterial and archaeal genes. PLoS Comput Biol 2011; 7:e1002230. [PMID: 22028637 PMCID: PMC3197636 DOI: 10.1371/journal.pcbi.1002230] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.2] [Received: 03/29/2011] [Accepted: 08/30/2011] [Indexed: 11/19/2022]
Abstract
During microbial evolution, genome rearrangement increases with increasing sequence divergence. If the relationship between synteny and sequence divergence can be modeled, gene clusters in genomes of distantly related organisms exhibiting anomalous synteny can be identified and used to infer functional conservation. We applied the phylogenetic pairwise comparison method to establish and model a strong correlation between synteny and sequence divergence in all 634 available Archaeal and Bacterial genomes from the NCBI database and four newly assembled genomes of uncultivated Archaea from an acid mine drainage (AMD) community. In parallel, we established and modeled the trend between synteny and functional relatedness in the 118 genomes available in the STRING database. By combining these models, we developed a gene functional annotation method that weights evolutionary distance to estimate the probability of functional associations of syntenous proteins between genome pairs. The method was applied to the hypothetical proteins and poorly annotated genes in newly assembled acid mine drainage Archaeal genomes to add or improve gene annotations. This is the first method to assign possible functions to poorly annotated genes through quantification of the probability of gene functional relationships based on synteny at a significant evolutionary distance, and has the potential for broad application.
Affiliation(s)
- Alexis P. Yelton
  - Department of Environmental Science, Policy, and Management, University of California, Berkeley, California, United States of America
- Brian C. Thomas
  - Department of Environmental Science, Policy, and Management, University of California, Berkeley, California, United States of America
- Sheri L. Simmons
  - Department of Earth and Planetary Sciences, University of California, Berkeley, California, United States of America
- Paul Wilmes
  - Department of Earth and Planetary Sciences, University of California, Berkeley, California, United States of America
- Adam Zemla
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America
- Michael P. Thelen
  - Physical and Life Sciences Directorate, Lawrence Livermore National Laboratory, Livermore, California, United States of America
- Nicholas Justice
  - Department of Plant and Microbial Biology, University of California, Berkeley, California, United States of America
- Jillian F. Banfield
  - Department of Environmental Science, Policy, and Management, University of California, Berkeley, California, United States of America
  - Department of Earth and Planetary Sciences, University of California, Berkeley, California, United States of America
|