1
Abnousi A, Broschat SL, Kalyanaraman A. A Fast Alignment-Free Approach for De Novo Detection of Protein Conserved Regions. PLoS One 2016; 11:e0161338. [PMID: 27552220 PMCID: PMC4995020 DOI: 10.1371/journal.pone.0161338] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 02/29/2016] [Accepted: 08/03/2016] [Indexed: 12/05/2022]
Abstract
BACKGROUND Identifying conserved regions in protein sequences is a fundamental operation, occurring in numerous sequence-driven analysis pipelines. It is used as a way to decode domain-rich regions within proteins, to compute protein clusters, to annotate sequence function, and to compute evolutionary relationships among protein sequences. A number of approaches exist for identifying and characterizing protein families based on their domains, and because domains represent conserved portions of a protein sequence, the primary computation involved in protein family characterization is identification of such conserved regions. However, identifying conserved regions from large collections (millions) of protein sequences presents significant challenges. METHODS In this paper we present a new, alignment-free method for detecting conserved regions in protein sequences called NADDA (No-Alignment Domain Detection Algorithm). Our method exploits the abundance of exact matching short subsequences (k-mers) to quickly detect conserved regions, and the power of machine learning is used to improve the prediction accuracy of detection. We present a parallel implementation of NADDA using the MapReduce framework and show that our method is highly scalable. RESULTS We have compared NADDA with Pfam and InterPro databases. For known domains annotated by Pfam, accuracy is 83%, sensitivity 96%, and specificity 44%. For sequences with new domains not present in the training set an average accuracy of 63% is achieved when compared to Pfam. A boost in results in comparison with InterPro demonstrates the ability of NADDA to capture conserved regions beyond those present in Pfam. We have also compared NADDA with ADDA and MKDOM2, assuming Pfam as ground-truth. On average NADDA shows comparable accuracy, more balanced sensitivity and specificity, and being alignment-free, is significantly faster. Excluding the one-time cost of training, runtimes on a single processor were 49s, 10,566s, and 456s for NADDA, ADDA, and MKDOM2, respectively, for a data set comprised of approximately 2500 sequences.
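As a rough illustration of the k-mer idea mentioned in the abstract above, the following sketch counts, for every position in every sequence, how many sequences in a collection contain the exact k-mer starting there. It is not NADDA itself: the function name `kmer_profile`, the choice of k = 3, and the toy sequences are all assumptions made for this example, and the machine learning and MapReduce stages described in the abstract are not represented.

```python
from collections import Counter


def kmer_profile(sequences, k=3):
    """Per-position k-mer support for each sequence.

    The value at position i of a sequence is the number of sequences in
    the collection that contain the k-mer starting at i (hypothetical
    helper, not taken from the NADDA paper).
    """
    support = Counter()
    for seq in sequences:
        # Count each k-mer once per sequence so that repeats inside a
        # single protein do not inflate its support.
        support.update({seq[i:i + k] for i in range(len(seq) - k + 1)})
    return [
        [support[seq[i:i + k]] for i in range(len(seq) - k + 1)]
        for seq in sequences
    ]


# Toy input (invented): the first two sequences share the segment "ACDEF".
toy = ["MKACDEFGH", "PPACDEFLL", "QQQWWWYYY"]
profiles = kmer_profile(toy, k=3)
```

Positions whose support is high relative to the rest of the sequence are candidates for conserved regions; in this toy input only the k-mers inside the shared segment are supported by more than one sequence.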
Affiliation(s)
- Armen Abnousi
- School of EECS, Washington State University, Pullman, WA, United States of America
- Shira L. Broschat
- School of EECS, Washington State University, Pullman, WA, United States of America
- Paul G. Allen School for Global Animal Health, Washington State University, Pullman, WA, United States of America
- Department of Veterinary Microbiology and Pathology, Washington State University, Pullman, WA, United States of America
- Ananth Kalyanaraman
- School of EECS, Washington State University, Pullman, WA, United States of America
- Paul G. Allen School for Global Animal Health, Washington State University, Pullman, WA, United States of America
2
Moore AD, Held A, Terrapon N, Weiner J, Bornberg-Bauer E. DoMosaics: software for domain arrangement visualization and domain-centric analysis of proteins. Bioinformatics 2013; 30:282-3. [PMID: 24222210 DOI: 10.1093/bioinformatics/btt640] [Citation(s) in RCA: 41] [Impact Index Per Article: 3.7] [Indexed: 02/05/2023]
Abstract
DoMosaics is an application that unifies protein domain annotation, domain arrangement analysis and visualization in a single tool. It simplifies the analysis of protein families by consolidating disjunct procedures based on often inconvenient command-line applications and complex analysis tools. It provides a simple user interface with access to domain annotation services such as InterProScan or a local HMMER installation, and can be used to compare, analyze and visualize the evolution of domain architectures. AVAILABILITY AND IMPLEMENTATION DoMosaics is licensed under the Apache License, Version 2.0, and binaries can be freely obtained from www.domosaics.net.
Affiliation(s)
- Andrew D Moore
- Institute for Evolution and Biodiversity, Hüfferstrasse 1, Westphalian Wilhelms-University Münster, 48147 Münster, Germany, and Max Planck Institute for Infection Biology, Chariteplatz 1, 10117 Berlin, Germany
3
Piwowar M, Krzysztof P, Piotr P. ExonVisualiser - application for visualization exon units in 2D and 3D protein structures. Bioinformation 2012; 8:1280-2. [PMID: 23275735 PMCID: PMC3532015 DOI: 10.6026/97320630081280] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 11/12/2012] [Accepted: 11/14/2012] [Indexed: 11/23/2022]
Abstract
A web application for identifying and visualizing protein regions encoded by exons is presented. ExonVisualiser can be used for visualisation at different levels of protein structure: the primary (sequence) level, the secondary structure level, and the tertiary structure level. The programme is suitable for processing data for all genes whose expressed proteins are deposited in the PDB database. The application implements the following steps: I) loading exon sequences and their coordinates from a GenBank file, together with protein sequences (the CDS from GenBank and the amino acid sequence from PDB); II) creating a consensus sequence (comparing the amino acid sequence from the PDB file with the CDS sequence from the GenBank file); III) matching exon coordinates; IV) visualisation in 2D and 3D protein structures. Among other features, the web tool provides a color-coded graphical display of protein sequences and chains in three-dimensional protein structures, correlated with the corresponding exons. AVAILABILITY http://149.156.12.53/ExonVisualiser/
Affiliation(s)
- Monika Piwowar
- Department of Bioinformatics and Telemedicine, Collegium Medicum, Jagiellonian University, Lazarza 16, 31-530 Krakow, Poland
- Porembski Krzysztof
- Department of Bioinformatics and Telemedicine, Collegium Medicum, Jagiellonian University, Lazarza 16, 31-530 Krakow, Poland
- Piwowar Piotr
- Department of Measurement and Electronics, AGH University of Science and Technology, al. A. Mickiewicza 30, 30-059 Krakow, Poland
4
Bae K, Mallick BK, Elsik CG. Prediction of protein interdomain linker regions by a hidden Markov model. Bioinformatics 2005; 21:2264-70. [PMID: 15746283 DOI: 10.1093/bioinformatics/bti363] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Indexed: 11/14/2022]
Abstract
MOTIVATION Our aim was to predict protein interdomain linker regions using sequence alone, without requiring known homology. Identifying linker regions will delineate domain boundaries, and can be used to computationally dissect proteins into domains prior to clustering them into families. We developed a hidden Markov model of linker/non-linker sequence regions using a linker index derived from amino acid propensity. We employed an efficient Bayesian estimation of the model using Markov Chain Monte Carlo, Gibbs sampling in particular, to simulate parameters from the posteriors. Our model recognizes sequence data to be continuous rather than categorical, and generates a probabilistic output. RESULTS We applied our method to a dataset of protein sequences in which domains and interdomain linkers had been delineated using the Pfam-A database. The prediction results are superior to a simpler method that also uses linker index.
Affiliation(s)
- Kyounghwa Bae
- Department of Statistics, Texas A&M University College Station, TX 77843-3143, USA
5
Abstract
Classification of proteins into families is one of the main goals of functional analysis. Proteins are usually assigned to a family on the basis of the presence of family-specific patterns, domains, or structural elements. Whereas proteins belonging to the same family are generally similar to each other, the extent of similarity varies widely across families. Some families are characterized by short, well-defined motifs, whereas others contain longer, less-specific motifs. We present a simple method for visualizing such differences. We applied our method to the Arabidopsis thaliana families listed at The Arabidopsis Information Resource (TAIR) Web site and for 76% of the nontrivial families (families with more than one member), our method identifies simple similarity measures that are necessary and sufficient to cluster members of the family together. Our visualization method can be used as part of an annotation pipeline to identify potentially incorrectly defined families. We also describe how our method can be extended to identify novel families and to assign unclassified proteins into known families.
Affiliation(s)
- Vamsi Veeramachaneni
- Department of Biology, The Pennsylvania State University, University Park, Pennsylvania 16802, USA
6
Claudel-Renard C, Chevalet C, Faraut T, Kahn D. Enzyme-specific profiles for genome annotation: PRIAM. Nucleic Acids Res 2004; 31:6633-9. [PMID: 14602924 PMCID: PMC275543 DOI: 10.1093/nar/gkg847] [Citation(s) in RCA: 281] [Impact Index Per Article: 14.1] [Indexed: 11/12/2022]
Abstract
The advent of fully sequenced genomes opens the ground for the reconstruction of metabolic pathways on the basis of the identification of enzyme-coding genes. Here we describe PRIAM, a method for automated enzyme detection in a fully sequenced genome, based on the classification of enzymes in the ENZYME database. PRIAM relies on sets of position-specific scoring matrices ('profiles') automatically tailored for each ENZYME entry. Automatically generated logical rules define which of these profiles is required in order to infer the presence of the corresponding enzyme in an organism. As an example, PRIAM was applied to identify potential metabolic pathways from the complete genome of the nitrogen-fixing bacterium Sinorhizobium meliloti. The results of this automated method were compared with the original genome annotation and visualised on KEGG graphs in order to facilitate the interpretation of metabolic pathways and to highlight potentially missing enzymes.
Affiliation(s)
- Clotilde Claudel-Renard
- Laboratoire de Génétique Cellulaire, INRA, INRA/CNRS, BP27, 31326 Castanet-Tolosan Cedex, France
7
Mohseni-Zadeh S, Louis A, Brézellec P, Risler JL. PHYTOPROT: a database of clusters of plant proteins. Nucleic Acids Res 2004; 32:D351-3. [PMID: 14681432 PMCID: PMC308774 DOI: 10.1093/nar/gkh040] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.6] [Indexed: 11/14/2022]
Abstract
All the protein sequences from plants (including Arabidopsis thaliana) available from SwissProt/TrEMBL have been the subject of an all-by-all systematic comparison and grouped into clusters of related proteins. Within each cluster, the sequences have been submitted to pyramidal classification; in the case where two or several subfamilies have been grouped together, the pyramidal tree helps in finding which sequences make the links between subfamilies. In addition, the 'domains' that are common to two or more sequences within a cluster were determined and displayed à la ProDom. The resulting graphical representations proved to be quite efficient in pinpointing those protein sequences suffering from a probable error in the annotation of their genes. The clusters can be searched through various criteria and their pyramidal classifications and their domain representations can be displayed by querying http://genoplante-info.infobiogen.fr/phytoprot. The user can also launch a BLAST search of a query sequence against all the clusters.
Affiliation(s)
- S Mohseni-Zadeh
- Laboratoire Génome et Informatique, UMR 8116 and Infobiogen, Tour Evry 2, 523 Place des Terrasses, 91034 Evry Cedex, France
8
Holzerlandt R, Orengo C, Kellam P, Albà MM. Identification of new herpesvirus gene homologs in the human genome. Genome Res 2002; 12:1739-48. [PMID: 12421761 PMCID: PMC187546 DOI: 10.1101/gr.334302] [Citation(s) in RCA: 77] [Impact Index Per Article: 3.5] [Indexed: 12/23/2022]
Abstract
Viruses are intracellular parasites that use many cellular pathways during their replication. Large DNA viruses, such as herpesviruses, have captured a repertoire of cellular genes to block or mimic host immune responses, apoptosis regulation, and cell-cycle control mechanisms. We have conducted a systematic search for all homologs of herpesvirus proteins in the human genome using position-specific scoring matrices representing herpesvirus protein sequence domains, and pair-wise sequence comparisons. The analysis shows that approximately 13% of the herpesvirus proteins have clear sequence similarity to products of the human genome. Different human herpesviruses vary in their numbers of human homologs, indicating distinct rates of gene acquisition in different lineages. Our analysis has identified new families of herpesvirus/human homologs from viruses including human herpesvirus 5 (human cytomegalovirus; HCMV) and human herpesvirus 8 (Kaposi's sarcoma-associated herpesvirus; KSHV), which may play important roles in host-virus interactions.
MESH Headings
- Amino Acid Sequence/genetics
- Cytomegalovirus/genetics
- Databases, Genetic
- Databases, Protein
- Gene Transfer, Horizontal/genetics
- Genes, Viral/genetics
- Genome, Human
- Herpesviridae/genetics
- Herpesvirus 2, Gallid/genetics
- Herpesvirus 8, Human/genetics
- Humans
- Molecular Sequence Data
- Sequence Homology, Amino Acid
- Sequence Homology, Nucleic Acid
- Transformation, Genetic/genetics
- Viral Proteins/genetics
- Viral Structural Proteins/genetics
Affiliation(s)
- Ria Holzerlandt
- Wohl Virion Centre, Department of Immunology and Molecular Pathology, University College London, London W1T 4JF, United Kingdom
9
Louis A, Ollivier E, Aude JC, Risler JL. Massive sequence comparisons as a help in annotating genomic sequences. Genome Res 2001; 11:1296-303. [PMID: 11435413 PMCID: PMC311131 DOI: 10.1101/gr.gr-1776r] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.6] [Indexed: 11/25/2022]
Abstract
An all-by-all comparison of all the publicly available protein sequences from plants has been performed, followed by a clusterization process. Within each of the 1064 resulting clusters (containing sequences that are orthologous as well as paralogous), the sequences have been submitted to a pyramidal classification and their domains delineated by an automated procedure à la. This process provides a means for easily checking for any apparent inconsistency in a cluster, for example, whether one sequence is shorter or longer than the others, one domain is missing, etc. In such cases, the alignment of the DNA sequence of the gene with that of a close homologous protein often reveals (in 10% of the clusters) probable sequencing errors (leading to frameshifts) or probable wrong intron/exon predictions. The composition of the clusters, their pyramidal classifications, and domain decomposition, as well as our comments when appropriate, are available from http://chlora.infobiogen.fr:1234/PHYTOPROT.
Affiliation(s)
- A Louis
- Laboratoire Génome et Informatique, Université de Versailles, 78035 Versailles Cedex, France.
10
Nierman WC, Feldblyum TV, Laub MT, Paulsen IT, Nelson KE, Eisen JA, Heidelberg JF, Alley MR, Ohta N, Maddock JR, Potocka I, Nelson WC, Newton A, Stephens C, Phadke ND, Ely B, DeBoy RT, Dodson RJ, Durkin AS, Gwinn ML, Haft DH, Kolonay JF, Smit J, Craven MB, Khouri H, Shetty J, Berry K, Utterback T, Tran K, Wolf A, Vamathevan J, Ermolaeva M, White O, Salzberg SL, Venter JC, Shapiro L, Fraser CM, Eisen J. Complete genome sequence of Caulobacter crescentus. Proc Natl Acad Sci U S A 2001; 98:4136-41. [PMID: 11259647 PMCID: PMC31192 DOI: 10.1073/pnas.061029298] [Citation(s) in RCA: 388] [Impact Index Per Article: 16.9] [Indexed: 11/18/2022]
Abstract
The complete genome sequence of Caulobacter crescentus was determined to be 4,016,942 base pairs in a single circular chromosome encoding 3,767 genes. This organism, which grows in a dilute aquatic environment, coordinates the cell division cycle and multiple cell differentiation events. With the annotated genome sequence, a full description of the genetic network that controls bacterial differentiation, cell growth, and cell cycle progression is within reach. Two-component signal transduction proteins are known to play a significant role in cell cycle progression. Genome analysis revealed that the C. crescentus genome encodes a significantly higher number of these signaling proteins (105) than any bacterial genome sequenced thus far. Another regulatory mechanism involved in cell cycle progression is DNA methylation. The occurrence of the recognition sequence for an essential DNA methylating enzyme that is required for cell cycle regulation is severely limited and shows a bias to intergenic regions. The genome contains multiple clusters of genes encoding proteins essential for survival in a nutrient poor habitat. Included are those involved in chemotaxis, outer membrane channel function, degradation of aromatic ring compounds, and the breakdown of plant-derived carbon sources, in addition to many extracytoplasmic function sigma factors, providing the organism with the ability to respond to a wide range of environmental fluctuations. C. crescentus is, to our knowledge, the first free-living alpha-class proteobacterium to be sequenced and will serve as a foundation for exploring the biology of this group of bacteria, which includes the obligate endosymbiont and human pathogen Rickettsia prowazekii, the plant pathogen Agrobacterium tumefaciens, and the bovine and human pathogen Brucella abortus.
Affiliation(s)
- W C Nierman
- The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, MD 20850, USA.
11
Albà MM, Lee D, Pearl FM, Shepherd AJ, Martin N, Orengo CA, Kellam P. VIDA: a virus database system for the organization of animal virus genome open reading frames. Nucleic Acids Res 2001; 29:133-6. [PMID: 11125070 PMCID: PMC29831 DOI: 10.1093/nar/29.1.133] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.3] [Received: 09/01/2000] [Revised: 10/27/2000] [Accepted: 10/27/2000] [Indexed: 11/13/2022]
Abstract
VIDA is a new virus database that organizes open reading frames (ORFs) from partial and complete genomic sequences from animal viruses. Currently VIDA includes all sequences from GenBank for Herpesviridae, Coronaviridae and Arteriviridae. The ORFs are organized into homologous protein families, which are identified on the basis of sequence similarity relationships. Conserved sequence regions of potential functional importance are identified and can be retrieved as sequence alignments. We use a controlled taxonomical and functional classification for all the proteins and protein families in the database. When available, protein structures that are related to the families have also been included. The database is available for online search and sequence information retrieval at http://www.biochem.ucl.ac.uk/bsm/virus_database/VIDA.html.
Affiliation(s)
- M M Albà
- Wohl Virion Centre, Department of Immunology and Molecular Pathology, Windeyer Institute of Medical Sciences, University College London, London, UK
12
Albà MM, Das R, Orengo CA, Kellam P. Genomewide function conservation and phylogeny in the Herpesviridae. Genome Res 2001; 11:43-54. [PMID: 11156614 PMCID: PMC311046 DOI: 10.1101/gr.149801] [Citation(s) in RCA: 58] [Impact Index Per Article: 2.5] [Indexed: 11/25/2022]
Abstract
The Herpesviridae are a large group of well-characterized double-stranded DNA viruses for which many complete genome sequences have been determined. We have extracted protein sequences from all predicted open reading frames of 19 herpesvirus genomes. Sequence comparison and protein sequence clustering methods have been used to construct herpesvirus protein homologous families. This resulted in 1692 proteins being clustered into 243 multiprotein families and 196 singleton proteins. Predicted functions were assigned to each homologous family based on genome annotation and published data and each family classified into seven broad functional groups. Phylogenetic profiles were constructed for each herpesvirus from the homologous protein families and used to determine conserved functions and genomewide phylogenetic trees. These trees agreed with molecular-sequence-derived trees and allowed greater insight into the phylogeny of ungulate and murine gammaherpesviruses.
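The phylogenetic profiles mentioned in the abstract above can be pictured as presence/absence vectors of protein families across genomes. The sketch below is only an illustration under stated assumptions: the family and genome names are invented, the helper names are hypothetical, and the Jaccard distance is a stand-in chosen for this example, since the abstract does not say which distance or tree-building method the authors used.

```python
def phylogenetic_profiles(families, genomes):
    """Map each genome to a 0/1 tuple with one entry per protein family.

    An entry is 1 when the family has at least one member in that genome
    (hypothetical helper for illustration).
    """
    return {
        g: tuple(1 if g in members else 0 for members in families.values())
        for g in genomes
    }


def jaccard_distance(p, q):
    """1 minus (families shared by both genomes / families present in either)."""
    both = sum(1 for a, b in zip(p, q) if a and b)
    either = sum(1 for a, b in zip(p, q) if a or b)
    return 1.0 - both / either if either else 0.0


# Invented toy data: family name -> set of genomes containing a member.
families = {
    "famA": {"g1", "g2", "g3"},
    "famB": {"g1", "g2"},
    "famC": {"g3"},
}
genomes = ["g1", "g2", "g3"]
profiles = phylogenetic_profiles(families, genomes)
```

A matrix of pairwise distances between such profiles could then be passed to any standard distance-based tree-building method to obtain a genome-level tree.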
Affiliation(s)
- M M Albà
- Wohl Virion Centre, Department of Immunology and Molecular Pathology, University College London, London W1T 4JF, UK