1
Wanchai V, Nookaew I, Ussery DW. ProdMX: Rapid query and analysis of protein functional domain based on compressed sparse matrices. Comput Struct Biotechnol J 2020; 18:3890-3896. [PMID: 33335686 PMCID: PMC7719867 DOI: 10.1016/j.csbj.2020.10.023] [Received: 06/30/2020] [Revised: 10/20/2020] [Accepted: 10/23/2020]
Abstract
Large-scale protein analysis has been used to characterize large numbers of proteins across numerous species. One application is high-throughput screening of genomes for pathogenicity. Unlike sequence homology methods, protein comparison at a functional level provides a unique opportunity to classify proteins based on their functional structures, without dealing with the sequence complexity of distantly related species. Protein functions can be abstractly described by a set of protein functional domains, such as PfamA domains; a set of genomes can then be mapped to a matrix, with each row representing a genome and each column representing the presence or absence of a given functional domain. However, a powerful tool is needed to analyze the large sparse matrices generated by the millions of genomes that will become available in the near future. ProdMX is a tool with user-friendly utilities developed to facilitate high-throughput analysis of proteins, and it can be included as a module in a high-throughput pipeline. ProdMX employs a compressed sparse matrix algorithm to reduce the computational resources and time needed for matrix manipulation during functional domain analysis. ProdMX is a free and publicly available Python package that can be installed from PyPI or Conda, or with a standard installer from the source code available in the ProdMX GitHub repository at https://github.com/visanuwan/prodmx.
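The genome-by-domain matrix described in this abstract can be illustrated with a minimal sketch of compressed sparse row (CSR) storage. This is not ProdMX code and does not use the ProdMX API; the function names, the toy genomes, and the domain accessions used as placeholders are all assumptions made for illustration. Because the matrix is binary, only the column indices of present domains are stored.

```python
from typing import Dict, List, Set, Tuple


def build_csr(genome_domains: Dict[str, Set[str]]) -> Tuple[List[str], List[str], List[int], List[int]]:
    """Build a binary CSR matrix (rows = genomes, columns = domains).

    Absent domains are not stored at all. Returns the row labels, the
    column labels, the row pointer array and the column index array.
    """
    genomes = sorted(genome_domains)
    domains = sorted(set().union(*genome_domains.values()))
    col_of = {d: j for j, d in enumerate(domains)}
    indptr = [0]
    indices: List[int] = []
    for g in genomes:
        # Append the sorted column indices of the domains present in this genome.
        indices.extend(sorted(col_of[d] for d in genome_domains[g]))
        indptr.append(len(indices))
    return genomes, domains, indptr, indices


def row_domains(i: int, domains: List[str], indptr: List[int], indices: List[int]) -> List[str]:
    """Return the domains present in row i by slicing the index array."""
    return [domains[j] for j in indices[indptr[i]:indptr[i + 1]]]


def domain_counts(domains: List[str], indices: List[int]) -> Dict[str, int]:
    """Count how many genomes contain each domain (column sums)."""
    counts = [0] * len(domains)
    for j in indices:
        counts[j] += 1
    return dict(zip(domains, counts))


# Hypothetical toy input: genome name -> set of domain accessions.
toy = {
    "genome_A": {"PF00001", "PF00002"},
    "genome_B": {"PF00002"},
    "genome_C": {"PF00002", "PF00003"},
}
genomes, domains, indptr, indices = build_csr(toy)
print(indptr, indices)
print(row_domains(2, domains, indptr, indices))
print(domain_counts(domains, indices))
```

Storage grows with the number of present domains rather than with genomes times domains, which is the property that makes such matrices practical at large scale.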
Affiliation(s)
- Visanu Wanchai
- Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
- Intawat Nookaew
- Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
- David W Ussery
- Arkansas Center for Genomic Epidemiology & Medicine and The Department of Biomedical Informatics, University of Arkansas for Medical Sciences, Little Rock, AR 72205, USA
2
Abstract
This chapter reviews current research on how protein domain architectures evolve. We begin by summarizing work on the phylogenetic distribution of proteins, as this directly affects which domain architectures can be formed in different species. Studies relating domain family size to occurrence have shown that family sizes generally follow power-law distributions, both within genomes and within larger evolutionary groups. These findings were subsequently extended to multi-domain architectures. Genome evolution models that have been suggested to explain the shape of these distributions are reviewed, as well as evidence for selective pressure to expand certain domain families more than others. Each domain has an intrinsic combinatorial propensity, and the effects of this have been studied using measures of domain versatility or promiscuity. Next, we study the principles of protein domain architecture evolution and how these have been inferred from distributions of extant domain arrangements. Following this, we review inferences of ancestral domain architecture and the conclusions concerning domain architecture evolution mechanisms that can be drawn from these. Finally, we examine whether all known cases of a given domain architecture can be assumed to have a single common origin (monophyly) or have evolved convergently (polyphyly). We end with a discussion of some available tools for computational analysis or exploitation of protein domain architectures and their evolution.
3
Adebali O, Zhulin IB. Aquerium: A web application for comparative exploration of domain-based protein occurrences on the taxonomically clustered genome tree. Proteins 2016; 85:72-77. [PMID: 27802571 DOI: 10.1002/prot.25199] [Received: 07/26/2016] [Accepted: 10/20/2016]
Abstract
Gene duplication and loss are major driving forces in evolution. While many important genomic resources provide information on gene presence, there is a lack of tools giving equal importance to presence and absence information as well as web platforms enabling easy visual comparison of multiple domain-based protein occurrences at once. Here, we present Aquerium, a platform for visualizing genomic presence and absence of biomolecules with a focus on protein domain architectures. The web server offers advanced domain organization querying against the database of pre-computed domains for ∼26,000 organisms and it can be utilized for identification of evolutionary events, such as fusion, disassociation, duplication, and shuffling of protein domains. The tool also allows alternative inputs of custom entries or BLASTP results for visualization. Aquerium will be a useful tool for biologists who perform comparative genomic and evolutionary analyses. The web server is freely accessible at http://aquerium.utk.edu.
Affiliation(s)
- Ogun Adebali
- UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee, 37996; Department of Microbiology, University of Tennessee, Knoxville, Tennessee, 37996; Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 37961
- Igor B Zhulin
- UT-ORNL Graduate School of Genome Science and Technology, University of Tennessee, Knoxville, Tennessee, 37996; Department of Microbiology, University of Tennessee, Knoxville, Tennessee, 37996; Computer Science and Mathematics Division, Oak Ridge National Laboratory, Oak Ridge, Tennessee, 37961
4
Moore AD, Held A, Terrapon N, Weiner J, Bornberg-Bauer E. DoMosaics: software for domain arrangement visualization and domain-centric analysis of proteins. Bioinformatics 2013; 30:282-3. [PMID: 24222210 DOI: 10.1093/bioinformatics/btt640]
Abstract
DoMosaics is an application that unifies protein domain annotation, domain arrangement analysis and visualization in a single tool. It simplifies the analysis of protein families by consolidating disjunct procedures based on often inconvenient command-line applications and complex analysis tools. It provides a simple user interface with access to domain annotation services such as InterProScan or a local HMMER installation, and can be used to compare, analyze and visualize the evolution of domain architectures. Availability and implementation: DoMosaics is licensed under the Apache License, Version 2.0, and binaries can be freely obtained from www.domosaics.net.
Affiliation(s)
- Andrew D Moore
- Institute for Evolution and Biodiversity, Hüfferstrasse 1, Westphalian Wilhelms-University Münster, 48147 Münster, Germany, and Max Planck Institute for Infection Biology, Chariteplatz 1, 10117 Berlin, Germany
5
Genome of a SAR116 bacteriophage shows the prevalence of this phage type in the oceans. Proc Natl Acad Sci U S A 2013; 110:12343-8. [PMID: 23798439 DOI: 10.1073/pnas.1219930110]
Abstract
The abundance, genetic diversity, and crucial ecological and evolutionary roles of marine phages have prompted a large number of metagenomic studies. However, obtaining a thorough understanding of marine phages has been hampered by the low number of phage isolates infecting major bacterial groups other than cyanophages and pelagiphages. Therefore, there is an urgent requirement for the isolation of phages that infect abundant marine bacterial groups. In this study, we isolated and characterized HMO-2011, a phage infecting a bacterium of the SAR116 clade, one of the most abundant marine bacterial lineages. HMO-2011, which infects "Candidatus Puniceispirillum marinum" strain IMCC1322, has an ~55-kb dsDNA genome that harbors many genes with novel features rarely found in cultured organisms, including genes encoding a DNA polymerase with a partial DnaJ central domain and an atypical methanesulfonate monooxygenase. Furthermore, homologs of nearly all HMO-2011 genes were predominantly found in marine metagenomes rather than cultured organisms, suggesting the novelty of HMO-2011 and the prevalence of this phage type in the oceans. A significant number of the viral metagenome sequences obtained from the ocean surface were best assigned to the HMO-2011 genome. The number of reads assigned to HMO-2011 accounted for 10.3%-25.3% of the total reads assigned to viruses in seven viromes from the Pacific and Indian Oceans, making the HMO-2011 genome the most or second-most frequently assigned viral genome. Given its ability to infect the abundant SAR116 clade and its widespread distribution, Puniceispirillum phage HMO-2011 could be an important resource for marine virus research.
6
Abstract
With the development of ultra-high-throughput technologies, the cost of sequencing bacterial genomes has been vastly reduced. As more genomes are sequenced, less time can be spent manually annotating those genomes, resulting in an increased reliance on automatic annotation pipelines. However, automatic pipelines can produce inaccurate genome annotation and their results often require manual curation. Here, we discuss the automatic and manual annotation of bacterial genomes, identify common problems introduced by the current genome annotation process and suggest potential solutions.
Affiliation(s)
- Emily J Richardson
- The Roslin Institute, University of Edinburgh, Easter Bush, EH25 9RG, UK
7
Punta M, Coggill PC, Eberhardt RY, Mistry J, Tate J, Boursnell C, Pang N, Forslund K, Ceric G, Clements J, Heger A, Holm L, Sonnhammer ELL, Eddy SR, Bateman A, Finn RD. The Pfam protein families database. Nucleic Acids Res 2011; 40:D290-301. [PMID: 22127870 PMCID: PMC3245129 DOI: 10.1093/nar/gkr1065]
Abstract
Pfam is a widely used database of protein families, currently containing more than 13,000 manually curated protein families as of release 26.0. Pfam is available via servers in the UK (http://pfam.sanger.ac.uk/), the USA (http://pfam.janelia.org/) and Sweden (http://pfam.sbc.su.se/). Here, we report on changes that have occurred since our 2010 NAR paper (release 24.0). Over the last 2 years, we have generated 1840 new families and increased coverage of the UniProt Knowledgebase (UniProtKB) to nearly 80%. Notably, we have taken the step of opening up the annotation of our families to the Wikipedia community, by linking Pfam families to relevant Wikipedia pages and encouraging the Pfam and Wikipedia communities to improve and expand those pages. We continue to improve the Pfam website and add new visualizations, such as the 'sunburst' representation of taxonomic distribution of families. In this work we additionally address two topics that will be of particular interest to the Pfam community. First, we explain the definition and use of family-specific, manually curated gathering thresholds. Second, we discuss some of the features of domains of unknown function (also known as DUFs), which constitute a rapidly growing class of families within Pfam.
Affiliation(s)
- Marco Punta
- Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton CB10 1SA, UK.
8
FACT: functional annotation transfer between proteins with similar feature architectures. BMC Bioinformatics 2010; 11:417. [PMID: 20696036 PMCID: PMC2931517 DOI: 10.1186/1471-2105-11-417] [Received: 03/15/2010] [Accepted: 08/09/2010]
Abstract
Background: The increasing number of sequenced genomes provides the basis for exploring the genetic and functional diversity within the tree of life. Only a tiny fraction of the encoded proteins undergoes a thorough experimental characterization. For the remainder, bioinformatics annotation tools are the only means to infer their function. Exploiting significant sequence similarities to already characterized proteins, commonly taken as evidence for homology, is the prevalent method to deduce functional equivalence. Such methods fail when homologs are too diverged, or when they have assumed a different function. Finally, due to convergent evolution, functional equivalence is not necessarily linked to common ancestry. Therefore complementary approaches are required to identify functional equivalents.
Results: We present the Feature Architecture Comparison Tool (http://www.cibiv.at/FACT) to search for functionally equivalent proteins. FACT uses the similarity between feature architectures of two proteins, i.e., the arrangements of functional domains, secondary structure elements and compositional properties, as a proxy for their functional equivalence. A scoring function measures feature architecture similarities, which enables searching for functional equivalents in entire proteomes. Our evaluation of 9,570 EC classified enzymes revealed that FACT, using the full feature set, outperformed the existing architecture-based approaches by identifying significantly more functional equivalents as highest scoring proteins. We show that FACT can identify functional equivalents that share no significant sequence similarity. However, when the highest scoring protein of FACT is also the protein with the highest local sequence similarity, it is in 99% of the cases functionally equivalent to the query. We demonstrate the versatility of FACT by identifying a missing link in the yeast glutathione metabolism and also by searching for the human GolgA5 equivalent in Trypanosoma brucei.
Conclusions: FACT facilitates a quick and sensitive search for functionally equivalent proteins in entire proteomes. FACT is complementary to approaches using sequence similarity to identify proteins with the same function. Thus, FACT is particularly useful when functional equivalents need to be identified in evolutionarily distant species, or when functional equivalents are not homologous. The most reliable annotation transfers, however, are achieved when feature architecture similarity and sequence similarity are jointly taken into account.
9
Park BH, Karpinets TV, Syed MH, Leuze MR, Uberbacher EC. CAZymes Analysis Toolkit (CAT): web service for searching and analyzing carbohydrate-active enzymes in a newly sequenced organism using CAZy database. Glycobiology 2010; 20:1574-84. [PMID: 20696711 DOI: 10.1093/glycob/cwq106]
Abstract
The Carbohydrate-Active Enzyme (CAZy) database provides a rich set of manually annotated enzymes that degrade, modify, or create glycosidic bonds. Despite rich and invaluable information stored in the database, software tools utilizing this information for annotation of newly sequenced genomes by CAZy families are limited. We have employed two annotation approaches to fill the gap between manually curated high-quality protein sequences collected in the CAZy database and the growing number of other protein sequences produced by genome or metagenome sequencing projects. The first approach is based on a similarity search against the entire nonredundant sequences of the CAZy database. The second approach performs annotation using links or correspondences between the CAZy families and protein family domains. The links were discovered using the association rule learning algorithm applied to sequences from the CAZy database. The approaches complement each other and in combination achieved high specificity and sensitivity when cross-evaluated with the manually curated genomes of Clostridium thermocellum ATCC 27405 and Saccharophagus degradans 2-40. The capability of the proposed framework to predict the function of unknown protein domains and of hypothetical proteins in the genome of Neurospora crassa is demonstrated. The framework is implemented as a Web service, the CAZymes Analysis Toolkit, and is available at http://cricket.ornl.gov/cgi-bin/cat.cgi.
10
Tamuri AU, Laskowski RA. ArchSchema: a tool for interactive graphing of related Pfam domain architectures. Bioinformatics 2010; 26:1260-1. [DOI: 10.1093/bioinformatics/btq119]
11
Abstract
Background: The general method used to determine the function of newly discovered proteins is to transfer annotations from well-characterized homologous proteins. The process of selecting homologous proteins can largely be classified into sequence-based and domain-based approaches. Domain-based methods have several advantages over sequence-based methods for identifying distant homology and homology among proteins with multiple domains. However, these methods are challenged by large families defined by 'promiscuous' (or 'mobile') domains.
Results: Here we present a measure of domain architecture similarity, called Weighed Domain Architecture Comparison (WDAC), which can be used to identify homologs of multidomain proteins. To distinguish promiscuous domains from conventional protein domains, we assigned a weight score to each Pfam domain extracted from RefSeq proteins, based on its abundance and versatility. To measure the similarity of two domain architectures, cosine similarity (a similarity measure used in information retrieval) is used. We combined sequence similarity with domain architecture comparisons to identify proteins belonging to the same domain architecture. Using human and nematode proteomes, we compared WDAC with an unweighted domain architecture method (DAC) to evaluate the effectiveness of domain weight scores. We found that WDAC is better at identifying homology among multidomain proteins.
Conclusion: Our analysis indicates that considering domain weight scores in domain architecture comparisons improves protein homology identification. We developed a web-based server to allow users to compare their proteins with protein domain architectures.
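The idea of comparing domain architectures by cosine similarity with per-domain weights can be sketched as follows. This is an illustrative sketch only: the abstract does not give the WDAC weight formula, so the weights below are hypothetical values chosen to down-weight a promiscuous domain, and the function name and domain labels are assumptions. Each architecture is treated as a vector whose entry for a present domain is that domain's weight.

```python
import math
from typing import Dict, Iterable


def weighted_cosine(arch_a: Iterable[str], arch_b: Iterable[str], weights: Dict[str, float]) -> float:
    """Cosine similarity between two domain architectures.

    Each architecture becomes a vector with weights[d] at every present
    domain d (default weight 1.0, i.e. the unweighted case).
    """
    a, b = set(arch_a), set(arch_b)
    dot = sum(weights.get(d, 1.0) ** 2 for d in a & b)
    norm_a = math.sqrt(sum(weights.get(d, 1.0) ** 2 for d in a))
    norm_b = math.sqrt(sum(weights.get(d, 1.0) ** 2 for d in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Hypothetical weights: the promiscuous domain gets a low weight.
weights = {"SH3": 0.2, "Kinase": 1.0, "PH": 1.0}
arch_1 = ["SH3", "Kinase"]
arch_2 = ["SH3", "PH"]

unweighted = weighted_cosine(arch_1, arch_2, {})      # every domain counts equally
weighted = weighted_cosine(arch_1, arch_2, weights)   # shared promiscuous domain counts little
print(round(unweighted, 3), round(weighted, 3))
```

In this toy case the two proteins share only the low-weight domain, so the weighted score falls well below the unweighted one, which is the effect the abstract attributes to domain weight scores.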
Affiliation(s)
- Byungwook Lee
- Korean BioInformation Center, KRIBB, Daejeon 305-806, Korea.
12
Wichadakul D, Numnark S, Ingsriswang S. d-Omix: a mixer of generic protein domain analysis tools. Nucleic Acids Res 2009; 37:W417-21. [PMID: 19465389 PMCID: PMC2703976 DOI: 10.1093/nar/gkp329]
Abstract
Domain combination provides important clues to the roles of protein domains in protein function, interaction and evolution. We have developed a web server, d-Omix (a Mixer of Protein Domain Analysis Tools), intended as a unified platform to analyze, compare and visualize protein data sets in various aspects of protein domain combinations. With InterProScan files for protein sets of interest provided by users, the server incorporates four services for domain analyses. First, it constructs a protein phylogenetic tree based on a distance matrix calculated from protein domain architectures (DAs), allowing comparison with a sequence-based tree. Second, it calculates and visualizes the versatility, abundance and co-presence of protein domains via a domain graph. Third, it compares the similarity of proteins based on DA alignment. Fourth, it builds a putative protein network derived from domain–domain interactions from DOMINE. Users may select a variety of input data files and flexibly choose domain search tools (e.g. hmmpfam, superfamily) for a specific analysis. Results from d-Omix can be interactively explored and exported into various formats such as SVG, JPG, BMP and CSV. Users with only protein sequences can prepare an InterProScan file using a service provided by the server as well. The d-Omix web server is freely available at http://www.biotec.or.th/isl/Domix.
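The abundance, versatility and co-presence statistics mentioned in this abstract can be sketched with a small domain-graph computation. The definitions used here are assumptions, since the abstract does not state them: abundance is taken as the number of proteins containing a domain, co-presence as the number of proteins containing both domains of a pair, and versatility as the number of distinct partner domains. The function name and the toy proteins are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations
from typing import Dict, List


def domain_graph_stats(architectures: Dict[str, List[str]]):
    """Compute per-domain abundance and versatility, and pairwise co-presence."""
    abundance = Counter()         # domain -> number of proteins containing it
    co_presence = Counter()       # (domain_a, domain_b) -> number of shared proteins
    partners = defaultdict(set)   # domain -> set of domains it co-occurs with
    for domain_list in architectures.values():
        unique = sorted(set(domain_list))  # ignore repeats within one protein
        abundance.update(unique)
        for a, b in combinations(unique, 2):
            co_presence[(a, b)] += 1       # pair keys are in sorted order
            partners[a].add(b)
            partners[b].add(a)
    versatility = {d: len(partners[d]) for d in abundance}
    return abundance, versatility, co_presence


# Hypothetical toy input: protein name -> ordered list of domains.
toy_proteins = {
    "p1": ["SH3", "SH2", "Kinase"],
    "p2": ["SH3", "PH"],
    "p3": ["Kinase"],
}
abundance, versatility, co_presence = domain_graph_stats(toy_proteins)
print(dict(abundance))
print(versatility)
print(dict(co_presence))
```

The co-presence counter doubles as a weighted edge list, so the same output can be drawn as a domain graph with nodes sized by abundance.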
Affiliation(s)
- Duangdao Wichadakul
- National Center for Genetic Engineering and Biotechnology (BIOTEC) - Information Systems Laboratory, Pathumthani, Thailand.