1
Cao H, Ma Q, Chen X, Xu Y. DOOR: a prokaryotic operon database for genome analyses and functional inference. Brief Bioinform 2020; 20:1568-1577. [PMID: 28968679 DOI: 10.1093/bib/bbx088] [Citation(s) in RCA: 24] [Impact Index Per Article: 4.8] [Received: 05/02/2017] [Revised: 06/13/2017] [Indexed: 11/14/2022] Open
Abstract
The rapid accumulation of fully sequenced prokaryotic genomes provides unprecedented information for biological studies of bacterial and archaeal organisms in a systematic manner. Operons are the basic functional units for conducting such studies. Here, we review an operon database DOOR (the Database of prOkaryotic OpeRons) that we have previously developed and continue to update. Currently, the database contains 6 975 454 computationally predicted operons in 2072 complete genomes. In addition, the database also contains the following information: (i) transcriptional units for 24 genomes derived using publicly available transcriptomic data; (ii) orthologous gene mapping across genomes; (iii) 6408 cis-regulatory motifs for transcriptional factors of some operons for 203 genomes; (iv) 3 456 718 Rho-independent terminators for 2072 genomes; as well as (v) a suite of tools in support of applications of the predicted operons. In this review, we will explain how such data are computationally derived and demonstrate how they can be used to derive a wide range of higher-level information needed for systems biology studies to tackle complex and fundamental biology questions.
2
Reyes PFL, Michoel T, Joshi A, Devailly G. Meta-analysis of Liver and Heart Transcriptomic Data for Functional Annotation Transfer in Mammalian Orthologs. Comput Struct Biotechnol J 2017; 15:425-432. [PMID: 29187960 PMCID: PMC5691612 DOI: 10.1016/j.csbj.2017.08.002] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Received: 03/31/2017] [Revised: 08/10/2017] [Accepted: 08/11/2017] [Indexed: 11/30/2022] Open
Abstract
Functional annotation transfer across multi-gene family orthologs can lead to functional misannotations. We hypothesised that co-expression network will help predict functional orthologs amongst complex homologous gene families. To explore the use of transcriptomic data available in public domain to identify functionally equivalent ones from all predicted orthologs, we collected genome wide expression data in mouse and rat liver from over 1500 experiments with varied treatments. We used a hyper-graph clustering method to identify clusters of orthologous genes co-expressed in both mouse and rat. We validated these clusters by analysing expression profiles in each species separately, and demonstrating a high overlap. We then focused on genes in 18 homology groups with one-to-many or many-to-many relationships between two species, to discriminate between functionally equivalent and non-equivalent orthologs. Finally, we further applied our method by collecting heart transcriptomic data (over 1400 experiments) in rat and mouse to validate the method in an independent tissue.
Affiliation(s)
- Tom Michoel
- The Roslin Institute, The University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, Scotland, UK
- Anagha Joshi
- The Roslin Institute, The University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, Scotland, UK
- Guillaume Devailly
- The Roslin Institute, The University of Edinburgh, Easter Bush, Midlothian, EH25 9RG, Scotland, UK
3
Bouadjenek MR, Verspoor K, Zobel J. Automated detection of records in biological sequence databases that are inconsistent with the literature. J Biomed Inform 2017. [PMID: 28624643 DOI: 10.1016/j.jbi.2017.06.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/25/2023]
Abstract
We investigate and analyse the data quality of nucleotide sequence databases with the objective of automatic detection of data anomalies and suspicious records. Specifically, we demonstrate that the published literature associated with each data record can be used to automatically evaluate its quality, by cross-checking the consistency of the key content of the database record with the referenced publications. Focusing on GenBank, we describe a set of quality indicators based on the relevance paradigm of information retrieval (IR). Then, we use these quality indicators to train an anomaly detection algorithm to classify records as "confident" or "suspicious". Our experiments on the PubMed Central collection show assessing the coherence between the literature and database records, through our algorithms, is an effective mechanism for assisting curators to perform data cleansing. Although fewer than 0.25% of the records in our data set are known to be faulty, we would expect that there are many more in GenBank that have not yet been identified. By automated comparison with literature they can be identified with a precision of up to 10% and a recall of up to 30%, while strongly outperforming several baselines. While these results leave substantial room for improvement, they reflect both the very imbalanced nature of the data, and the limited explicitly labelled data that is available. Overall, the obtained results show promise for the development of a new kind of approach to detecting low-quality and suspicious sequence records based on literature analysis and consistency. From a practical point of view, this will greatly help curators in identifying inconsistent records in large-scale sequence databases by highlighting records that are likely to be inconsistent with the literature.
Affiliation(s)
- Mohamed Reda Bouadjenek
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
- Karin Verspoor
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
- Justin Zobel
- Department of Computing and Information Systems, The University of Melbourne, Parkville 3053, Australia.
4
Zallot R, Harrison KJ, Kolaczkowski B, de Crécy-Lagard V. Functional Annotations of Paralogs: A Blessing and a Curse. Life (Basel) 2016; 6:life6030039. [PMID: 27618105 PMCID: PMC5041015 DOI: 10.3390/life6030039] [Citation(s) in RCA: 39] [Impact Index Per Article: 4.3] [Received: 07/01/2016] [Revised: 08/29/2016] [Accepted: 09/02/2016] [Indexed: 12/15/2022] Open
Abstract
Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information such as genome context, phylogeny, metabolic reconstruction and signature motifs may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation processes using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.
Affiliation(s)
- Rémi Zallot
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Katherine J Harrison
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Bryan Kolaczkowski
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
- Valérie de Crécy-Lagard
- Department of Microbiology and Cell Science, Institute of Food and Agricultural Sciences, University of Florida, Gainesville, FL 32611, USA.
5
Li Y, Rao N, Yang F, Zhang Y, Yang Y, Liu HM, Guo F, Huang J. Biocomputional construction of a gene network under acid stress in Synechocystis sp. PCC 6803. Res Microbiol 2014; 165:420-8. [PMID: 24787285 DOI: 10.1016/j.resmic.2014.04.004] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 03/18/2014] [Accepted: 04/14/2014] [Indexed: 11/25/2022]
Abstract
Acid stress is one of the most serious threats that cyanobacteria have to face, and it has an impact at all levels from genome to phenotype. However, very little is known about the detailed response mechanism to acid stress in this species. We present here a general analysis of the gene regulatory network of Synechocystis sp. PCC 6803 in response to acid stress using comparative genome analysis and biocomputational prediction. In this study, we collected 85 genes and used them as an initial template to predict new genes through co-regulation, protein-protein interactions and the phylogenetic profile, and 179 new genes were obtained to form a complete template. In addition, we found that 11 enriched pathways such as glycolysis are closely related to the acid stress response. Finally, we constructed a regulatory network for the intricate relationship of these genes and summarize the key steps in response to acid stress. This is the first time a bioinformatic approach has been taken systematically to gene interactions in cyanobacteria and the elaboration of their cell metabolism and regulatory pathways under acid stress, which is more efficient than a traditional experimental study. The results also provide theoretical support for similar research into environmental stresses in cyanobacteria and possible industrial applications.
Affiliation(s)
- Yi Li
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Nini Rao
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Feng Yang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Ying Zhang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Yang Yang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Han-ming Liu
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Fengbiao Guo
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
- Jian Huang
- School of Life Science and Technology, University of Electronic Science and Technology of China, Chengdu, China
6
Prokaryotic phylogenies inferred from whole-genome sequence and annotation data. BIOMED RESEARCH INTERNATIONAL 2013; 2013:409062. [PMID: 24073404 PMCID: PMC3773407 DOI: 10.1155/2013/409062] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 04/15/2013] [Revised: 06/26/2013] [Accepted: 07/22/2013] [Indexed: 11/25/2022]
Abstract
Phylogenetic trees are used to represent the evolutionary relationship among various groups of species. In this paper, a novel method for inferring prokaryotic phylogenies using multiple genomic information is proposed. The method is called CGCPhy and based on the distance matrix of orthologous gene clusters between whole-genome pairs. CGCPhy comprises four main steps. First, orthologous genes are determined by sequence similarity, genomic function, and genomic structure information. Second, genes involving potential HGT events are eliminated, since such genes are considered to be the highly conserved genes across different species and the genes located on fragments with abnormal genome barcode. Third, we calculate the distance of the orthologous gene clusters between each genome pair in terms of the number of orthologous genes in conserved clusters. Finally, the neighbor-joining method is employed to construct phylogenetic trees across different species. CGCPhy has been examined on different datasets from 617 complete single-chromosome prokaryotic genomes and achieved applicative accuracies on different species sets in agreement with Bergey's taxonomy in quartet topologies. Simulation results show that CGCPhy achieves high average accuracy and has a low standard deviation on different datasets, so it has an applicative potential for phylogenetic analysis.
7
Gupta A, Sharma V, Tewari AK, SurenderKumar V, Wadhwa G, Mathur A, Sharma SK, Jain CK. Comparative Molecular docking analysis of DNA Gyrase subunit A in Pseudomonas aeruginosa PAO1. Bioinformation 2013; 9:116-20. [PMID: 23423379 PMCID: PMC3569597 DOI: 10.6026/97320630009116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/17/2012] [Revised: 12/28/2012] [Accepted: 01/03/2013] [Indexed: 12/02/2022] Open
Abstract
Pseudomonas aeruginosa is an opportunistic bacterium known for causing chronic infections in cystic fibrosis (CF) and chronic obstructive pulmonary disease (COPD) patients. Recently, several drug targets in Pseudomonas aeruginosa PAO1 have been reported using network biology approaches on the basis of essentiality and topology and further ranked on network measures, viz. degree and centrality. To date, no drug/ligand molecule has been reported against these targets. In our work we have identified the ligand/drug molecules through orthologous gene mapping against Bacillus subtilis subsp. subtilis str. 168 and performed modelling and docking analysis. From the predicted drug targets in PA PAO1, we selected those drug targets which show statistically significant orthology with a model organism and whose orthologs are present in all the selected drug targets of PA PAO1. Modeling of their structure has been done using I-Tasser web server. Orthologous gene mapping has been performed using Cluster of Orthologs (COGs) and, based on orthology, drugs available for Bacillus sp. have been docked with PA PAO1 protein drug targets using MoleGro virtual docker version 4.0.2. The orthologous gene for PA3168 gyrA is BS gyrA, found in Bacillus subtilis subsp. subtilis str. 168. The drugs cited for Bacillus sp. have been docked with PA genes and energy analyses have been made. Based on orthologous gene mapping and in-silico studies, Nalidixic acid is reported as an effective drug against PA3168 gyrA for the treatment of CF and COPD.
Affiliation(s)
- Aman Gupta
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
- Vanashika Sharma
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
- Ashish Kumar Tewari
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
- Vipul SurenderKumar
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
- Gulshan Wadhwa
- Department of Biotechnology (DBT), Ministry of Science & Technology, New Delhi, Delhi, 110003, India
- Ashwani Mathur
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
- Sanjeev Kumar Sharma
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
- Chakresh Kumar Jain
- Department of Biotechnology, Jaypee Institute of Information Technology, A-10, Sector-62, Noida, U.P-201301, India
8
CINPER: an interactive web system for pathway prediction for prokaryotes. PLoS One 2012; 7:e51252. [PMID: 23236458 PMCID: PMC3517448 DOI: 10.1371/journal.pone.0051252] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Received: 05/03/2012] [Accepted: 10/30/2012] [Indexed: 11/19/2022] Open
Abstract
We present a web-based network-construction system, CINPER (CSBL INteractive Pathway BuildER), to assist a user to build a user-specified gene network for a prokaryotic organism in an intuitive manner. CINPER builds a network model based on different types of information provided by the user and stored in the system. CINPER’s prediction process has four steps: (i) collection of template networks based on (partially) known pathways of related organism(s) from the SEED or BioCyc database and the published literature; (ii) construction of an initial network model based on the template networks using the P-Map program; (iii) expansion of the initial model, based on the association information derived from operons, protein-protein interactions, co-expression modules and phylogenetic profiles; and (iv) computational validation of the predicted models based on gene expression data. To facilitate easy applications, CINPER provides an interactive visualization environment for a user to enter, search and edit relevant data and for the system to display (partial) results and prompt for additional data. Evaluation of CINPER on 17 well-studied pathways in the MetaCyc database shows that the program achieves an average recall rate of 76% and an average precision rate of 90% on the initial models; and a higher average recall rate at 87% and an average precision rate at 28% on the final models. The reduced precision rate in the final models versus the initial models reflects the reality that the final models have large numbers of novel genes that have no experimental evidences and hence are not yet collected in the MetaCyc database. To demonstrate the usefulness of this server, we have predicted an iron homeostasis gene network of Synechocystis sp. PCC6803 using the server. The predicted models along with the server can be accessed at http://csbl.bmb.uga.edu/cinper/.
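The recall and precision figures quoted in this abstract follow the standard set-based definitions, and the trade-off between the initial and final models can be illustrated with a small sketch. The gene identifiers and the `precision_recall` helper below are hypothetical and are not part of CINPER; the sketch only assumes that a predicted model and a reference pathway are each represented as a set of gene IDs.

```python
from typing import Set, Tuple


def precision_recall(predicted: Set[str], reference: Set[str]) -> Tuple[float, float]:
    """Return (precision, recall) of a predicted gene set against a reference set."""
    true_positives = len(predicted & reference)
    # Guard against empty sets so the ratios are always defined.
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(reference) if reference else 0.0
    return precision, recall


# Hypothetical reference pathway and two hypothetical predictions.
reference = {"g1", "g2", "g3", "g4"}
initial_model = {"g1", "g2", "g3"}
# The expanded model recovers g4 but also adds genes absent from the reference.
final_model = {"g1", "g2", "g3", "g4", "n1", "n2", "n3", "n4"}

initial_scores = precision_recall(initial_model, reference)
final_scores = precision_recall(final_model, reference)
```

In this toy case the expansion raises recall from 0.75 to 1.0 while precision drops from 1.0 to 0.5, because the added genes `n1` to `n4` are absent from the reference. This mirrors the effect the abstract attributes to novel genes not yet collected in MetaCyc.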
9
Li G, Ma Q, Mao X, Yin Y, Zhu X, Xu Y. Integration of sequence-similarity and functional association information can overcome intrinsic problems in orthology mapping across bacterial genomes. Nucleic Acids Res 2011; 39:e150. [PMID: 21965536 PMCID: PMC3239196 DOI: 10.1093/nar/gkr766] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 12/03/2022] Open
Abstract
Existing methods for orthologous gene mapping suffer from two general problems: (i) they are computationally too slow and their results are difficult to interpret for automated large-scale applications when based on phylogenetic analyses; or (ii) they are too prone to making mistakes in dealing with complex situations involving horizontal gene transfers and gene fusion due to the lack of a sound basis when based on sequence similarity information. We present a novel algorithm, Global Optimization Strategy (GOST), for orthologous gene mapping through combining sequence similarity and contextual (working partners) information, using a combinatorial optimization framework. Genome-scale applications of GOST show substantial improvements over the predictions by three popular sequence similarity-based orthology mapping programs. Our analysis indicates that our algorithm overcomes the intrinsic issues faced by sequence similarity-based methods, when orthology mapping involves gene fusions and horizontal gene transfers. Our program runs as efficiently as the most efficient sequence similarity-based algorithm in the public domain. GOST is freely downloadable at http://csbl.bmb.uga.edu/~maqin/GOST.
Affiliation(s)
- Guojun Li
- Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, Computational Systems Biology Laboratory, University of Georgia, Athens, GA 30602, USA
10
Mao X, Zhang Y, Xu Y. SEAS: a system for SEED-based pathway enrichment analysis. PLoS One 2011; 6:e22556. [PMID: 21799897 PMCID: PMC3142180 DOI: 10.1371/journal.pone.0022556] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 03/18/2011] [Accepted: 06/24/2011] [Indexed: 11/18/2022] Open
Abstract
Pathway enrichment analysis represents a key technique for analyzing high-throughput omic data, and it can help to link individual genes or proteins found to be differentially expressed under specific conditions to well-understood biological pathways. We present here a computational tool, SEAS, for pathway enrichment analysis over a given set of genes in a specified organism against the pathways (or subsystems) in the SEED database, a popular pathway database for bacteria. SEAS maps a given set of genes of a bacterium to pathway genes covered by SEED through gene ID and/or orthology mapping, and then calculates the statistical significance of the enrichment of each relevant SEED pathway by the mapped genes. Our evaluation of SEAS indicates that the program provides highly reliable pathway mapping results and identifies more organism-specific pathways than similar existing programs. SEAS is publicly released under the GPL license agreement and freely available at http://csbl.bmb.uga.edu/~xizeng/research/seas/.
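The abstract does not say which statistic SEAS uses to score enrichment. A common choice for this kind of gene-set analysis is the hypergeometric upper-tail probability, sketched below under that assumption. The function name and the toy numbers are illustrative only and are not taken from SEAS.

```python
from math import comb


def hypergeom_enrichment_p(population: int, pathway_size: int,
                           sample_size: int, overlap: int) -> float:
    """P(X >= overlap) when sample_size genes are drawn without replacement
    from a population that contains pathway_size pathway genes."""
    if not (0 <= pathway_size <= population and 0 <= sample_size <= population):
        raise ValueError("pathway_size and sample_size must lie in [0, population]")
    lowest = max(overlap, 0)
    highest = min(pathway_size, sample_size)
    total = comb(population, sample_size)
    # Sum the probability mass of every overlap at least as large as observed.
    tail = sum(
        comb(pathway_size, k) * comb(population - pathway_size, sample_size - k)
        for k in range(lowest, highest + 1)
    )
    return tail / total


# Toy numbers (assumed): 10 genes in the genome, 4 in the pathway,
# 3 genes in the submitted list, 2 of which map to the pathway.
p_value = hypergeom_enrichment_p(population=10, pathway_size=4,
                                 sample_size=3, overlap=2)
```

A small `p_value` means the observed overlap would rarely arise by chance. In practice one value is computed per pathway, and a multiple-testing correction is usually applied across pathways.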
Affiliation(s)
- Xizeng Mao
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America
- Yu Zhang
- College of Computer Science and Technology, Jilin University, Changchun, China
- Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, Changchun, China
- Ying Xu
- Computational Systems Biology Lab, Department of Biochemistry and Molecular Biology, Institute of Bioinformatics, University of Georgia, Athens, Georgia, United States of America
- BioEnergy Science Center BESC, University of Georgia, Athens, Georgia, United States of America
- College of Computer Science and Technology, Jilin University, Changchun, China
11
Mao X, Olman V, Stuart R, Paulsen IT, Palenik B, Xu Y. Computational prediction of the osmoregulation network in Synechococcus sp. WH8102. BMC Genomics 2010; 11:291. [PMID: 20459751 PMCID: PMC2874817 DOI: 10.1186/1471-2164-11-291] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.7] [Received: 09/28/2009] [Accepted: 05/10/2010] [Indexed: 11/16/2022] Open
Abstract
Background: Osmotic stress is caused by sudden changes in the impermeable solute concentration around a cell, which induces instantaneous water flow in or out of the cell to balance the concentration. Very little is known about the detailed response mechanism to osmotic stress in marine Synechococcus, one of the major oxygenic phototrophic cyanobacterial genera that contribute greatly to the global CO2 fixation. Results: We present here a computational study of the osmoregulation network in response to hyperosmotic stress of Synechococcus sp strain WH8102 using comparative genome analyses and computational prediction. In this study, we identified the key transporters, synthetases, signal sensor proteins and transcriptional regulator proteins, and found experimentally that of these proteins, 15 genes showed significantly changed expression levels under a mild hyperosmotic stress. Conclusions: From the predicted network model, we have made a number of interesting observations about WH8102. Specifically, we found that (i) the organism likely uses glycine betaine as the major osmolyte, and others such as glucosylglycerol, glucosylglycerate, trehalose, sucrose and arginine as the minor osmolytes, making it efficient and adaptable to its changing environment; and (ii) σ38, one of the seven types of σ factors, probably serves as a global regulator coordinating the osmoregulation network and the other relevant networks.
Affiliation(s)
- Xizeng Mao
- Department of Biochemistry and Molecular Biology and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
12
Shi G, Zhang L, Jiang T. MSOAR 2.0: Incorporating tandem duplications into ortholog assignment based on genome rearrangement. BMC Bioinformatics 2010; 11:10. [PMID: 20053291 PMCID: PMC2821317 DOI: 10.1186/1471-2105-11-10] [Citation(s) in RCA: 44] [Impact Index Per Article: 2.9] [Received: 09/02/2009] [Accepted: 01/06/2010] [Indexed: 11/28/2022] Open
Abstract
BACKGROUND: Ortholog assignment is a critical and fundamental problem in comparative genomics, since orthologs are considered to be functional counterparts in different species and can be used to infer molecular functions of one species from those of other species. MSOAR is a recently developed high-throughput system for assigning one-to-one orthologs between closely related species on a genome scale. It attempts to reconstruct the evolutionary history of input genomes in terms of genome rearrangement and gene duplication events. It assumes that a gene duplication event inserts a duplicated gene into the genome of interest at a random location (i.e., the random duplication model). However, in practice, biologists believe that genes are often duplicated by tandem duplications, where a duplicated gene is located next to the original copy (i.e., the tandem duplication model). RESULTS: In this paper, we develop MSOAR 2.0, an improved system for one-to-one ortholog assignment. For a pair of input genomes, the system first focuses on the tandemly duplicated genes of each genome and tries to identify among them those that were duplicated after the speciation (i.e., the so-called inparalogs), using a simple phylogenetic tree reconciliation method. For each such set of tandemly duplicated inparalogs, all but one gene will be deleted from the concerned genome (because they cannot possibly appear in any one-to-one ortholog pairs), and MSOAR is invoked. Using both simulated and real data experiments, we show that MSOAR 2.0 is able to achieve a better sensitivity and specificity than MSOAR. In comparison with the well-known genome-scale ortholog assignment tool InParanoid, Ensembl ortholog database, and the orthology information extracted from the well-known whole-genome multiple alignment program MultiZ, MSOAR 2.0 shows the highest sensitivity. Although the specificity of MSOAR 2.0 is slightly worse than that of InParanoid in the real data experiments, it is actually better than that of InParanoid in the simulation tests. CONCLUSIONS: Our preliminary experimental results demonstrate that MSOAR 2.0 is a highly accurate tool for one-to-one ortholog assignment between closely related genomes. The software is available to the public for free and included as online supplementary material.
Affiliation(s)
- Guanqun Shi
- Department of Computer Science, University of California, Riverside, CA 92521, USA
- Liqing Zhang
- Department of Computer Science, Virginia Tech, Blacksburg, VA 24060, USA
- Tao Jiang
- Department of Computer Science, University of California, Riverside, CA 92521, USA
13
New proteins orthologous to cerato-platanin in various Ceratocystis species and the purification and characterization of cerato-populin from Ceratocystis populicola. Appl Microbiol Biotechnol 2009; 84:309-22. [DOI: 10.1007/s00253-009-1998-4] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.4] [Received: 12/03/2008] [Revised: 03/31/2009] [Accepted: 03/31/2009] [Indexed: 10/20/2022]
14
Lima T, Auchincloss AH, Coudert E, Keller G, Michoud K, Rivoire C, Bulliard V, de Castro E, Lachaize C, Baratin D, Phan I, Bougueleret L, Bairoch A. HAMAP: a database of completely sequenced microbial proteome sets and manually curated microbial protein families in UniProtKB/Swiss-Prot. Nucleic Acids Res 2008; 37:D471-8. [PMID: 18849571 PMCID: PMC2686602 DOI: 10.1093/nar/gkn661] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.8] [Indexed: 11/12/2022] Open
Abstract
The growth in the number of completely sequenced microbial genomes (bacterial and archaeal) has generated a need for a procedure that provides UniProtKB/Swiss-Prot-quality annotation to as many protein sequences as possible. We have devised a semi-automated system, HAMAP (High-quality Automated and Manual Annotation of microbial Proteomes), that uses manually built annotation templates for protein families to propagate annotation to all members of manually defined protein families, using very strict criteria. The HAMAP system is composed of two databases, the proteome database and the family database, and of an automatic annotation pipeline. The proteome database comprises biological and sequence information for each completely sequenced microbial proteome, and it offers several tools for CDS searches, BLAST options and retrieval of specific sets of proteins. The family database currently comprises more than 1500 manually curated protein families and their annotation templates that are used to annotate proteins that belong to one of the HAMAP families. On the HAMAP website, individual sequences as well as whole genomes can be scanned against all HAMAP families. The system provides warnings for the absence of conserved amino acid residues, unusual sequence length, etc. Thanks to the implementation of HAMAP, more than 200,000 microbial proteins have been fully annotated in UniProtKB/Swiss-Prot (HAMAP website: http://www.expasy.org/sprot/hamap).
Affiliation(s)
- Tania Lima
- Swiss-Prot Group, Swiss Institute of Bioinformatics, 1 rue Michel-Servet, 1211 Geneva 4, Switzerland.
15
The quest for orthologs: finding the corresponding gene across genomes. Trends Genet 2008; 24:539-51. [PMID: 18819722 DOI: 10.1016/j.tig.2008.08.009] [Citation(s) in RCA: 238] [Impact Index Per Article: 14.0] [Received: 09/12/2007] [Revised: 08/20/2008] [Accepted: 08/21/2008] [Indexed: 11/23/2022]
Abstract
Orthology is a key evolutionary concept in many areas of genomic research. It provides a framework for subjects as diverse as the evolution of genomes, gene functions, cellular networks and functional genome annotation. Although orthologous proteins usually perform equivalent functions in different species, establishing true orthologous relationships requires a phylogenetic approach, which combines both trees and graphs (networks) using reliable species phylogeny and available genomic data from more than two species, and an insight into the processes of molecular evolution. Here, we evaluate the available bioinformatics tools and provide a set of guidelines to aid researchers in choosing the most appropriate tool for any situation.
|
16
|
Li J, Wu XD, Hao ST, Wang XJ, Ling HQ. Proteomic response to iron deficiency in tomato root. Proteomics 2008; 8:2299-311. [DOI: 10.1002/pmic.200700942] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.3] [Indexed: 11/12/2022]
|
17
|
Lintner RE, Mishra PK, Srivastava P, Martinez-Vaz BM, Khodursky AB, Blumenthal RM. Limited functional conservation of a global regulator among related bacterial genera: Lrp in Escherichia, Proteus and Vibrio. BMC Microbiol 2008; 8:60. [PMID: 18405378 PMCID: PMC2374795 DOI: 10.1186/1471-2180-8-60] [Citation(s) in RCA: 28] [Impact Index Per Article: 1.6] [Received: 11/16/2007] [Accepted: 04/11/2008] [Indexed: 02/03/2023] Open
Abstract
Background: Bacterial genome sequences are being determined rapidly, but few species are physiologically well characterized. Predicting regulation from genome sequences usually involves extrapolation from better-studied bacteria, using the hypothesis that a conserved regulator, conserved target gene, and predicted regulator-binding site in the target promoter imply conserved regulation between the two species. However, many compared organisms are ecologically and physiologically diverse, and the limits of extrapolation have not been well tested. In E. coli K-12 the leucine-responsive regulatory protein (Lrp) affects expression of ~400 genes. Proteus mirabilis and Vibrio cholerae have highly conserved lrp orthologs (98% and 92% identity to E. coli lrp). The functional equivalence of Lrp from these related species was assessed. Results: Heterologous Lrp regulated gltB, livK and lrp transcriptional fusions in an E. coli background in the same general way as the native Lrp, though with significant differences in extent. Microarray analysis of these strains revealed that the heterologous Lrp proteins significantly influence only about half of the genes affected by native Lrp. In P. mirabilis, heterologous Lrp restored swarming, though with some pattern differences. P. mirabilis produced substantially more Lrp than E. coli or V. cholerae under some conditions. Lrp regulation of target gene orthologs differed among the three native hosts. Strikingly, while Lrp negatively regulates its own gene in E. coli, and was shown to do so even more strongly in P. mirabilis, Lrp appears to activate its own gene in V. cholerae. Conclusion: The overall similarity of regulatory effects of the Lrp orthologs supports the use of extrapolation between related strains for general purposes. However, this study also revealed intrinsic differences even between orthologous regulators sharing >90% overall identity, and 100% identity for the DNA-binding helix-turn-helix motif, as well as differences in the amounts of those regulators. These results suggest that predicting regulation of specific target genes based on genome sequence comparisons alone should be done on a conservative basis.
Affiliation(s)
- Robert E Lintner
- Department of Medical Microbiology and Immunology, University of Toledo Health Sciences Center, Toledo, OH 43614-2598, USA.
|
18
|
Wu H, Mao F, Olman V, Xu Y. On application of directons to functional classification of genes in prokaryotes. Comput Biol Chem 2008; 32:176-84. [PMID: 18440870 DOI: 10.1016/j.compbiolchem.2008.02.007] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 09/24/2007] [Accepted: 02/15/2008] [Indexed: 11/30/2022]
Abstract
Functional classification of genes represents one of the most basic problems in genome analysis and annotation. Our analysis of some of the popular methods for functional classification of genes shows that these methods are not always consistent with each other and may not be specific enough for high-resolution gene functional annotations. We have developed a method to integrate genomic neighborhood information of genes with their sequence similarity information for the functional classification of prokaryotic genes. The application of our method to 93 proteobacterial genomes has shown that (i) the genomic neighborhoods are much more conserved across prokaryotic genomes than expected by chance, and such conservation can be utilized to improve functional classification of genes; (ii) while our method is consistent with the existing popular schemes as much as they are among themselves, it does provide functional classification at higher resolution and hence allows functional assignments of (new) genes at a more specific level; and (iii) our method is fairly stable when being applied to different genomes.
Affiliation(s)
- Hongwei Wu
- School of Electrical and Computer Engineering, Georgia Institute of Technology, Savannah, GA 31407, USA
|
19
|
Lemoine F, Lespinet O, Labedan B. Assessing the evolutionary rate of positional orthologous genes in prokaryotes using synteny data. BMC Evol Biol 2007; 7:237. [PMID: 18047665 PMCID: PMC2238764 DOI: 10.1186/1471-2148-7-237] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.7] [Received: 06/22/2007] [Accepted: 11/29/2007] [Indexed: 11/15/2022] Open
Abstract
Background: Comparison of completely sequenced microbial genomes has revealed how fluid these genomes are. Detecting synteny blocks requires reliable methods to determine the orthologs among the whole set of homologs detected by exhaustive comparisons between each pair of completely sequenced genomes. This is a complex and difficult problem in the field of comparative genomics but will help to better understand the way prokaryotic genomes are evolving. Results: We have developed a suite of programs that automate three essential steps to study conservation of gene order, and validated them with a set of 107 bacteria and archaea that cover the majority of the prokaryotic taxonomic space. We identified the whole set of shared homologs between two or more species and computed the evolutionary distance separating each pair of homologs. We applied two strategies to extract from the set of homologs a collection of valid orthologs shared by at least two genomes. The first computes the Reciprocal Smallest Distance (RSD) using the PAM distances separating pairs of homologs. The second method groups homologs in families and reconstructs each family's evolutionary tree, distinguishing bona fide orthologs as well as paralogs created after the last speciation event. Although the phylogenetic tree method often succeeds where RSD fails, the reverse could occasionally be true. Accordingly, we used the data obtained with either method, or their intersection, to number the orthologs that are adjacent in each pair of genomes, the Positional Orthologous Genes (POGs), and to further study their properties. Once all these synteny blocks had been detected, we showed that POGs are subject to more evolutionary constraints than orthologs outside synteny groups, whatever the taxonomic distance separating the compared organisms. Conclusion: The suite of programs described in this paper allows a reliable detection of orthologs and is useful for evaluating gene order conservation in prokaryotes whatever their taxonomic distance. Thus, our approach will ease the rapid identification of POGs in the next few years, as we are expecting to be inundated with thousands of completely sequenced microbial genomes.
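The Reciprocal Smallest Distance idea named in the abstract above can be illustrated with a minimal sketch. This is not the authors' program: the function name, the input layout (a dictionary of pairwise distances between genes of two genomes) and the example values are all assumptions made for illustration, and the distances here are invented numbers rather than real PAM distances.

```python
from typing import Dict, List, Tuple


def reciprocal_smallest_distance(
    dist: Dict[Tuple[str, str], float]
) -> List[Tuple[str, str]]:
    """Return gene pairs (a, b) that are each other's closest homolog.

    `dist` maps (gene_in_genome_A, gene_in_genome_B) to an evolutionary
    distance. The layout and names are hypothetical, chosen for this sketch.
    """
    best_for_a: Dict[str, Tuple[str, float]] = {}
    best_for_b: Dict[str, Tuple[str, float]] = {}
    for (a, b), d in dist.items():
        # Track the closest B gene for every A gene ...
        if a not in best_for_a or d < best_for_a[a][1]:
            best_for_a[a] = (b, d)
        # ... and the closest A gene for every B gene.
        if b not in best_for_b or d < best_for_b[b][1]:
            best_for_b[b] = (a, d)
    # Keep only pairs where the choice is mutual.
    return sorted(
        (a, b) for a, (b, _) in best_for_a.items() if best_for_b[b][0] == a
    )


# Invented example distances between three genes of A and two genes of B.
example = {
    ("a1", "b1"): 0.1,
    ("a1", "b2"): 0.5,
    ("a2", "b1"): 0.3,
    ("a2", "b2"): 0.2,
    ("a3", "b2"): 0.4,
}
pairs = reciprocal_smallest_distance(example)
```

In this toy input, `a3` is closest to `b2`, but `b2` is closer to `a2`, so `a3` is left without a putative ortholog. That asymmetry is the kind of case where, as the abstract notes, a tree-based method may succeed where RSD fails.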
Affiliation(s)
- Frédéric Lemoine
- Institut de Génétique et Microbiologie, CNRS UMR 8621, Bâtiment 400, Université Paris Sud XI, 91405 Orsay Cedex, France.
|
20
|
The multiple facets of homology and their use in comparative genomics to study the evolution of genes, genomes, and species. Biochimie 2007; 90:595-608. [PMID: 17961904 DOI: 10.1016/j.biochi.2007.09.010] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Received: 06/14/2007] [Accepted: 09/14/2007] [Indexed: 11/23/2022]
Abstract
The incredible development of comparative genomics during the last decade has required a correct use of the concept of homology that was previously utilized only by evolutionary biologists. Unhappily, this concept has been often misunderstood and thus misused when exploited outside its evolutionary context. This review brings back to the correct definition of homology and explains how this definition has been progressively refined in order to adapt it to the various new kinds of analysis of gene properties and of their products that appear with the progress of comparative genomics. Then, we illustrate the power and the proficiency of such a concept when using the available genomics data in order to study the evolution of individual genes, of entire genomes and of species, respectively. After explaining how we detect homologues by an exhaustive comparison of a hundred of complete proteomes, we describe three main lines of research we have developed in the recent years. The first one exploits synteny and gene context data to better understand the mechanisms of genome evolution in prokaryotes. The second one is based on phylogenomics approaches to reconstruct the tree of life. The last one is devoted to reminding that protein homology is often limited to structural segments (SOH=segment of homology or module). Detecting and numbering modules allows tracing back protein history by identifying the events of gene duplication and gene fusion. We insist that one of the main present difficulties in such studies is a lack of a reliable method to identify genuine orthologues. Finally, we show how these homology studies are helpful to annotate genes and genomes and to study the complexity of the relationships between sequence and function of a gene.
|
21
|
GenMAPP 2: new features and resources for pathway analysis. BMC Bioinformatics 2007; 8:217. [PMID: 17588266 PMCID: PMC1924866 DOI: 10.1186/1471-2105-8-217] [Citation(s) in RCA: 205] [Impact Index Per Article: 11.4] [Received: 11/16/2006] [Accepted: 06/24/2007] [Indexed: 12/03/2022] Open
Abstract
Background: Microarray technologies have evolved rapidly, enabling biologists to quantify genome-wide levels of gene expression, alternative splicing, and sequence variations for a variety of species. Analyzing and displaying these data present a significant challenge. Pathway-based approaches for analyzing microarray data have proven useful for presenting data and for generating testable hypotheses. Results: To address the growing needs of the microarray community we have released version 2 of Gene Map Annotator and Pathway Profiler (GenMAPP), a new GenMAPP database schema, and integrated resources for pathway analysis. We have redesigned the GenMAPP database to support multiple gene annotations and species as well as custom species database creation for a potentially unlimited number of species. We have expanded our pathway resources by utilizing homology information to translate pathway content between species and extending existing pathways with data derived from conserved protein interactions and coexpression. We have implemented a new mode of data visualization to support analysis of complex data, including time-course, single nucleotide polymorphism (SNP), and splicing. GenMAPP version 2 also offers innovative ways to display and share data by incorporating HTML export of analyses for entire sets of pathways as organized web pages. Conclusion: GenMAPP version 2 provides a means to rapidly interrogate complex experimental data for pathway-level changes in a diverse range of organisms.
|
22
|
Wu H, Mao F, Olman V, Xu Y. Hierarchical classification of functionally equivalent genes in prokaryotes. Nucleic Acids Res 2007; 35:2125-40. [PMID: 17353185 PMCID: PMC1874638 DOI: 10.1093/nar/gkl1114] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.4] [Received: 09/22/2006] [Revised: 11/15/2006] [Accepted: 12/06/2006] [Indexed: 11/20/2022] Open
Abstract
Functional classification of genes represents a fundamental problem to many biological studies. Most of the existing classification schemes are based on the concepts of homology and orthology, which were originally introduced to study gene evolution but might not be the most appropriate for gene function prediction, particularly at high resolution level. We have recently developed a scheme for hierarchical classification of genes (HCGs) in prokaryotes. In the HCG scheme, the functional equivalence relationships among genes are first assessed through a careful application of both sequence similarity and genomic neighborhood information; and genes are then classified into a hierarchical structure of clusters, where genes in each cluster are functionally equivalent at some resolution level, and the level of resolution goes higher as the clusters become increasingly smaller traveling down the hierarchy. The HCG scheme is validated through comparisons with the taxonomy of the prokaryotic genomes, Clusters of Orthologous Groups (COGs) of genes and the Pfam system. We have applied the HCG scheme to 224 complete prokaryotic genomes, and constructed a HCG database consisting of a forest of 5339 multi-level and 15 770 single-level trees of gene clusters covering approximately 93% of the genes of these 224 genomes. The validation results indicate that the HCG scheme not only captures the key features of the existing classification schemes but also provides a much richer organization of genes which can be used for functional prediction of genes at higher resolution and to help reveal evolutionary trace of the genes.
Affiliation(s)
- Ying Xu
- Computational Systems Biology Laboratory, Department of Biochemistry and Molecular Biology, and Institute of Bioinformatics, University of Georgia, Athens, GA 30602, USA
|
23
|
Abstract
We present a study on computational identification of uber-operons in a prokaryotic genome, each of which represents a group of operons that are evolutionarily or functionally associated through operons in other (reference) genomes. Uber-operons represent a rich set of footprints of operon evolution, whose full utilization could lead to new and more powerful tools for elucidation of biological pathways and networks than what operons have provided, and a better understanding of prokaryotic genome structures and evolution. Our prediction algorithm predicts uber-operons through identifying groups of functionally or transcriptionally related operons, whose gene sets are conserved across the target and multiple reference genomes. Using this algorithm, we have predicted uber-operons for each of a group of 91 genomes, using the other 90 genomes as references. In particular, we predicted 158 uber-operons in Escherichia coli K12 covering 1830 genes, and found that many of the uber-operons correspond to parts of known regulons or biological pathways or are involved in highly related biological processes based on their Gene Ontology (GO) assignments. For some of the predicted uber-operons that are not parts of known regulons or pathways, our analyses indicate that their genes are highly likely to work together in the same biological processes, suggesting the possibility of new regulons and pathways. We believe that our uber-operon prediction provides a highly useful capability and a rich information source for elucidation of complex biological processes, such as pathways in microbes. All the prediction results are available at our Uber-Operon Database: , the first of its kind.
Affiliation(s)
- Dongsheng Che
- Department of Computer Science, University of Georgia, USA
- Guojun Li
- Department of Biochemistry and Molecular Biology, University of Georgia, USA
- School of Mathematics and System Sciences, Shandong University, China
- Fenglou Mao
- Department of Biochemistry and Molecular Biology, University of Georgia, USA
- Hongwei Wu
- Department of Biochemistry and Molecular Biology, University of Georgia, USA
- Ying Xu
- Department of Biochemistry and Molecular Biology, University of Georgia, USA
- Department of Computer Science, University of Georgia, USA
- To whom correspondence should be addressed. Tel: 1 706 542 9779; Fax: 1 706 542 9751; Ying Xu
|
24
|
Su Z, Mao F, Dam P, Wu H, Olman V, Paulsen IT, Palenik B, Xu Y. Computational inference and experimental validation of the nitrogen assimilation regulatory network in cyanobacterium Synechococcus sp. WH 8102. Nucleic Acids Res 2006; 34:1050-65. [PMID: 16473855 PMCID: PMC1363776 DOI: 10.1093/nar/gkj496] [Citation(s) in RCA: 45] [Impact Index Per Article: 2.4] [Indexed: 11/23/2022] Open
Abstract
Deciphering the regulatory networks encoded in the genome of an organism represents one of the most interesting and challenging tasks in the post-genome sequencing era. As an example of this problem, we have predicted a detailed model for the nitrogen assimilation network in cyanobacterium Synechococcus sp. WH 8102 (WH8102) using a computational protocol based on comparative genomics analysis and mining experimental data from related organisms that are relatively well studied. This computational model is in excellent agreement with the microarray gene expression data collected under ammonium-rich versus nitrate-rich growth conditions, suggesting that our computational protocol is capable of predicting biological pathways/networks with high accuracy. We then refined the computational model using the microarray data, and proposed a new model for the nitrogen assimilation network in WH8102. An intriguing discovery from this study is that nitrogen assimilation affects the expression of many genes involved in photosynthesis, suggesting a tight coordination between nitrogen assimilation and photosynthesis processes. Moreover, for some of these genes, this coordination is probably mediated by NtcA through the canonical NtcA promoters in their regulatory regions.
Affiliation(s)
- Zhengchang Su
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Fenglou Mao
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Phuongan Dam
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Hongwei Wu
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- Victor Olman
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Ian T. Paulsen
- The Institute of Genome Research, Rockville, MD 20850, USA
- Brian Palenik
- Scripps Institution of Oceanography, University of California at San Diego, San Diego, CA 92093, USA
- Ying Xu
- Department of Biochemistry and Molecular Biology, University of Georgia, Athens, GA 30602, USA
- Computational Biology Institute, Oak Ridge National Laboratory, Oak Ridge, TN 37831, USA
- To whom correspondence should be addressed at Department of Biochemistry and Molecular Biology, A110 Life Sciences Building, 120 Green Street, University of Georgia, Athens, GA, 30602. Tel: +1 706 542 9779; Fax: +1 706 542 9751
|