1
Structure-Aware Mycobacterium tuberculosis Functional Annotation Uncloaks Resistance, Metabolic, and Virulence Genes. mSystems 2021; 6:e0067321. [PMID: 34726489 PMCID: PMC8562490 DOI: 10.1128/msystems.00673-21] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/30/2022] Open
Abstract
Accurate and timely functional genome annotation is essential for translating basic pathogen research into clinically impactful advances. Here, through literature curation and structure-function inference, we systematically update the functional genome annotation of Mycobacterium tuberculosis virulent type strain H37Rv. First, we systematically curated annotations for 589 genes from 662 publications, including 282 gene products absent from leading databases. Second, we modeled 1,711 underannotated proteins and developed a semiautomated pipeline that captured shared function between 400 protein models and structural matches of known function on Protein Data Bank, including drug efflux proteins, metabolic enzymes, and virulence factors. In aggregate, these structure- and literature-derived annotations update 940/1,725 underannotated H37Rv genes and generate hundreds of functional hypotheses. Retrospectively applying the annotation to a recent whole-genome transposon mutant screen provided missing function for 48% (13/27) of underannotated genes altering antibiotic efficacy and 33% (23/69) required for persistence during mouse tuberculosis (TB) infection. Prospective application of the protein models enabled us to functionally interpret novel laboratory generated pyrazinamide (PZA)-resistant mutants of unknown function, which implicated the emerging coenzyme A depletion model of PZA action in the mutants’ PZA resistance. Our findings demonstrate the functional insight gained by integrating structural modeling and systematic literature curation, even for widely studied microorganisms. Functional annotations and protein structure models are available at https://tuberculosis.sdsu.edu/H37Rv in human- and machine-readable formats. IMPORTANCE Mycobacterium tuberculosis, the primary causative agent of tuberculosis, kills more humans than any other infectious bacterium. 
Yet 40% of its genome is functionally uncharacterized, leaving much about the genetic basis of its resistance to antibiotics, capacity to withstand host immunity, and basic metabolism yet undiscovered. Irregular literature curation for functional annotation contributes to this gap. We systematically curated functions from literature and structural similarity for over half of poorly characterized genes, expanding the functionally annotated Mycobacterium tuberculosis proteome. Applying this updated annotation to recent in vivo functional screens added functional information to dozens of clinically pertinent proteins described as having unknown function. Integrating the annotations with a prospective functional screen identified new mutants resistant to a first-line TB drug, supporting an emerging hypothesis for its mode of action. These improvements in functional interpretation of clinically informative studies underscore the translational value of this functional knowledge. Structure-derived annotations identify hundreds of high-confidence candidates for mechanisms of antibiotic resistance, virulence factors, and basic metabolism and other functions key in clinical and basic tuberculosis research. More broadly, they provide a systematic framework for improving prokaryotic reference annotations.
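The proportions quoted in this abstract can be rechecked with a few lines of arithmetic. The sketch below uses only the counts stated in the abstract; the dictionary labels and the `percent` helper are illustrative names, not part of the published work.

```python
# Counts quoted in the abstract; the labels are paraphrased for illustration.
reported = {
    "underannotated genes updated": (940, 1725),
    "antibiotic-efficacy genes given a function": (13, 27),
    "mouse-persistence genes given a function": (23, 69),
}


def percent(numerator: int, denominator: int) -> float:
    """Return numerator/denominator expressed as a percentage."""
    return 100.0 * numerator / denominator


for label, (hits, total) in reported.items():
    print(f"{label}: {hits}/{total} = {percent(hits, total):.1f}%")
```

The first ratio comes out above 50%, which is consistent with the abstract's statement that over half of poorly characterized genes received updated annotations.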
2
Zhang C, Zheng W, Cheng M, Omenn GS, Freddolino PL, Zhang Y. Functions of Essential Genes and a Scale-Free Protein Interaction Network Revealed by Structure-Based Function and Interaction Prediction for a Minimal Genome. J Proteome Res 2021; 20:1178-1189. [PMID: 33393786 PMCID: PMC7867644 DOI: 10.1021/acs.jproteome.0c00359] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Indexed: 12/11/2022]
Abstract
When the JCVI-syn3.0 genome was designed and implemented in 2016 as the minimal genome of a free-living organism, approximately one-third of the 438 protein-coding genes had no known function. Subsequent refinement into JCVI-syn3A led to inclusion of 16 additional protein-coding genes, including several of unknown function, resulting in an improved growth phenotype. Here, we seek to unveil the biological roles and protein-protein interaction (PPI) networks for these poorly characterized proteins using state-of-the-art deep learning contact-assisted structure prediction, followed by structure-based annotation of functions and PPI predictions. Our pipeline is able to confidently assign functions for many previously unannotated proteins such as putative vitamin transporters, which suggests the importance of nutrient uptake even in a minimized genome. Remarkably, despite the artificial selection of genes in the minimal syn3 genome, our reconstructed PPI network still shows a power law distribution of node degrees typical of naturally evolved bacterial PPI networks. Making use of our framework for combined structure/function/interaction modeling, we are able to identify both fundamental aspects of network biology that are retained in a minimal proteome and additional essential functions not yet recognized among the poorly annotated components of the syn3.0 and syn3A proteomes.
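A power-law degree distribution, as mentioned in this abstract, appears as a roughly straight line with negative slope when node counts are plotted against degree on log-log axes. The sketch below is a rough illustration on a made-up edge list, not the authors' analysis; `degree_counts`, `loglog_slope`, and `toy_edges` are hypothetical names, and a least-squares fit on log-log axes is only a crude estimator (maximum-likelihood fitting is generally preferred for real data).

```python
import math
from collections import Counter


def degree_counts(edges):
    """Map each degree k to the number of nodes that have that degree."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())


def loglog_slope(counts):
    """Least-squares slope of log(node count) against log(degree)."""
    xs = [math.log(k) for k in counts]
    ys = [math.log(n) for n in counts.values()]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


# Toy network (hypothetical): one hub, a few mid-degree nodes, several leaves.
toy_edges = [("hub", f"p{i}") for i in range(8)]
toy_edges += [("p0", "p1"), ("p0", "p2"), ("p1", "p3")]

counts = degree_counts(toy_edges)
print(sorted(counts.items()))
print(loglog_slope(counts))
```

A negative slope here only shows that low-degree nodes outnumber hubs in the toy data; it does not by itself establish scale-free behaviour.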
Affiliation(s)
- Chengxin Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Wei Zheng
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
- Micah Cheng
  - Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan 48109, United States
- Gilbert S Omenn
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Departments of Internal Medicine and Human Genetics and School of Public Health, University of Michigan, Ann Arbor, Michigan 48109, United States
- Peter L Freddolino
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
- Yang Zhang
  - Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan 48109, United States
  - Department of Biological Chemistry, University of Michigan, Ann Arbor, Michigan 48109, United States
3
Wu PIF, Ross C, Siegele DA, Hu JC. Insights from the reanalysis of high-throughput chemical genomics data for Escherichia coli K-12. G3-GENES GENOMES GENETICS 2021; 11:6044125. [PMID: 33561236 PMCID: PMC8022724 DOI: 10.1093/g3journal/jkaa035] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/14/2020] [Accepted: 11/11/2020] [Indexed: 11/14/2022]
Abstract
Despite the demonstrated success of genome-wide genetic screens and chemical genomics studies at predicting functions for genes of unknown function or predicting new functions for well-characterized genes, their potential to provide insights into gene function has not been fully explored. We systematically reanalyzed a published high-throughput phenotypic dataset for the model Gram-negative bacterium Escherichia coli K-12. The availability of high-quality annotation sets allowed us to compare the power of different metrics for measuring phenotypic profile similarity to correctly infer gene function. We conclude that there is no single best method; the three metrics tested gave comparable results for most gene pairs. We also assessed how converting quantitative phenotypes to discrete, qualitative phenotypes affected the association between phenotype and function. Our results indicate that this approach may allow phenotypic data from different studies to be combined to produce a larger dataset that may reveal functional connections between genes not detected in individual studies.
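To make the idea of phenotypic profile similarity concrete, the sketch below compares two made-up fitness-score profiles and then discretizes them. The abstract does not name the three metrics that were tested, so Pearson correlation and cosine similarity are used here purely as assumed examples; the threshold of 2.0 and the gene profiles are also hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation between two equal-length score profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def cosine(x, y):
    """Cosine similarity between two equal-length score profiles."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def discretize(profile, threshold=2.0):
    """Convert quantitative scores to -1 (sensitive), 0 (neutral), +1 (resistant)."""
    return [1 if s >= threshold else -1 if s <= -threshold else 0 for s in profile]


# Hypothetical fitness scores for two genes across five conditions.
gene_a = [-3.1, 0.2, 2.5, -0.4, -2.8]
gene_b = [-2.7, 0.1, 2.9, 0.3, -3.3]

print(pearson(gene_a, gene_b), cosine(gene_a, gene_b))
print(discretize(gene_a), discretize(gene_b))
```

In this toy case the two genes have different quantitative scores but identical discrete profiles, which illustrates why discretized phenotypes might be easier to combine across studies that use different scoring scales.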
Affiliation(s)
- Peter I-Fan Wu
  - Department of Biochemistry and Biophysics, Texas A&M University and Texas Agrilife Research, College Station, TX 77843-2128, USA
- Curtis Ross
  - Department of Biochemistry and Biophysics, Texas A&M University and Texas Agrilife Research, College Station, TX 77843-2128, USA
- Deborah A Siegele
  - Department of Biology, Texas A&M University, College Station, TX 77843-3258, USA
- James C Hu
  - Department of Biochemistry and Biophysics, Texas A&M University and Texas Agrilife Research, College Station, TX 77843-2128, USA
4
R L Morlighem JÉ, Huang C, Liao Q, Braga Gomes P, Daniel Pérez C, de Brandão Prieto-da-Silva ÁR, Ming-Yuen Lee S, Rádis-Baptista G. The Holo-Transcriptome of the Zoantharian Protopalythoa variabilis (Cnidaria: Anthozoa): A Plentiful Source of Enzymes for Potential Application in Green Chemistry, Industrial and Pharmaceutical Biotechnology. Mar Drugs 2018; 16:E207. [PMID: 29899267 PMCID: PMC6025448 DOI: 10.3390/md16060207] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 05/01/2018] [Revised: 06/05/2018] [Accepted: 06/08/2018] [Indexed: 02/08/2023] Open
Abstract
Marine invertebrates, such as sponges, tunicates and cnidarians (zoantharians and scleractinian corals), form functional assemblages, known as holobionts, with numerous microbes. This type of species-specific symbiotic association can be a repository of myriad valuable low molecular weight organic compounds, bioactive peptides and enzymes. The zoantharian Protopalythoa variabilis (Cnidaria: Anthozoa) is one such example of a marine holobiont that inhabits the coastal reefs of the tropical Atlantic coast and is an interesting source of secondary metabolites and biologically active polypeptides. In the present study, we analyzed the entire holo-transcriptome of P. variabilis, looking for enzyme precursors expressed in the zoantharian-microbiota assemblage that are potentially useful as industrial biocatalysts and biopharmaceuticals. In addition to hundreds of predicted enzymes that fit into the classes of hydrolases, oxidoreductases and transferases that were found, novel enzyme precursors with multiple activities in single structures and enzymes with incomplete Enzyme Commission numbers were revealed. Our results indicated the predictive expression of thirteen multifunctional enzymes and 694 enzyme sequences with partially characterized activities, distributed in 23 sub-subclasses. These predicted enzyme structures and activities can prospectively be harnessed for applications in diverse areas of industrial and pharmaceutical biotechnology.
Affiliation(s)
- Jean-Étienne R L Morlighem
  - Northeast Biotechnology Network (RENORBIO), Post-Graduation Program in Biotechnology, Federal University of Ceará, Fortaleza 60440-900, Brazil.
  - Laboratory of Biochemistry and Biotechnology, Institute for Marine Sciences, Federal University of Ceará, Fortaleza 60165-081, Brazil.
- Chen Huang
  - State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, University of Macau, Macau 519020, China.
- Qiwen Liao
  - State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, University of Macau, Macau 519020, China.
- Paula Braga Gomes
  - Department of Biology, Federal Rural University of Pernambuco, Recife 52171-900, Brazil.
- Carlos Daniel Pérez
  - Academic Center in Vitória, Federal University of Pernambuco, Vitória de Santo Antão 50670-901, Pernambuco, Brazil.
- Simon Ming-Yuen Lee
  - State Key Laboratory of Quality Research in Chinese Medicine and Institute of Chinese Medical Sciences, University of Macau, Macau 519020, China.
- Gandhi Rádis-Baptista
  - Northeast Biotechnology Network (RENORBIO), Post-Graduation Program in Biotechnology, Federal University of Ceará, Fortaleza 60440-900, Brazil.
  - Laboratory of Biochemistry and Biotechnology, Institute for Marine Sciences, Federal University of Ceará, Fortaleza 60165-081, Brazil.
5
Cheng L, Jiang Y, Ju H, Sun J, Peng J, Zhou M, Hu Y. InfAcrOnt: calculating cross-ontology term similarities using information flow by a random walk. BMC Genomics 2018; 19:919. [PMID: 29363423 PMCID: PMC5780854 DOI: 10.1186/s12864-017-4338-6] [Citation(s) in RCA: 72] [Impact Index Per Article: 12.0] [Indexed: 02/08/2023] Open
Abstract
Background Since the establishment of the first biomedical ontology, the Gene Ontology (GO), the number of biomedical ontologies has increased dramatically. Over 300 ontologies have now been built, including the extensively used Disease Ontology (DO) and Human Phenotype Ontology (HPO). Because it can identify novel relationships between terms, calculating similarity between ontology terms is one of the major tasks in this research area. Though similarities between terms within each ontology have been studied with in silico methods, term similarities across different ontologies have not been investigated as deeply. The latest method took advantage of a gene functional interaction network (GFIN) to explore such inter-ontology similarities of terms. However, it only used gene interactions and failed to make full use of the connectivity among gene nodes of the network. In addition, all existing methods are designed specifically for GO, and their performance on the extended ontology community remains unknown. Results We proposed a method, InfAcrOnt, to infer similarities between terms across ontologies utilizing the entire GFIN. InfAcrOnt builds a term-gene-gene network that comprises ontology annotations and the GFIN, and acquires similarities between terms across ontologies by modeling the information flow within the network with a random walk. In our benchmark experiments on sub-ontologies of GO, InfAcrOnt achieves a high average area under the receiver operating characteristic curve (AUC) (0.9322 and 0.9309) and low standard deviations (1.8746e-6 and 3.0977e-6) in both human and yeast benchmark datasets, exhibiting superior performance. Meanwhile, comparisons of InfAcrOnt results and prior knowledge on pair-wise DO-HPO terms and pair-wise DO-GO terms show high correlations. Conclusions The experimental results show that InfAcrOnt significantly improves the performance of inferring similarities between terms across ontologies in the benchmark set. 
Electronic supplementary material The online version of this article (10.1186/s12864-017-4338-6) contains supplementary material, which is available to authorized users.
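The random-walk idea described in this abstract can be illustrated with a generic random walk with restart on a small term-gene-gene graph. This is a minimal sketch under stated assumptions, not the InfAcrOnt algorithm itself: the restart probability, the iteration count, the undirected unweighted graph, and all term and gene names (`DO:diseaseX`, `HPO:phenoY`, `g1`, and so on) are hypothetical.

```python
def build_network(annotations, interactions):
    """Combine term-gene annotations and gene-gene edges into one undirected graph."""
    adjacency = {}

    def link(a, b):
        adjacency.setdefault(a, set()).add(b)
        adjacency.setdefault(b, set()).add(a)

    for term, genes in annotations.items():
        for gene in genes:
            link(term, gene)
    for gene_a, gene_b in interactions:
        link(gene_a, gene_b)
    return adjacency


def random_walk_with_restart(adjacency, start, restart=0.3, iterations=100):
    """Iterate the walk from `start`; return the visiting probability of every node."""
    prob = {node: 1.0 if node == start else 0.0 for node in adjacency}
    for _ in range(iterations):
        # Each step, a fraction `restart` of the mass jumps back to the start node.
        nxt = {node: (restart if node == start else 0.0) for node in adjacency}
        for node, neighbours in adjacency.items():
            share = (1.0 - restart) * prob[node] / len(neighbours)
            for neighbour in neighbours:
                nxt[neighbour] += share
        prob = nxt
    return prob


# Hypothetical toy data: one disease term, two phenotype terms, five genes.
annotations = {"DO:diseaseX": ["g1", "g2"], "HPO:phenoY": ["g3"], "HPO:phenoZ": ["g5"]}
interactions = [("g2", "g3"), ("g3", "g4"), ("g4", "g5")]

network = build_network(annotations, interactions)
scores = random_walk_with_restart(network, "DO:diseaseX")
print(scores["HPO:phenoY"], scores["HPO:phenoZ"])
```

In this toy graph the phenotype term whose gene interacts directly with a disease gene receives more probability mass than the term reached only through a longer gene chain, which is the intuition behind using connectivity among gene nodes rather than direct annotations alone.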
Affiliation(s)
- Liang Cheng
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, People's Republic of China
- Yue Jiang
  - Hospital for Sick Children, Toronto, M5G 1X8, Canada
- Hong Ju
  - Department of Information Engineering, Heilongjiang Biological Science and Technology Career Academy, Harbin, 150081, People's Republic of China
- Jie Sun
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, People's Republic of China
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xian, 710072, People's Republic of China
- Meng Zhou
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, 150081, People's Republic of China
- Yang Hu
  - School of Life Science and Technology, Harbin Institute of Technology, Harbin, 150088, People's Republic of China
6
Peng J, Wang H, Lu J, Hui W, Wang Y, Shang X. Identifying term relations cross different gene ontology categories. BMC Bioinformatics 2017; 18:573. [PMID: 29297309 PMCID: PMC5751813 DOI: 10.1186/s12859-017-1959-3] [Citation(s) in RCA: 39] [Impact Index Per Article: 5.6] [Indexed: 11/24/2022] Open
Abstract
Background The Gene Ontology (GO) is a community-based bioinformatics resource that employs ontologies to represent biological knowledge and describes information about gene and gene product function. GO includes three independent categories: molecular function, biological process and cellular component. For better biological reasoning, identifying the biological relationships between terms in different categories is important. However, the existing measures of similarity between terms in different categories either were developed using the GO data only or use only part of the combined gene co-function network information. Results We propose an iterative ranking-based method called CroGO2 to measure cross-category GO term similarities by incorporating level information of GO terms with both direct and indirect interactions in the gene co-function network. Conclusions The evaluation test shows that CroGO2 performs better than the existing methods. A genome-specific term association network for yeast is also generated by connecting terms with high confidence scores. The linkages in the term association network are supported by the literature. Given a gene set, the related terms identified using the association network overlap with the related terms identified by GO enrichment analysis.
Affiliation(s)
- Jiajie Peng
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Honggang Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Junya Lu
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Weiwei Hui
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
- Yadong Wang
  - School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
- Xuequn Shang
  - School of Computer Science, Northwestern Polytechnical University, Xi'an, China
7
Abstract
The Gene Ontology (GO) (Ashburner et al., Nat Genet 25(1):25-29, 2000) is a powerful tool in the informatics arsenal of methods for evaluating annotations in a protein dataset. Identifying the nearest well-annotated homologue of a protein of interest, predicting where misannotation has occurred, and knowing how confident you can be in the annotations assigned to those proteins are all critical. In this chapter we explore what makes an enzyme unique and how we can use GO to infer aspects of protein function based on sequence similarity. These can range from identification of misannotation or other errors in a predicted function to accurate function prediction for an enzyme of entirely unknown function. Although GO annotation applies to any gene product, we focus here on describing our approach for hierarchical classification of enzymes in the Structure-Function Linkage Database (SFLD) (Akiva et al., Nucleic Acids Res 42(Database issue):D521-530, 2014) as a guide for informed utilisation of annotation transfer based on GO terms.
Affiliation(s)
- Gemma L Holliday
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
- Rebecca Davidson
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
- Eyal Akiva
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
- Patricia C Babbitt
  - Department of Bioengineering and Therapeutic Sciences, University of California San Francisco, 1700 4th Street, San Francisco, CA, 94158, USA
8
Eyres I, Boschetti C, Crisp A, Smith TP, Fontaneto D, Tunnacliffe A, Barraclough TG. Horizontal gene transfer in bdelloid rotifers is ancient, ongoing and more frequent in species from desiccating habitats. BMC Biol 2015; 13:90. [PMID: 26537913 PMCID: PMC4632278 DOI: 10.1186/s12915-015-0202-9] [Citation(s) in RCA: 52] [Impact Index Per Article: 5.8] [Received: 08/21/2015] [Accepted: 10/20/2015] [Indexed: 11/26/2022] Open
Abstract
Background Although prevalent in prokaryotes, horizontal gene transfer (HGT) is rarer in multicellular eukaryotes. Bdelloid rotifers are microscopic animals that contain a higher proportion of horizontally transferred, non-metazoan genes in their genomes than typical of animals. It has been hypothesized that bdelloids incorporate foreign DNA when they repair their chromosomes following double-strand breaks caused by desiccation. HGT might thereby contribute to species divergence and adaptation, as in prokaryotes. If so, we expect that species should differ in their complement of foreign genes, rather than sharing the same set of foreign genes inherited from a common ancestor. Furthermore, there should be more foreign genes in species that desiccate more frequently. We tested these hypotheses by surveying HGT in four congeneric species of bdelloids from different habitats: two from permanent aquatic habitats and two from temporary aquatic habitats that desiccate regularly. Results Transcriptomes of all four species contain many genes with a closer match to non-metazoan genes than to metazoan genes. Whole genome sequencing of one species confirmed the presence of these foreign genes in the genome. Nearly half of foreign genes are shared between all four species and an outgroup from another family, but many hundreds are unique to particular species, which indicates that HGT is ongoing. Using a dated phylogeny, we estimate an average of 12.8 gains versus 2.0 losses of foreign genes per million years. Consistent with the desiccation hypothesis, the level of HGT is higher in the species that experience regular desiccation events than those that do not. However, HGT still contributed hundreds of foreign genes to the species from permanently aquatic habitats. Foreign genes were mainly enzymes with various annotated functions that include catabolism of complex polysaccharides and stress responses. 
We found evidence of differential loss of ancestral foreign genes previously associated with desiccation protection in the two non-desiccating species. Conclusions Nearly half of foreign genes were acquired before the divergence of bdelloid families over 60 Mya. Nonetheless, HGT is ongoing in bdelloids and has contributed to putative functional differences among species. Variation among our study species is consistent with the hypothesis that desiccating habitats promote HGT. Electronic supplementary material The online version of this article (doi:10.1186/s12915-015-0202-9) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Isobel Eyres
  - Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot, SL5 7PY, UK
  - Department of Animal and Plant Sciences, University of Sheffield, Alfred Denny Building, Western Bank, Sheffield, S10 2TN, UK
- Chiara Boschetti
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, New Museums Site, Pembroke Street, Cambridge, CB2 3RA, UK
- Alastair Crisp
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, New Museums Site, Pembroke Street, Cambridge, CB2 3RA, UK
- Thomas P Smith
  - Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot, SL5 7PY, UK
- Diego Fontaneto
  - National Research Council, Institute of Ecosystem Study, Largo Tonolli 50, 28922, Verbania Pallanza, Italy
- Alan Tunnacliffe
  - Department of Chemical Engineering and Biotechnology, University of Cambridge, New Museums Site, Pembroke Street, Cambridge, CB2 3RA, UK
- Timothy G Barraclough
  - Department of Life Sciences, Imperial College London, Silwood Park Campus, Ascot, SL5 7PY, UK
9
Wang T, Mori H, Zhang C, Kurokawa K, Xing XH, Yamada T. DomSign: a top-down annotation pipeline to enlarge enzyme space in the protein universe. BMC Bioinformatics 2015; 16:96. [PMID: 25888481 PMCID: PMC4389672 DOI: 10.1186/s12859-015-0499-y] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Received: 10/07/2014] [Accepted: 02/18/2015] [Indexed: 12/27/2022] Open
Abstract
Background Computational predictions of catalytic function are vital for in-depth understanding of enzymes. Because several novel approaches performing better than the common BLAST tool are rarely applied in research, we hypothesized that there is a large gap between the number of known annotated enzymes and the actual number in the protein universe, which significantly limits our ability to extract additional biologically relevant functional information from the available sequencing data. To reliably expand the enzyme space, we developed DomSign, a highly accurate domain signature–based enzyme functional prediction tool to assign Enzyme Commission (EC) digits. Results DomSign is a top-down prediction engine that yields results comparable, or superior, to those from many benchmark EC number prediction tools, including BLASTP, when a homolog with an identity >30% is not available in the database. Performance tests showed that DomSign is a highly reliable enzyme EC number annotation tool. After multiple tests, the accuracy is thought to be greater than 90%. Thus, DomSign can be applied to large-scale datasets, with the goal of expanding the enzyme space with high fidelity. Using DomSign, we successfully increased the percentage of EC-tagged enzymes from 12% to 30% in UniProt-TrEMBL. In the Kyoto Encyclopedia of Genes and Genomes bacterial database, the percentage of EC-tagged enzymes for each bacterial genome could be increased from 26.0% to 33.2% on average. Metagenomic mining was also efficient, as exemplified by the application of DomSign to the Human Microbiome Project dataset, recovering nearly one million new EC-labeled enzymes. Conclusions Our results offer preliminary confirmation of the existence of the hypothesized huge number of “hidden enzymes” in the protein universe, the identification of which could substantially further our understanding of the metabolisms of diverse organisms and also facilitate bioengineering by providing a richer enzyme resource. 
Furthermore, our results highlight the necessity of using more advanced computational tools than BLAST in protein database annotations to extract additional biologically relevant functional information from the available biological sequences. Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0499-y) contains supplementary material, which is available to authorized users.
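A top-down, domain-signature-based EC assignment of the kind this abstract describes can be sketched as follows. This is an illustration under assumptions, not DomSign's actual procedure: the rule used here (extend the EC prefix one digit at a time while the dominant prefix is shared by at least a threshold fraction of reference proteins with the same signature), the 0.75 threshold, the domain names, and the reference table are all hypothetical.

```python
from collections import Counter


def predict_ec(query_signature, reference, threshold=0.75):
    """Extend an EC prefix digit by digit while it stays dominant among matches.

    `reference` is a list of (domain_signature, ec_number) pairs. Returns a
    possibly partial EC number such as "1.1.1.-", or None when no reference
    protein shares the query signature or no first digit is dominant.
    """
    matches = [ec.split(".") for sig, ec in reference if sig == query_signature]
    total = len(matches)
    prefix = []
    for level in range(4):
        if not matches:
            break
        digit, count = Counter(m[level] for m in matches).most_common(1)[0]
        # Stop descending once the best prefix covers too few of the matches.
        if count / total < threshold:
            break
        prefix.append(digit)
        matches = [m for m in matches if m[level] == digit]
    if not prefix:
        return None
    return ".".join(prefix + ["-"] * (4 - len(prefix)))


# Hypothetical reference set: five proteins sharing one made-up domain signature.
sig_a = frozenset({"DOM_A"})
reference = [
    (sig_a, "1.1.1.100"),
    (sig_a, "1.1.1.184"),
    (sig_a, "1.1.1.100"),
    (sig_a, "1.1.1.100"),
    (sig_a, "1.3.1.9"),
]

print(predict_ec(sig_a, reference))
print(predict_ec(frozenset({"DOM_X"}), reference))
```

Returning a partial EC number when the deeper digits disagree is the design point of a top-down scheme: the prediction is only as specific as the reference data support.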
Affiliation(s)
- Tianmin Wang
  - Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
  - Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
- Hiroshi Mori
  - Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
  - Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-E3-10 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
- Chong Zhang
  - Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
- Ken Kurokawa
  - Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
  - Earth-Life Science Institute, Tokyo Institute of Technology, 2-12-1-E3-10 Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
- Xin-Hui Xing
  - Department of Chemical Engineering, Tsinghua University, Beijing, 100084, China.
- Takuji Yamada
  - Department of Biological Information, Graduate School of Bioscience and Biotechnology, Tokyo Institute of Technology, 2-12-1 M6-3, Ookayama, Meguro-ku, Tokyo, 152-8550, Japan.
10
Eksi R, Li HD, Menon R, Wen Y, Omenn GS, Kretzler M, Guan Y. Systematically differentiating functions for alternatively spliced isoforms through integrating RNA-seq data. PLoS Comput Biol 2013; 9:e1003314. [PMID: 24244129 PMCID: PMC3820534 DOI: 10.1371/journal.pcbi.1003314] [Citation(s) in RCA: 68] [Impact Index Per Article: 6.2] [Received: 02/18/2013] [Accepted: 09/19/2013] [Indexed: 12/13/2022] Open
Abstract
Integrating large-scale functional genomic data has significantly accelerated our understanding of gene functions. However, no algorithm has been developed to differentiate functions for isoforms of the same gene using high-throughput genomic data. This is because standard supervised learning requires ‘ground-truth’ functional annotations, which are lacking at the isoform level. To address this challenge, we developed a generic framework that interrogates public RNA-seq data at the transcript level to differentiate functions for alternatively spliced isoforms. For a specific function, our algorithm identifies the ‘responsible’ isoform(s) of a gene and generates classifying models at the isoform level instead of at the gene level. Through cross-validation, we demonstrated that our algorithm is effective in assigning functions to genes, especially the ones with multiple isoforms, and robust to gene expression levels and removal of homologous gene pairs. We identified genes in the mouse whose isoforms are predicted to have disparate functionalities and experimentally validated the ‘responsible’ isoforms using data from mammary tissue. With protein structure modeling and experimental evidence, we further validated the predicted isoform functional differences for the genes Cdkn2a and Anxa6. Our generic framework is the first to predict and differentiate functions for alternatively spliced isoforms, instead of genes, using genomic data. It is extendable to any base machine learner and other species with alternatively spliced isoforms, and shifts the current gene-centered function prediction to isoform-level predictions. In mammalian genomes, a single gene can be alternatively spliced into multiple isoforms which greatly increase the functional diversity of the genome. In the human, more than 95% of multi-exon genes undergo alternative splicing. 
It is hard to computationally differentiate the functions for the splice isoforms of the same gene, because they are almost always annotated with the same functions and share similar sequences. In this paper, we developed a generic framework to identify the ‘responsible’ isoform(s) for each function that the gene carries out, and therefore predict functional assignment on the isoform level instead of on the gene level. Within this generic framework, we implemented and evaluated several related algorithms for isoform function prediction. We tested these algorithms through both computational evaluation and experimental validation of the predicted ‘responsible’ isoform(s) and the predicted disparate functions of the isoforms of Cdkn2a and of Anxa6. Our algorithm represents the first effort to predict and differentiate isoforms through large-scale genomic data integration.
Affiliation(s)
- Ridvan Eksi
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Hong-Dong Li
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Rajasree Menon
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Yuchen Wen
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Gilbert S. Omenn
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail: (GSO); (MK); (YG)
- Matthias Kretzler
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail: (GSO); (MK); (YG)
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, Michigan, United States of America
- * E-mail: (GSO); (MK); (YG)
|
11
|
Hill DP, Adams N, Bada M, Batchelor C, Berardini TZ, Dietze H, Drabkin HJ, Ennis M, Foulger RE, Harris MA, Hastings J, Kale NS, de Matos P, Mungall CJ, Owen G, Roncaglia P, Steinbeck C, Turner S, Lomax J. Dovetailing biology and chemistry: integrating the Gene Ontology with the ChEBI chemical ontology. BMC Genomics 2013; 14:513. [PMID: 23895341 PMCID: PMC3733925 DOI: 10.1186/1471-2164-14-513] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.5] [Received: 03/07/2013] [Accepted: 07/23/2013] [Indexed: 11/30/2022]
Abstract
Background: The Gene Ontology (GO) facilitates the description of the action of gene products in a biological context. Many GO terms refer to chemical entities that participate in biological processes. To facilitate accurate and consistent systems-wide biological representation, it is necessary to integrate the chemical view of these entities with the biological view of GO functions and processes. We describe a collaborative effort between the GO and the Chemical Entities of Biological Interest (ChEBI) ontology developers to ensure that the representation of chemicals in the GO is both internally consistent and in alignment with the chemical expertise captured in ChEBI. Results: We have examined and integrated the ChEBI structural hierarchy into the GO resource through computationally-assisted manual curation of both GO and ChEBI. Our work has resulted in the creation of computable definitions of GO terms that contain fully defined semantic relationships to corresponding chemical terms in ChEBI. Conclusions: The set of logical definitions using both the GO and ChEBI has already been used to automate aspects of GO development and has the potential to allow the integration of data across the domains of biology and chemistry. These logical definitions are available as an extended version of the ontology from http://purl.obolibrary.org/obo/go/extensions/go-plus.owl.
Affiliation(s)
- David P Hill
- Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME 04609, USA.
|
12
|
Abstract
Here we assessed the use of domain families for predicting the functions of whole proteins. These 'functional families' (FunFams) were derived using a protocol that combines sequence clustering with supervised cluster evaluation, relying on available high-quality Gene Ontology (GO) annotation data in the latter step. In essence, the protocol groups domain sequences belonging to the same superfamily into families based on the GO annotations of their parent proteins. An initial test based on enzyme sequences confirmed that the FunFams resemble enzyme (domain) families much better than do families produced by sequence clustering alone. For the CAFA 2011 experiment, we further associated the FunFams with GO terms probabilistically. All target proteins were first submitted to domain superfamily assignment, followed by FunFam assignment and, eventually, function assignment. The latter included an integration step for multi-domain target proteins. The CAFA results put our domain-based approach among the top ten of 31 competing groups and 56 prediction methods, confirming that it outperforms simple pairwise whole-protein sequence comparisons.
Affiliation(s)
- Robert Rentzsch
- Robert Koch Institut, Research Group Bioinformatics Ng4, Nordufer 20, 13353 Berlin, Germany.
|
13
|
Peng J, Chen J, Wang Y. Identifying cross-category relations in gene ontology and constructing genome-specific term association networks. BMC Bioinformatics 2013; 14 Suppl 2:S15. [PMID: 23368677 PMCID: PMC3549802 DOI: 10.1186/1471-2105-14-s2-s15] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.9] [Indexed: 11/13/2022]
Abstract
Background: Gene Ontology (GO) has been widely used in biological databases, annotation projects, and computational analyses. Although the three GO categories are structured as independent ontologies, the biological relationships across the categories are not negligible for biological reasoning and knowledge integration. However, the existing cross-category ontology term similarity measures are either developed by utilizing the GO data only or based on manually curated term name similarities, ignoring the fact that GO is evolving quickly and the gene annotations are far from complete. Results: In this paper we introduce a new cross-category similarity measurement called CroGO by incorporating genome-specific gene co-function network data. The performance study showed that our measurement outperforms the existing algorithms. We also generated genome-specific term association networks for yeast and human. An enrichment-based test showed our networks are better than those generated by the other measures. Conclusions: The genome-specific term association networks constructed using CroGO provided a platform to enable a more consistent use of GO. In the networks, the frequently occurring MF-centered hubs indicate that a molecular function may be shared by different genes in multiple biological processes, or that a set of genes with the same functions may participate in distinct biological processes. Common subgraphs across multiple organisms also revealed conserved GO term relationships. Software and data are available online at http://www.msu.edu/~jinchen/CroGO.
Affiliation(s)
- Jiajie Peng
- School of Computer Science and Technology, Harbin Institute of Technology, Harbin, China
|
14
|
Pethica RB, Levitt M, Gough J. Evolutionarily consistent families in SCOP: sequence, structure and function. BMC Structural Biology 2012; 12:27. [PMID: 23078280 PMCID: PMC3495643 DOI: 10.1186/1472-6807-12-27] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 04/01/2012] [Accepted: 10/03/2012] [Indexed: 11/10/2022]
Abstract
Background: SCOP is a hierarchical domain classification system for proteins of known structure. The superfamily level has a clear definition: protein domains belong to the same superfamily if there is structural, functional and sequence evidence for a common evolutionary ancestor. Superfamilies are sub-classified into families; however, there is no such clear basis for the family-level groupings. Do SCOP families group together domains by sequence similarity, by similar structure, or by common function? We answer these questions and, most importantly, ask whether each family represents a distinct phylogenetic group within a superfamily. Results: Several phylogenetic trees were generated for each superfamily: one derived from a multiple sequence alignment, one based on structural distances, and the final two from presence/absence of GO terms or EC numbers assigned to domains. The topologies of the resulting trees and confidence values were compared to the SCOP family classification. Conclusions: We show that SCOP family groupings are evolutionarily consistent to a very high degree with respect to classical sequence phylogenetics. The trees built from (automatically generated) structural distances correlate well, but are not always consistent with SCOP (hand annotated) groupings. Trees derived from functional data are less consistent with the family level than those from structure or sequence, though the majority still agree. Much of GO and EC annotation applies directly to one family or subset of the family; relatively few terms apply at the superfamily level. Maximum sequence diversity within a family is on average 22% but close to zero for superfamilies.
Affiliation(s)
- Ralph B Pethica
- Department of Computer Science, University of Bristol, The Merchant Venturers Building, Room 3,16, Woodland Road, Bristol, UK.
|
15
|
Guan Y, Gorenshteyn D, Burmeister M, Wong AK, Schimenti JC, Handel MA, Bult CJ, Hibbs MA, Troyanskaya OG. Tissue-specific functional networks for prioritizing phenotype and disease genes. PLoS Comput Biol 2012; 8:e1002694. [PMID: 23028291 PMCID: PMC3459891 DOI: 10.1371/journal.pcbi.1002694] [Citation(s) in RCA: 87] [Impact Index Per Article: 7.3] [Received: 02/01/2012] [Accepted: 08/02/2012] [Indexed: 12/16/2022]
Abstract
Integrated analyses of functional genomics data have enormous potential for identifying phenotype-associated genes. Tissue-specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. Accounting for tissue specificity in global integration of functional genomics data is challenging, as “functionality” and “functional relationships” are often not resolved for specific tissue types. We address this challenge by generating tissue-specific functional networks, which can effectively represent the diversity of protein function for more accurate identification of phenotype-associated genes in the laboratory mouse. Specifically, we created 107 tissue-specific functional relationship networks through integration of genomic data utilizing knowledge of tissue-specific gene expression patterns. Cross-network comparison revealed significantly changed genes enriched for functions related to specific tissue development. We then utilized these tissue-specific networks to predict genes associated with different phenotypes. Our results demonstrate that prediction performance is significantly improved through using the tissue-specific networks as compared to the global functional network. We used a testis-specific functional relationship network to predict genes associated with male fertility and spermatogenesis phenotypes, and experimentally confirmed one top prediction, Mybl1. We then focused on a less-common genetic disease, ataxia, and identified candidates uniquely predicted by the cerebellum network, which are supported by both literature and experimental evidence. Our systems-level, tissue-specific scheme advances over traditional global integration and analyses and establishes a prototype to address the tissue-specific effects of genetic perturbations, diseases and drugs.
Tissue specificity is an important aspect of many genetic diseases, reflecting the potentially different roles of proteins and pathways in diverse cell lineages. We propose an effective strategy to model tissue-specific functional relationship networks in the laboratory mouse. We integrated large-scale genomics datasets as well as low-throughput tissue-specific expression profiles to estimate the probability that two proteins are co-functioning in the tissue under study. These networks can accurately reflect the diversity of protein functions across different organs and tissue compartments. By computationally exploring the tissue-specific networks, we can accurately predict novel phenotype-related gene candidates. We experimentally confirmed that a top candidate gene, Mybl1, predicted from male reproductive system-specific networks, affects several male fertility phenotypes. We also predicted candidates related to a rare genetic disease, ataxia, which are supported by experimental and literature evidence. The above results demonstrate the power of modeling tissue-specific dynamics of co-functionality through computational approaches.
Affiliation(s)
- Yuanfang Guan
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Department of Internal Medicine, University of Michigan, Ann Arbor, Michigan, United States of America
- Dmitriy Gorenshteyn
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Margit Burmeister
- Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, Michigan, United States of America
- Molecular & Behavioral Neuroscience Institution, Department of Psychiatry, and Department of Human Genetics, University of Michigan, Ann Arbor, Michigan, United States of America
- Aaron K. Wong
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- John C. Schimenti
- Department of Biomedical Sciences, College of Veterinary Medicine, Cornell University, Ithaca, New York, United States of America
- Mary Ann Handel
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Carol J. Bult
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Matthew A. Hibbs
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Trinity University, Computer Science Department, San Antonio, Texas, United States of America
- * E-mail: (MAH); (OGT)
- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail: (MAH); (OGT)
|
16
|
Škunca N, Altenhoff A, Dessimoz C. Quality of computationally inferred gene ontology annotations. PLoS Comput Biol 2012; 8:e1002533. [PMID: 22693439 PMCID: PMC3364937 DOI: 10.1371/journal.pcbi.1002533] [Citation(s) in RCA: 97] [Impact Index Per Article: 8.1] [Received: 10/17/2011] [Accepted: 04/01/2012] [Indexed: 01/10/2023]
Abstract
Gene Ontology (GO) has established itself as the undisputed standard for protein function annotation. Most annotations are inferred electronically, i.e. without individual curator supervision, but they are widely considered unreliable. At the same time, we crucially depend on those automated annotations, as most newly sequenced genomes are non-model organisms. Here, we introduce a methodology to systematically and quantitatively evaluate electronic annotations. By exploiting changes in successive releases of the UniProt Gene Ontology Annotation database, we assessed the quality of electronic annotations in terms of specificity, reliability, and coverage. Overall, we not only found that electronic annotations have significantly improved in recent years, but also that their reliability now rivals that of annotations inferred by curators when they use evidence other than experiments from primary literature. This work provides the means to identify the subset of electronic annotations that can be relied upon, an important outcome given that >98% of all annotations are inferred without direct curation.
In the UniProt Gene Ontology Annotation database, the largest repository of functional annotations, over 98% of all function annotations are inferred in silico, without curator oversight. Yet these “electronic GO annotations” are generally perceived as unreliable; they are disregarded in many studies. In this article, we introduce novel methodology to systematically evaluate the quality of electronic annotations. We then provide the first comprehensive assessment of the reliability of electronic GO annotations. Overall, we found that electronic annotations are more reliable than generally believed, to an extent that they are competitive with annotations inferred by curators when they use evidence other than experiments from primary literature. But we also report significant variations among inference methods, types of annotations, and organisms. This work provides guidance for Gene Ontology users and lays the foundations for improving computational approaches to GO function inference.
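The release-comparison idea in this abstract can be sketched as a set operation. The example below is a simplified illustration under stated assumptions, not the paper's actual metric definitions: it counts an electronic annotation from an older release as confirmed only if the identical (protein, GO term) pair later appears with experimental evidence, and it ignores the GO hierarchy and any notion of rejected annotations. The function name `reliability_and_coverage` and the toy identifiers are hypothetical.

```python
from typing import Set, Tuple

# An annotation is modeled as a (protein_id, go_term) pair.
Annotation = Tuple[str, str]


def reliability_and_coverage(
    old_electronic: Set[Annotation], new_experimental: Set[Annotation]
) -> Tuple[float, float]:
    """Compare electronic annotations in an older release with experimental
    annotations in a later release.

    reliability: fraction of old electronic annotations later confirmed.
    coverage: fraction of new experimental annotations that had been
              predicted electronically.
    """
    confirmed = old_electronic & new_experimental
    reliability = len(confirmed) / len(old_electronic) if old_electronic else 0.0
    coverage = len(confirmed) / len(new_experimental) if new_experimental else 0.0
    return reliability, coverage


# Toy releases with made-up identifiers.
old_release = {("P1", "GO:A"), ("P2", "GO:B"), ("P3", "GO:C"), ("P4", "GO:D")}
new_release = {("P1", "GO:A"), ("P2", "GO:B"), ("P5", "GO:E")}
reliability, coverage = reliability_and_coverage(old_release, new_release)
```

A more faithful version would also credit an electronic annotation when the experimental term is a descendant of the predicted term in the ontology.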
Affiliation(s)
- Nives Škunca
- Ruđer Bošković Institute, Division of Electronics, Zagreb, Croatia
- ETH Zurich, Computer Science, Zurich, Switzerland
- Adrian Altenhoff
- ETH Zurich, Computer Science, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- Christophe Dessimoz
- ETH Zurich, Computer Science, Zurich, Switzerland
- Swiss Institute of Bioinformatics, Zurich, Switzerland
- EMBL-European Bioinformatics Institute, Hinxton, Cambridge, United Kingdom
- * E-mail:
|
17
|
Torshin IY. On solvability, regularity, and locality of the problem of genome annotation. Pattern Recognition and Image Analysis 2010. [DOI: 10.1134/s1054661810030156] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 11/23/2022]
|
18
|
Guan Y, Myers CL, Lu R, Lemischka IR, Bult CJ, Troyanskaya OG. A genomewide functional network for the laboratory mouse. PLoS Comput Biol 2008; 4:e1000165. [PMID: 18818725 PMCID: PMC2527685 DOI: 10.1371/journal.pcbi.1000165] [Citation(s) in RCA: 98] [Impact Index Per Article: 6.1] [Received: 04/14/2008] [Accepted: 07/21/2008] [Indexed: 11/19/2022]
Abstract
Establishing a functional network is invaluable to our understanding of gene function, pathways, and systems-level properties of an organism and can be a powerful resource in directing targeted experiments. In this study, we present a functional network for the laboratory mouse based on a Bayesian integration of diverse genetic and functional genomic data. The resulting network includes probabilistic functional linkages among 20,581 protein-coding genes. We show that this network can accurately predict novel functional assignments and network components and present experimental evidence for predictions related to Nanog homeobox (Nanog), a critical gene in mouse embryonic stem cell pluripotency. An analysis of the global topology of the mouse functional network reveals multiple biologically relevant systems-level features of the mouse proteome. Specifically, we identify the clustering coefficient as a critical characteristic of central modulators that affect diverse pathways as well as genes associated with different phenotype traits and diseases. In addition, a cross-species comparison of functional interactomes on a genomic scale revealed distinct functional characteristics of conserved neighborhoods as compared to subnetworks specific to higher organisms. Thus, our global functional network for the laboratory mouse provides the community with a key resource for discovering protein functions and novel pathway components as well as a tool for exploring systems-level topological and evolutionary features of cellular interactomes. To facilitate exploration of this network by the biomedical research community, we illustrate its application in function and disease gene discovery through an interactive, Web-based, publicly available interface at http://mouseNET.princeton.edu.
Functionally related proteins interact in diverse ways to carry out biological processes, and each protein often participates in multiple pathways. Proteins are therefore organized into a complex network through which different functions of the cell are carried out. An accurate description of such a network is invaluable to our understanding of both the system-level features of a cell and those of an individual biological process. In this study, we used a probabilistic model to combine information from diverse genome-scale studies as well as individual investigations to generate a global functional network for mouse. Our analysis of the global topology of this network reveals biologically relevant systems-level characteristics of the mouse proteome, including conservation of functional neighborhoods and network features characteristic of known disease genes and key transcriptional regulators. We have made this network publicly available for search and dynamic exploration by researchers in the community. Our Web interface enables users to easily generate hypotheses regarding potential functional roles of uncharacterized proteins, investigate possible links between their proteins of interest and disease, and identify new players in specific biological processes.
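To make "Bayesian integration" of several datasets into a single probabilistic linkage concrete, here is a minimal sketch. It assumes a naive Bayes model in which each dataset contributes an independent likelihood ratio for a gene pair, so posterior odds equal prior odds multiplied by those ratios. The independence assumption, the function name `posterior_linkage`, and all numbers are illustrative choices and are not taken from the paper.

```python
import math
from typing import Iterable


def posterior_linkage(prior: float, likelihood_ratios: Iterable[float]) -> float:
    """Posterior probability that two genes are functionally linked.

    prior: prior probability of a functional linkage, strictly between 0 and 1.
    likelihood_ratios: one value per dataset,
        P(evidence | linked) / P(evidence | not linked), each > 0.
    Assumes datasets are conditionally independent (naive Bayes).
    """
    if not 0.0 < prior < 1.0:
        raise ValueError("prior must be strictly between 0 and 1")
    log_odds = math.log(prior / (1.0 - prior))
    for lr in likelihood_ratios:
        if lr <= 0.0:
            raise ValueError("likelihood ratios must be positive")
        log_odds += math.log(lr)  # summing logs avoids overflow for many datasets
    return 1.0 / (1.0 + math.exp(-log_odds))


# Toy example: a 10% prior with one dataset that favors linkage 9 to 1
# yields even posterior odds.
p = posterior_linkage(0.1, [9.0])
```

Evidence with a likelihood ratio below 1 lowers the posterior, so uninformative or contradicting datasets are handled by the same formula.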
Affiliation(s)
- Yuanfang Guan
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Chad L. Myers
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- Rong Lu
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Ihor R. Lemischka
- Department of Molecular Biology, Princeton University, Princeton, New Jersey, United States of America
- Carol J. Bult
- The Jackson Laboratory, Bar Harbor, Maine, United States of America
- Olga G. Troyanskaya
- Lewis-Sigler Institute for Integrative Genomics, Carl Icahn Laboratory, Princeton University, Princeton, New Jersey, United States of America
- Department of Computer Science, Princeton University, Princeton, New Jersey, United States of America
- * E-mail:
|
19
|
Guan Y, Myers CL, Hess DC, Barutcuoglu Z, Caudy AA, Troyanskaya OG. Predicting gene function in a hierarchical context with an ensemble of classifiers. Genome Biol 2008; 9 Suppl 1:S3. [PMID: 18613947 PMCID: PMC2447537 DOI: 10.1186/gb-2008-9-s1-s3] [Citation(s) in RCA: 76] [Impact Index Per Article: 4.8] [Indexed: 11/20/2022]
Abstract
Background: The wide availability of genome-scale data for several organisms has stimulated interest in computational approaches to gene function prediction. Diverse machine learning methods have been applied to unicellular organisms with some success, but few have been extensively tested on higher level, multicellular organisms. A recent mouse function prediction project (MouseFunc) brought together nine bioinformatics teams applying a diverse array of methodologies to mount the first large-scale effort to predict gene function in the laboratory mouse. Results: In this paper, we describe our contribution to this project, an ensemble framework based on the support vector machine that integrates diverse datasets in the context of the Gene Ontology hierarchy. We carry out a detailed analysis of the performance of our ensemble and provide insights into which methods work best under a variety of prediction scenarios. In addition, we applied our method to Saccharomyces cerevisiae and have experimentally confirmed functions for a novel mitochondrial protein. Conclusion: Our method consistently performs among the top methods in the MouseFunc evaluation. Furthermore, it exhibits good classification performance across a variety of cellular processes and functions in both a multicellular organism and a unicellular organism, indicating its ability to discover novel biology in diverse settings.
Affiliation(s)
- Yuanfang Guan
- Department of Molecular Biology, Princeton University, Princeton, NJ 08544, USA
|
20
|
Aidinis V, Chandras C, Manoloukos M, Thanassopoulou A, Kranidioti K, Armaka M, Douni E, Kontoyiannis DL, Zouberakis M, Kollias G. MUGEN mouse database; animal models of human immunological diseases. Nucleic Acids Res 2007; 36:D1048-54. [PMID: 17932065 PMCID: PMC2238830 DOI: 10.1093/nar/gkm838] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Indexed: 11/30/2022]
Abstract
The MUGEN mouse database (MMdb) (www.mugen-noe.org/database/) is a database of murine models of immune processes and immunological diseases. Its aim is to share and publicize information on mouse strain characteristics and availability from participating institutions. MMdb's basic classification of models is based on three major research application categories: Models of Human Disease, Models of Immune Processes and Transgenic Tools. Data on mutant strains includes detailed information on affected gene(s), mutant allele(s) and genetic background (DNA origin, gene targeted, host and backcross strain background). Each gene/transgene index also includes IDs and direct links to Ensembl, ArrayExpress, EURExpress and NCBI's Entrez Gene database. Phenotypic description is standardized and hierarchically structured, based on MGI's mammalian phenotypic ontology terms. Availability (e.g. live mice, cryopreserved embryos, sperm and ES cells) is clearly indicated, along with handling and genotyping details (in the form of documents or hyperlinks) and all relevant contact information (including EMMA and Jax/IMSR hyperlinks where available). MMdb's design offers a user-friendly query interface and provides instant access to the list of mutant strains and genes. Database access is free of charge and there are no registration requirements for data querying.
Affiliation(s)
- V Aidinis
- B.S.R.C. Alexander Fleming, 34 Fleming Street, 16672, Vari, Greece.
|
21
|
McCarthy FM, Cooksey AM, Wang N, Bridges SM, Pharr GT, Burgess SC. Modeling a whole organ using proteomics: the avian bursa of Fabricius. Proteomics 2006; 6:2759-71. [PMID: 16596704 DOI: 10.1002/pmic.200500648] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.2] [Indexed: 11/06/2022]
Abstract
While advances in proteomics have improved proteome coverage and enhanced biological modeling, modeling function in multicellular organisms requires understanding how cells interact. Here we used the chicken bursa of Fabricius, a common experimental system for B cell function, to model organ function from proteomics data. The bursa has two major functional cell types: B cells and the supporting stromal cells. We used differential detergent fractionation-multidimensional protein identification technology (DDF-MudPIT) to identify 5198 proteins from all cellular compartments. Of these, 1753 were B cell specific, 1972 were stroma specific and 1473 were shared between the two. By modeling programmed cell death (PCD), cell differentiation and proliferation, and transcriptional activation, we have improved functional annotation of chicken proteins and placed chicken-specific death receptors into the PCD process using phylogenetics. We have identified 114 transcription factors (TFs); 42 of the bursal B cell TFs have not been reported before in any B cells. We have also improved the structural annotation of a newly sequenced genome by confirming the in vivo expression of 4006 "predicted", and 6623 ab initio, ORFs. Finally, we have developed a novel method for facilitating structural annotation, "expressed peptide sequence tags" (ePSTs) and demonstrate its utility by identifying 521 potential novel proteins from the chicken "unassigned chromosome".
Affiliation(s)
- Fiona M McCarthy
- Department of Basic Science, College of Veterinary Medicine, Mississippi State University, Mississippi State, MS 39762-6100, USA.
|
22
|
Abstract
Computational characterization of proteins is a necessary first step in understanding the biologic role of a protein. The composite architecture of mammalian proteins makes the prediction of the biologic role rather difficult. Nevertheless, integration of many different prediction methods allows for a more accurate representation. Information on the 3D structure of a protein improves the reliability of predictions of many features. This article reviews existing methods used to characterize proteins and several tools that provide an integrated access to different types of information. The authors point out the increasing importance of structural constraints and an increasing need to integrate different approaches.
Affiliation(s)
- Jadwiga Bienkowska
- Serono Reproductive Biology Institute, One Technology Pl., Rockland, MA 02370, USA.
|
23
|
The Mammalian Phenotype Ontology as a tool for annotating, analyzing and comparing phenotypic information. Genome Biol 2004; 6:R7. [PMID: 15642099 PMCID: PMC549068 DOI: 10.1186/gb-2004-6-1-r7] [Citation(s) in RCA: 305] [Impact Index Per Article: 15.3] [Received: 09/08/2004] [Revised: 11/15/2004] [Accepted: 11/17/2004] [Indexed: 11/26/2022]
Abstract
The Mammalian Phenotype (MP) Ontology enables robust annotation of mammalian phenotypes in the context of mutations, quantitative trait loci and strains that are used as models of human biology and disease. The MP Ontology supports different levels and richness of phenotypic knowledge and flexible annotations to individual genotypes. It continues to develop dynamically via collaborative input from research groups, mutagenesis consortia, and biological domain experts. The MP Ontology is currently used by the Mouse Genome Database and Rat Genome Database to represent phenotypic data.
|
24
|
Schofield PN, Bard JBL, Booth C, Boniver J, Covelli V, Delvenne P, Ellender M, Engstrom W, Goessner W, Gruenberger M, Hoefler H, Hopewell J, Mancuso M, Mothersill C, Potten CS, Quintanilla-Fend L, Rozell B, Sariola H, Sundberg JP, Ward A. Pathbase: a database of mutant mouse pathology. Nucleic Acids Res 2004; 32:D512-5. [PMID: 14681470 PMCID: PMC308858 DOI: 10.1093/nar/gkh124] [Citation(s) in RCA: 39] [Impact Index Per Article: 2.0] [Indexed: 11/14/2022]
Abstract
Pathbase is a database that stores images of the abnormal histology associated with spontaneous and induced mutations of both embryonic and adult mice including those produced by transgenesis, targeted mutagenesis and chemical mutagenesis. Images of normal mouse histology and strain-dependent background lesions are also available. The database and the images are publicly accessible (http://www.pathbase.net) and linked by anatomical site, gene and other identifiers to relevant databases; there are also facilities for public comment and record annotation. The database is structured around a novel ontology of mouse disorders (MPATH) and provides high-resolution downloadable images of normal and diseased tissues that are searchable through orthogonal ontologies for pathology, developmental stage, anatomy and gene attributes (GO terms), together with controlled vocabularies for type of genetic manipulation or mutation, genotype and free text annotation for mouse strain and additional attributes. The database is actively curated and data records assessed by pathologists in the Pathbase Consortium before publication. The database interface is designed to have optimal browser and platform compatibility and to interact directly with other web-based mouse genetic resources.
Affiliation(s)
- Paul N Schofield
- Department of Anatomy, University of Cambridge, Downing Street, Cambridge CB2 3DY, UK.
|
25
|
Hennig S, Groth D, Lehrach H. Automated Gene Ontology annotation for anonymous sequence data. Nucleic Acids Res 2003; 31:3712-5. [PMID: 12824400 PMCID: PMC168988 DOI: 10.1093/nar/gkg582] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.5] [Indexed: 11/14/2022] Open
Abstract
Gene Ontology (GO) is the most widely accepted attempt to construct a unified and structured vocabulary for the description of genes and their products in any organism. Annotation by GO terms is performed in most of the current genome projects, which, besides generality, has the advantage of being very convenient for computer-based classification methods. However, direct use of GO in small sequencing projects is not easy, especially for species not commonly represented in public databases. We present a software package (GOblet), which performs annotation based on GO terms for anonymous cDNA or protein sequences. It uses the species-independent GO structure and vocabulary together with a series of protein databases collected from various sites, to perform a detailed GO annotation by sequence similarity searches. The sensitivity and the reference protein sets can be selected by the user. GOblet runs automatically and is available as a public service on our web server. The paper also addresses the reliability of automated GO annotations by using a reference set of more than 6000 human proteins. The GOblet server is accessible at http://goblet.molgen.mpg.de.
Affiliation(s)
- Steffen Hennig
- Max-Planck Institute for Molecular Genetics, Ihnestrasse 73, D-14195 Berlin, Germany.
|
26
|
Schriml LM, Hill DP, Blake JA, Bono H, Wynshaw-Boris A, Pavan WJ, Ring BZ, Beisel K, Setou M, Okazaki Y. Human disease genes and their cloned mouse orthologs: exploration of the FANTOM2 cDNA sequence data set. Genome Res 2003; 13:1496-500. [PMID: 12819148 PMCID: PMC403698 DOI: 10.1101/gr.979503] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/24/2022]
Abstract
The FANTOM2 cDNA sequence data set is an excellent model to demonstrate the power of large-scale cDNA sequencing, with the goal of providing a full-length transcript sequence for each mouse gene. This data set enhances the use of the mouse as a model for human disease. Here we identify mouse cDNA sequences in the FANTOM2 data set for a set of 67 human disease genes that as of May 2002 had no corresponding mouse cDNA annotated in the Mouse Genome Informatics (MGI) database. These 67 human disease genes include genes related to neurological and eye disorders and cancer. We also present a list of the human disease genes and their cloned mouse orthologs found in two public databases, LocusLink and MGI. Allelic variant and gene functional information available in MGI provides additional information relative to these mouse models, whereas computed sequence-based connections at NCBI support facile navigation through multiple genomes.
Affiliation(s)
- Lynn M Schriml
- National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894, USA.
|
27
|
Camon E, Magrane M, Barrell D, Binns D, Fleischmann W, Kersey P, Mulder N, Oinn T, Maslen J, Cox A, Apweiler R. The Gene Ontology Annotation (GOA) project: implementation of GO in SWISS-PROT, TrEMBL, and InterPro. Genome Res 2003; 13:662-72. [PMID: 12654719 PMCID: PMC430163 DOI: 10.1101/gr.461403] [Citation(s) in RCA: 255] [Impact Index Per Article: 12.1] [Indexed: 11/24/2022]
Abstract
Gene Ontology Annotation (GOA) is a project run by the European Bioinformatics Institute (EBI) that aims to provide assignments of terms from the Gene Ontology (GO) resource to gene products in a number of its databases (http://www.ebi.ac.uk/GOA). In the first stage of this project, GO assignments have been applied to a data set representing the complete human proteome by a combination of electronic mappings and manual curation. This vocabulary has also been applied to the nonredundant proteome sets for all other completely sequenced organisms as well as to proteins from a wide range of organisms where the proteome is not yet complete.
Affiliation(s)
- Evelyn Camon
- EMBL Outstation-European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK
|
28
|
Raychaudhuri S, Altman RB. A literature-based method for assessing the functional coherence of a gene group. Bioinformatics 2003; 19:396-401. [PMID: 12584126 PMCID: PMC2669934 DOI: 10.1093/bioinformatics/btg002] [Citation(s) in RCA: 34] [Impact Index Per Article: 1.6] [Indexed: 11/12/2022] Open
Abstract
MOTIVATION: Many experimental and algorithmic approaches in biology generate groups of genes that need to be examined for related functional properties. For example, gene expression profiles are frequently organized into clusters of genes that may share functional properties. We evaluate a method, neighbor divergence per gene (NDPG), that uses scientific literature to assess whether a group of genes is functionally related. The method requires only a corpus of documents and an index connecting the documents to genes. RESULTS: We evaluate NDPG on 2796 functional groups generated by the Gene Ontology consortium in four organisms: mouse, fly, worm and yeast. NDPG finds functional coherence in 96, 92, 82 and 45% of the groups (at 99.9% specificity) in yeast, mouse, fly and worm, respectively.
Affiliation(s)
- Soumya Raychaudhuri
- Department of Genetics, Stanford Medical Informatics, 251 Campus Drive, MSOB X-215, Stanford University, Stanford, CA 94305-5479, USA
|
29
|
Okazaki Y, Furuno M, Kasukawa T, Adachi J, Bono H, Kondo S, Nikaido I, Osato N, Saito R, Suzuki H, Yamanaka I, Kiyosawa H, Yagi K, Tomaru Y, Hasegawa Y, Nogami A, Schönbach C, Gojobori T, Baldarelli R, Hill DP, Bult C, Hume DA, Quackenbush J, Schriml LM, Kanapin A, Matsuda H, Batalov S, Beisel KW, Blake JA, Bradt D, Brusic V, Chothia C, Corbani LE, Cousins S, Dalla E, Dragani TA, Fletcher CF, Forrest A, Frazer KS, Gaasterland T, Gariboldi M, Gissi C, Godzik A, Gough J, Grimmond S, Gustincich S, Hirokawa N, Jackson IJ, Jarvis ED, Kanai A, Kawaji H, Kawasawa Y, Kedzierski RM, King BL, Konagaya A, Kurochkin IV, Lee Y, Lenhard B, Lyons PA, Maglott DR, Maltais L, Marchionni L, McKenzie L, Miki H, Nagashima T, Numata K, Okido T, Pavan WJ, Pertea G, Pesole G, Petrovsky N, Pillai R, Pontius JU, Qi D, Ramachandran S, Ravasi T, Reed JC, Reed DJ, Reid J, Ring BZ, Ringwald M, Sandelin A, Schneider C, Semple CAM, Setou M, Shimada K, Sultana R, Takenaka Y, Taylor MS, Teasdale RD, Tomita M, Verardo R, Wagner L, Wahlestedt C, Wang Y, Watanabe Y, Wells C, Wilming LG, Wynshaw-Boris A, Yanagisawa M, Yang I, Yang L, Yuan Z, Zavolan M, Zhu Y, Zimmer A, Carninci P, Hayatsu N, Hirozane-Kishikawa T, Konno H, Nakamura M, Sakazume N, Sato K, Shiraki T, Waki K, Kawai J, Aizawa K, Arakawa T, Fukuda S, Hara A, Hashizume W, Imotani K, Ishii Y, Itoh M, Kagawa I, Miyazaki A, Sakai K, Sasaki D, Shibata K, Shinagawa A, Yasunishi A, Yoshino M, Waterston R, Lander ES, Rogers J, Birney E, Hayashizaki Y. Analysis of the mouse transcriptome based on functional annotation of 60,770 full-length cDNAs. Nature 2002; 420:563-73. [PMID: 12466851 DOI: 10.1038/nature01266] [Citation(s) in RCA: 1226] [Impact Index Per Article: 55.7] [Received: 09/19/2002] [Accepted: 10/28/2002] [Indexed: 01/10/2023]
Abstract
Only a small proportion of the mouse genome is transcribed into mature messenger RNA transcripts. There is an international collaborative effort to identify all full-length mRNA transcripts from the mouse, and to ensure that each is represented in a physical collection of clones. Here we report the manual annotation of 60,770 full-length mouse complementary DNA sequences. These are clustered into 33,409 'transcriptional units', contributing 90.1% of a newly established mouse transcriptome database. Of these transcriptional units, 4,258 are new protein-coding and 11,665 are new non-coding messages, indicating that non-coding RNA is a major component of the transcriptome. 41% of all transcriptional units showed evidence of alternative splicing. In protein-coding transcripts, 79% of splice variations altered the protein product. Whole-transcriptome analyses resulted in the identification of 2,431 sense-antisense pairs. The present work, completely supported by physical clones, provides the most comprehensive survey of a mammalian transcriptome so far, and is a valuable resource for functional genomics.
MESH Headings
- Alternative Splicing/genetics
- Amino Acid Motifs
- Animals
- Chromosomes, Mammalian/genetics
- Cloning, Molecular
- DNA, Complementary/genetics
- Databases, Genetic
- Expressed Sequence Tags
- Genes/genetics
- Genomics/methods
- Humans
- Membrane Proteins/genetics
- Mice/genetics
- Physical Chromosome Mapping
- Protein Structure, Tertiary
- Proteome/chemistry
- Proteome/genetics
- RNA, Antisense/genetics
- RNA, Messenger/analysis
- RNA, Messenger/genetics
- RNA, Untranslated/analysis
- RNA, Untranslated/genetics
- Transcription Initiation Site
- Transcription, Genetic/genetics
Affiliation(s)
- Y Okazaki
- Laboratory for Genome Exploration Research Group, RIKEN Genomic Sciences Center, RIKEN Yokohama Institute, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama, Kanagawa, 230-0045, Japan
|
30
|
|
31
|
Blake JA, Richardson JE, Bult CJ, Kadin JA, Eppig JT. The Mouse Genome Database (MGD): the model organism database for the laboratory mouse. Nucleic Acids Res 2002; 30:113-5. [PMID: 11752269 PMCID: PMC99116 DOI: 10.1093/nar/30.1.113] [Citation(s) in RCA: 110] [Impact Index Per Article: 5.0] [Indexed: 11/13/2022] Open
Abstract
The Mouse Genome Database (MGD) is the community database resource for the laboratory mouse, a key model organism for interpreting the human genome and for understanding human biology and disease (http://www.informatics.jax.org). MGD strives to provide a highly curated, highly integrated information resource that not only includes the consensus view of current knowledge about the mouse, but also provides comparative genomic information particularly for human and rat genomes. MGD includes extensive information about mouse genes, supporting all gene attribute assertions with experimental data, statements of evidence and citation. Detailed information about alleles and mouse mutants includes genotype, molecular variant and phenotype descriptions. Extensive collaboration with other data providers such as NCBI, RIKEN and SWISS-PROT provides standardization of gene:sequence associations and robust interconnections between large information systems based on shared sequence curation. Recent integration of large datasets of mouse full-length cDNAs and radiation-hybrid mapped ESTs, the continued development and use of extensive structured vocabularies and the expansion of the representation of phenotypes highlight this year's developments.
Affiliation(s)
- Judith A Blake
- The Jackson Laboratory, 600 Main Street, Bar Harbor, ME 04609, USA.
|
32
|
Current Awareness on Comparative and Functional Genomics. Comp Funct Genomics 2001. [PMCID: PMC2447222 DOI: 10.1002/cfg.60] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/02/2022] Open
|