1
Wang S, Pan C, Sheng H, Yang M, Yang C, Feng X, Hu C, Ma Y. Construction of a molecular regulatory network related to fat deposition by multi-tissue transcriptome sequencing of Jiaxian red cattle. iScience 2023; 26:108346. [PMID: 38026203 PMCID: PMC10665818 DOI: 10.1016/j.isci.2023.108346] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/02/2023] [Revised: 09/26/2023] [Accepted: 10/23/2023] [Indexed: 12/01/2023] Open
Abstract
Intramuscular fat (IMF) refers to the fat that accumulates between muscle bundles or within muscle cells, whose content significantly impacts the taste, tenderness, and flavor of meat products, making it a crucial economic characteristic in livestock production. However, the intricate mechanisms governing IMF deposition, involving non-coding RNAs (ncRNAs), genes, and complex regulatory networks, remain largely enigmatic. Identifying adipose tissue-specific genes and ncRNAs is paramount to unravel these molecular mysteries. This study, conducted on Jiaxian red cattle, harnessed whole transcriptome sequencing to unearth the nuances of circRNAs and miRNAs across seven distinct tissues. The interplay of these ncRNAs was assessed through differential expression analysis and network analysis. These findings are not only pivotal in unveiling the intricacies of fat deposition mechanisms but also lay a robust foundation for future research, setting the stage for enhancing IMF content in Jiaxian red cattle breeding.
Affiliation(s)
- Shuzhe Wang
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
- Cuili Pan
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
- Hui Sheng
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
- Mengli Yang
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
- Chaoyun Yang
  - Xichang College, Liangshan Prefecture, Sichuan Province, China
- Xue Feng
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
- Chunli Hu
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
- Yun Ma
  - Key Laboratory of Ruminant Molecular and Cellular Breeding of Ningxia Hui Autonomous Region, College of Animal Science and Technology, Ningxia University, Yinchuan 750021, China
2
Di Persia L, Lopez T, Arce A, Milone DH, Stegmayer G. exp2GO: Improving Prediction of Functions in the Gene Ontology With Expression Data. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:999-1008. [PMID: 35417352 DOI: 10.1109/tcbb.2022.3167245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
The computational methods for the prediction of gene function annotations aim to automatically find associations between a gene and a set of Gene Ontology (GO) terms describing its functions. Since the manual curation of novel annotations and the corresponding wet-lab experimental validations are very time-consuming and costly procedures, there is a need for computational tools that can reliably predict likely annotations and boost the discovery of new gene functions. This work proposes a novel method for predicting annotations based on the inference of GO similarities from expression similarities. The novel method was benchmarked against other methods on several public biological datasets, obtaining the best comparative results. exp2GO effectively improved the prediction of GO annotations in comparison to state-of-the-art methods. Furthermore, the proposal was validated with a full genome case where it was capable of predicting relevant and accurate biological functions. The repository of this project with full data and code is available at https://github.com/sinc-lab/exp2GO.
3
Chen Y, Qin Y, Fu Y, Gao Z, Deng Y. Integrated Analysis of Bulk RNA-Seq and Single-Cell RNA-Seq Unravels the Influences of SARS-CoV-2 Infections to Cancer Patients. Int J Mol Sci 2022; 23:ijms232415698. [PMID: 36555339 PMCID: PMC9779348 DOI: 10.3390/ijms232415698] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 10/22/2022] [Revised: 12/02/2022] [Accepted: 12/06/2022] [Indexed: 12/14/2022] Open
Abstract
Severe acute respiratory syndrome coronavirus 2 (SARS-CoV-2) is a highly contagious and pathogenic coronavirus that emerged in late 2019 and caused a pandemic of respiratory illness termed coronavirus disease 2019 (COVID-19). Cancer patients are more susceptible to SARS-CoV-2 infection. The treatment of cancer patients infected with SARS-CoV-2 is more complicated, and the patients are at risk of poor prognosis compared to other populations. Patients infected with SARS-CoV-2 are prone to rapid development of acute respiratory distress syndrome (ARDS), of which pulmonary fibrosis (PF) is considered a sequela. Both ARDS and PF are factors that contribute to poor prognosis in COVID-19 patients. However, the molecular mechanisms linking COVID-19, ARDS and PF in COVID-19 patients with cancer are not well understood. In this study, the common differentially expressed genes (DEGs) between COVID-19 patients with and without cancer were identified. Based on the common DEGs, a series of analyses were performed, including Gene Ontology (GO) and pathway analysis, protein-protein interaction (PPI) network construction and hub gene extraction, transcription factor (TF)-DEG regulatory network construction, TF-DEG-miRNA coregulatory network construction and drug molecule identification. The candidate drug molecules (e.g., Tamibarotene CTD 00002527) obtained by this study might be useful as therapeutics for COVID-19 patients with cancer. In addition, the common DEGs among ARDS, PF and COVID-19 patients with and without cancer are TNFSF10 and IFITM2. These two genes may serve as potential therapeutic targets in the treatment of COVID-19 patients with cancer. Changes in the expression levels of TNFSF10 and IFITM2 in CD14+/CD16+ monocytes may affect the immune response of COVID-19 patients. Specifically, changes in the expression level of TNFSF10 in monocytes can be considered as an immune signature in COVID-19 patients with hematologic cancer. Targeting N6-methyladenosine (m6A) pathways (e.g., METTL3/SERPINA1 axis) to restrict SARS-CoV-2 reproduction has therapeutic potential for COVID-19 patients.
Affiliation(s)
- Yu Chen
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, HI 96813, USA
  - Department of Molecular Biosciences and Bioengineering, College of Tropical Agriculture and Human Resources, University of Hawaii at Manoa, Honolulu, HI 96822, USA
- Yujia Qin
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, HI 96813, USA
- Yuanyuan Fu
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, HI 96813, USA
- Zitong Gao
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, HI 96813, USA
  - Department of Molecular Biosciences and Bioengineering, College of Tropical Agriculture and Human Resources, University of Hawaii at Manoa, Honolulu, HI 96822, USA
- Youping Deng
  - Department of Quantitative Health Sciences, John A. Burns School of Medicine, University of Hawaii at Manoa, Honolulu, HI 96813, USA
  - Correspondence: Tel.: +1-8086921664
4
Processes in DNA damage response from a whole-cell multi-omics perspective. iScience 2022; 25:105341. [PMID: 36339253 PMCID: PMC9633746 DOI: 10.1016/j.isci.2022.105341] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 12/16/2021] [Revised: 08/10/2022] [Accepted: 10/10/2022] [Indexed: 11/09/2022] Open
Abstract
Technological advances have made it feasible to collect multi-condition multi-omic time courses of cellular response to perturbation, but the complexity of these datasets impedes discovery due to challenges in data management, analysis, visualization, and interpretation. Here, we report a whole-cell mechanistic analysis of HL-60 cellular response to bendamustine. We integrate both enrichment and network analysis to show the progression of DNA damage and programmed cell death over time in molecular, pathway, and process-level detail using an interactive analysis framework for multi-omics data. Our framework, Mechanism of Action Generator Involving Network analysis (MAGINE), automates network construction and enrichment analysis across multiple samples and platforms, which can be integrated into our annotated gene-set network to combine the strengths of networks and ontology-driven analysis. Taken together, our work demonstrates how multi-omics integration can be used to explore signaling processes at various resolutions and demonstrates multi-pathway involvement beyond the canonical bendamustine mechanism.
5
Fung KW, Xu J, Ameye F, Burelle L, MacNeil J. Evaluation of the International Classification of Health Interventions (ICHI) in the coding of common surgical procedures. J Am Med Inform Assoc 2021; 29:43-51. [PMID: 34643710 DOI: 10.1093/jamia/ocab220] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/28/2021] [Accepted: 10/27/2021] [Indexed: 11/12/2022] Open
Abstract
OBJECTIVE: To evaluate the International Classification of Health Interventions (ICHI) in the clinical and statistical use cases. MATERIALS AND METHODS: We identified the 300 most-performed surgical procedures as represented by their display names in an electronic health record. For comparison with existing coding systems, we coded the procedures in ICHI, SNOMED CT, International Classification of Diseases (ICD)-10-PCS, and CCI (Canadian Classification of Health Interventions), using postcoordination (modification of existing codes by adding other codes) when applicable. Failure analysis was done for cases where full representation was not achieved. The ICHI encoding was further evaluated for adequacy to support statistical reporting by the Organisation for Economic Co-operation and Development (OECD) and European Union (EU) categories of surgical procedures. RESULTS: After deduplication, 229 distinct procedures remained. Without postcoordination, ICHI achieved full representation in 52.8%. A further 19.2% could be fully represented with postcoordination. SNOMED CT was the best performing overall, with 94.3% full representation without postcoordination, and 99.6% with postcoordination. Failure analysis showed that "method" and "target" constituted most of the missing information for ICHI encoding. For all OECD/EU surgical categories, ICHI coding was adequate to support statistical reporting. One OECD/EU category ("Hip replacement, secondary") required postcoordination for correct assignment. CONCLUSION: In the clinical use case of capturing information in the electronic health record, ICHI was outperformed by the clinically oriented procedure coding systems (SNOMED CT and CCI), but was comparable to ICD-10-PCS. Postcoordination could be an effective and efficient means of improving coverage. ICHI is generally adequate for the collection of international statistics.
Affiliation(s)
- Kin Wah Fung
  - National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Julia Xu
  - National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Filip Ameye
  - National Institute for Health and Disability Insurance, Brussels, Belgium
- Lisa Burelle
  - Canadian Institute for Health Information, Ottawa, Canada
- Janice MacNeil
  - Canadian Institute for Health Information, Ottawa, Canada
6
Zhao Y, Wang J, Guo M, Zhang X, Yu G. Cross-Species Protein Function Prediction with Asynchronous-Random Walk. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2021; 18:1439-1450. [PMID: 31562099 DOI: 10.1109/tcbb.2019.2943342] [Citation(s) in RCA: 6] [Impact Index Per Article: 2.0] [Indexed: 06/10/2023]
Abstract
Protein function prediction is a fundamental task in the post-genomic era. Available functional annotations of proteins are incomplete and the annotations of two homologous species are complementary to each other. However, how to effectively leverage mutually complementary annotations of different species to further boost the prediction performance is still not well studied. In this paper, we propose a cross-species protein function prediction approach by performing Asynchronous Random Walk on a heterogeneous network (AsyRW). AsyRW first constructs a heterogeneous network to integrate multiple functional association networks derived from different biological data, established homology relationships between proteins from different species, known annotations of proteins, and Gene Ontology (GO). To account for the intrinsic intra- and inter-species structures of proteins and that of GO, AsyRW quantifies the individual walk length of each network node using the gravity-like theory, and then performs an asynchronous random walk with the individual lengths to predict associations between proteins and GO terms. Experiments on annotations archived in different years show that individual walk lengths and the asynchronous random walk can effectively leverage the complementary annotations of different species, and that AsyRW significantly outperforms other related and competitive methods. The codes of AsyRW are available at: http://mlda.swu.edu.cn/codes.php?name=AsyRW.
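As a rough illustration of what per-node walk lengths mean, the sketch below runs a plain random walk from each node for that node's own number of steps. It is not the AsyRW implementation: the toy adjacency matrix, the function names, and the walk lengths are all assumptions made for illustration, and the gravity-like rule that AsyRW uses to choose the lengths is not reproduced here.

```python
def row_normalize(adj):
    """Turn an adjacency matrix into a row-stochastic transition matrix."""
    out = []
    for row in adj:
        total = sum(row)
        out.append([v / total if total else 0.0 for v in row])
    return out


def walk(adj, start, steps):
    """Distribution over nodes after `steps` random-walk steps from `start`."""
    transition = row_normalize(adj)
    n = len(transition)
    p = [0.0] * n
    p[start] = 1.0
    for _ in range(steps):
        p = [sum(p[i] * transition[i][j] for i in range(n)) for j in range(n)]
    return p


def asynchronous_walks(adj, lengths):
    """Run one walk per node, each with its own (assumed, given) length."""
    return [walk(adj, node, lengths[node]) for node in range(len(adj))]


# Hypothetical 3-node path graph 0 - 1 - 2 with hand-picked walk lengths.
toy_adj = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
toy_lengths = [1, 2, 0]
distributions = asynchronous_walks(toy_adj, toy_lengths)
```

In a heterogeneous protein/GO network, the rows of such a result that belong to proteins and the columns that belong to GO terms would be read as association scores; here the graph is only a placeholder.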
7
Moro G, Masseroli M. Gene function finding through cross-organism ensemble learning. BioData Min 2021; 14:14. [PMID: 33579334 PMCID: PMC7879670 DOI: 10.1186/s13040-021-00239-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/15/2020] [Accepted: 01/10/2021] [Indexed: 11/12/2022] Open
Abstract
Background: Structured biological information about genes and proteins is a valuable resource to improve discovery and understanding of complex biological processes via machine learning algorithms. Gene Ontology (GO) controlled annotations describe, in a structured form, features and functions of genes and proteins of many organisms. However, such valuable annotations are not always reliable and sometimes are incomplete, especially for rarely studied organisms. Here, we present GeFF (Gene Function Finder), a novel cross-organism ensemble learning method able to reliably predict new GO annotations of a target organism from GO annotations of another source organism that is evolutionarily related and better studied. Results: Using a supervised method, GeFF predicts unknown annotations from random perturbations of existing annotations. The perturbation consists of randomly deleting a fraction of known annotations in order to produce a reduced annotation set. The key idea is to train a supervised machine learning algorithm with the reduced annotation set to predict, namely to rebuild, the original annotations. The resulting prediction model, in addition to accurately rebuilding the original known annotations for an organism from their perturbed version, also effectively predicts new unknown annotations for the organism. Moreover, the prediction model is also able to discover new unknown annotations in different target organisms without retraining. We combined our novel method with different ensemble learning approaches and compared them to each other and to an equivalent single model technique. We tested the method with five different organisms using their GO annotations: Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum. The outcomes demonstrate the effectiveness of the cross-organism ensemble approach, which can be customized with a trade-off between the desired number of predicted new annotations and their precision. A Web application to browse both the input annotations used and the predicted ones, choosing the ensemble prediction method to use, is publicly available at http://tiny.cc/geff/. Conclusions: Our novel cross-organism ensemble learning method provides reliable predicted novel gene annotations, i.e., functions, ranked according to an associated likelihood value. They are very valuable both to speed up annotation curation, focusing it on the prioritized new annotations predicted, and to complement the known annotations available.
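The perturbation step described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not GeFF's code: the gene names, term labels, function name, and the 50% drop fraction are hypothetical, and the supervised model that would be trained on the (reduced, original) pair is left out.

```python
import random


def perturb_annotations(annotations, drop_fraction, seed=0):
    """Randomly delete a fraction of known (gene, term) annotations.

    Returns the reduced annotation set (the training input) and the set of
    deleted pairs (what a model would be asked to rebuild).
    """
    rng = random.Random(seed)
    pairs = sorted((gene, term) for gene, terms in annotations.items() for term in terms)
    n_drop = int(len(pairs) * drop_fraction)
    dropped = set(rng.sample(pairs, n_drop))
    reduced = {
        gene: {term for term in terms if (gene, term) not in dropped}
        for gene, terms in annotations.items()
    }
    return reduced, dropped


# Hypothetical toy annotations; the labels are placeholders, not real GO IDs.
original = {
    "geneA": {"term1", "term2", "term3"},
    "geneB": {"term1", "term4"},
    "geneC": {"term2"},
}
reduced, dropped = perturb_annotations(original, drop_fraction=0.5)
```

The pair `(reduced, original)` then plays the role of input and target for whatever supervised learner is chosen; varying `seed` yields different perturbations of the same annotation set.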
Affiliation(s)
- Gianluca Moro
  - DISI - University of Bologna, Via dell'Università, Cesena (FC), Italy
- Marco Masseroli
  - DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, Milan, 20133, Italy
8
Zhao Y, Wang J, Chen J, Zhang X, Guo M, Yu G. A Literature Review of Gene Function Prediction by Modeling Gene Ontology. Front Genet 2020; 11:400. [PMID: 32391061 PMCID: PMC7193026 DOI: 10.3389/fgene.2020.00400] [Citation(s) in RCA: 30] [Impact Index Per Article: 7.5] [Received: 01/02/2020] [Accepted: 03/30/2020] [Indexed: 12/14/2022] Open
Abstract
Annotating the functional properties of gene products, i.e., RNAs and proteins, is a fundamental task in biology. The Gene Ontology database (GO) was developed to systematically describe the functional properties of gene products across species, and to facilitate the computational prediction of gene function. As GO is routinely updated, it serves as the gold standard and main knowledge source in functional genomics. Many gene function prediction methods making use of GO have been proposed. But no literature review has summarized these methods and the possibilities for future efforts from the perspective of GO. To bridge this gap, we review the existing methods with an emphasis on recent solutions. First, we introduce the conventions of GO and the widely adopted evaluation metrics for gene function prediction. Next, we summarize current methods of gene function prediction that apply GO in different ways, such as using hierarchical or flat inter-relationships between GO terms, compressing massive GO terms and quantifying semantic similarities. Although many efforts have improved performance by harnessing GO, we conclude that there remain many largely overlooked but important topics for future research.
Affiliation(s)
- Yingwen Zhao
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing, China
- Jian Chen
  - State Key Laboratory of Agrobiotechnology and National Maize Improvement Center, China Agricultural University, Beijing, China
- Xiangliang Zhang
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing, China
  - CBRC, King Abdullah University of Science and Technology, Thuwal, Saudi Arabia
9
Cohen I, David E(O), Netanyahu NS. Supervised and Unsupervised End-to-End Deep Learning for Gene Ontology Classification of Neural In Situ Hybridization Images. ENTROPY 2019; 21:e21030221. [PMID: 33266936 PMCID: PMC7514702 DOI: 10.3390/e21030221] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 03/27/2018] [Revised: 10/22/2018] [Accepted: 12/19/2018] [Indexed: 11/16/2022]
Abstract
In recent years, large datasets of high-resolution mammalian neural images have become available, which has prompted active research on the analysis of gene expression data. Traditional image processing methods are typically applied for learning functional representations of genes, based on their expressions in these brain images. In this paper, we describe a novel end-to-end deep learning-based method for generating compact, translation-invariant representations of in situ hybridization (ISH) images. In contrast to traditional image processing methods, our method relies on deep convolutional denoising autoencoders (CDAE) for processing raw pixel inputs and generating the desired compact image representations. We provide an in-depth description of our deep learning-based approach, and present extensive experimental results, demonstrating that representations extracted by CDAE can help learn features of functional gene ontology categories for their classification in a highly accurate manner. Our methods improve the previous state-of-the-art classification rate (Liscovitch, et al.) from an average AUC of 0.92 to 0.997, i.e., a 96% reduction in error rate. Furthermore, the representation vectors generated by our method are more compact in comparison to previous state-of-the-art methods, allowing for a more efficient high-level representation of images. These results are obtained with significantly downsampled images in comparison to the original high-resolution ones, further underscoring the robustness of our proposed method.
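The 96% figure follows from treating 1 - AUC as the error rate, which is an assumption about how the abstract defines "error"; under that reading, the arithmetic can be checked with a short sketch (the function name is illustrative):

```python
def relative_error_reduction(auc_before: float, auc_after: float) -> float:
    """Fractional reduction in (1 - AUC) when moving from auc_before to auc_after."""
    error_before = 1.0 - auc_before
    error_after = 1.0 - auc_after
    return (error_before - error_after) / error_before


# Values quoted in the abstract: average AUC improves from 0.92 to 0.997.
reduction = relative_error_reduction(0.92, 0.997)
print(f"{reduction:.2%}")
```

The error falls from 0.08 to 0.003, a relative reduction of roughly 0.9625, which matches the abstract's rounded 96%.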
Affiliation(s)
- Ido Cohen
  - Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
- Eli (Omid) David
  - Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
- Nathan S. Netanyahu
  - Department of Computer Science, Bar-Ilan University, Ramat-Gan 5290002, Israel
  - Gonda Brain Research Center, Bar-Ilan University, Ramat-Gan 5290002, Israel
  - Center for Automation Research, UMIACS, University of Maryland at College Park, College Park, MD 20742, USA
10
Hadarovich A, Anishchenko I, Tuzikov AV, Kundrotas PJ, Vakser IA. Gene ontology improves template selection in comparative protein docking. Proteins 2018; 87:245-253. [PMID: 30520123 DOI: 10.1002/prot.25645] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 05/26/2018] [Revised: 10/21/2018] [Accepted: 11/29/2018] [Indexed: 02/06/2023]
Abstract
Structural characterization of protein-protein interactions is essential for our ability to study life processes at the molecular level. Computational modeling of protein complexes (protein docking) is important as the source of their structure and as a way to understand the principles of protein interaction. Rapidly evolving comparative docking approaches utilize target/template similarity metrics, which are often based on the protein structure. Although the structural similarity, generally, yields good performance, other characteristics of the interacting proteins (eg, function, biological process, and localization) may improve the prediction quality, especially in the case of weak target/template structural similarity. For the ranking of a pool of models for each target, we tested scoring functions that quantify similarity of Gene Ontology (GO) terms assigned to target and template proteins in three ontology domains-biological process, molecular function, and cellular component (GO-score). The scoring functions were tested in docking of bound, unbound, and modeled proteins. The results indicate that the combined structural and GO-terms functions improve the scoring, especially in the twilight zone of structural similarity, typical for protein models of limited accuracy.
Affiliation(s)
- Anna Hadarovich
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas
  - United Institute of Informatics Problems, National Academy of Sciences, Minsk, Belarus
- Ivan Anishchenko
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas
- Alexander V Tuzikov
  - United Institute of Informatics Problems, National Academy of Sciences, Minsk, Belarus
- Petras J Kundrotas
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas
- Ilya A Vakser
  - Computational Biology Program, The University of Kansas, Lawrence, Kansas
  - Department of Molecular Biosciences, The University of Kansas, Lawrence, Kansas
11
Zhao Y, Fu G, Wang J, Guo M, Yu G. Gene function prediction based on Gene Ontology Hierarchy Preserving Hashing. Genomics 2018; 111:334-342. [PMID: 29477548 DOI: 10.1016/j.ygeno.2018.02.008] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.5] [Received: 11/20/2017] [Revised: 02/02/2018] [Accepted: 02/16/2018] [Indexed: 12/27/2022]
Abstract
Gene Ontology (GO) uses structured vocabularies (or terms) to describe the molecular functions, biological roles, and cellular locations of gene products in a hierarchical ontology. GO annotations associate genes with GO terms and indicate that the given gene products carry out the biological functions described by the relevant terms. However, predicting correct GO annotations for genes from the massive set of terms defined by GO is a difficult challenge. To address this challenge, we introduce a Gene Ontology Hierarchy Preserving Hashing (HPHash) based semantic method for gene function prediction. HPHash first measures the taxonomic similarity between GO terms. It then uses a hierarchy preserving hashing technique to keep the hierarchical order between GO terms, and to optimize a series of hashing functions to encode massive GO terms via compact binary codes. After that, HPHash utilizes these hashing functions to project the gene-term association matrix into a low-dimensional one and performs semantic similarity-based gene function prediction in the low-dimensional space. Experimental results on three model species (Homo sapiens, Mus musculus and Rattus norvegicus) for interspecies gene function prediction show that HPHash performs better than other related approaches and that it is robust to the number of hash functions. In addition, we also take HPHash as a plugin for BLAST-based gene function prediction. In these experiments, HPHash again significantly improves the prediction performance. The codes of HPHash are available at: http://mlda.swu.edu.cn/codes.php?name=HPHash.
Affiliation(s)
- Yingwen Zhao
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Guangyuan Fu
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Jun Wang
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
- Maozu Guo
  - School of Electrical and Information Engineering, Beijing University of Civil Engineering and Architecture, Beijing 100044, China
  - Beijing Key Laboratory of Intelligent Processing for Building Big Data, Beijing 100044, China
- Guoxian Yu
  - College of Computer and Information Science, Southwest University, Chongqing 400715, China
12
Protein Function Prediction Using Deep Restricted Boltzmann Machines. BIOMED RESEARCH INTERNATIONAL 2017; 2017:1729301. [PMID: 28744460 PMCID: PMC5506480 DOI: 10.1155/2017/1729301] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 03/30/2017] [Accepted: 05/30/2017] [Indexed: 11/17/2022]
Abstract
Accurately annotating biological functions of proteins is one of the key tasks in the postgenome era. Many machine learning based methods have been applied to predict functional annotations of proteins, but this task is rarely solved by deep learning techniques. Deep learning techniques have recently been applied successfully to a wide range of problems, such as video, image, and natural language processing. Inspired by these successful applications, we investigate deep restricted Boltzmann machines (DRBM), a representative deep learning technique, to predict the missing functional annotations of partially annotated proteins. Experimental results on Homo sapiens, Saccharomyces cerevisiae, Mus musculus, and Drosophila show that DRBM achieves better performance than other related methods across different evaluation metrics, and it also runs faster than the compared methods.
13
Fortelny N, Butler GS, Overall CM, Pavlidis P. Protease-Inhibitor Interaction Predictions: Lessons on the Complexity of Protein-Protein Interactions. Mol Cell Proteomics 2017; 16:1038-1051. [PMID: 28385878 PMCID: PMC5461536 DOI: 10.1074/mcp.m116.065706] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.6] [Received: 11/21/2016] [Revised: 03/24/2017] [Indexed: 01/18/2023] Open
Abstract
Protein interactions shape proteome function and thus biology. Identification of protein interactions is a major goal in molecular biology, but biochemical methods, although improving, remain limited in coverage and accuracy. Whereas computational predictions can guide biochemical experiments, low validation rates of predictions remain a major limitation. Here, we investigated computational methods in the prediction of a specific type of interaction, the inhibitory interactions between proteases and their inhibitors. Proteases generate thousands of proteoforms that dynamically shape the functional state of proteomes. Despite the important regulatory role of proteases, knowledge of their inhibitors remains largely incomplete with the vast majority of proteases lacking an annotated inhibitor. To link inhibitors to their target proteases on a large scale, we applied computational methods to predict inhibitory interactions between proteases and their inhibitors based on complementary data, including coexpression, phylogenetic similarity, structural information, co-annotation, and colocalization, and also surveyed general protein interaction networks for potential inhibitory interactions. In testing nine predicted interactions biochemically, we validated the inhibition of kallikrein 5 by serpin B12. Despite the use of a wide array of complementary data, we found a high false positive rate of computational predictions in biochemical follow-up. Based on a protease-specific definition of true negatives derived from the biochemical classification of proteases and inhibitors, we analyzed prediction accuracy of individual features, thereby we identified feature-specific limitations, which also affected general protein interaction prediction methods. Interestingly, proteases were often not coexpressed with most of their functional inhibitors, contrary to what is commonly assumed and extrapolated predominantly from cell culture experiments. 
Predictions of inhibitory interactions were indeed more challenging than predictions of nonproteolytic and noninhibitory interactions. In summary, we describe a novel and well-defined but difficult protein interaction prediction task and thereby highlight limitations of computational interaction prediction methods.
Affiliation(s)
- Nikolaus Fortelny
- From the ‡Department of Biochemistry and Molecular Biology
- §Michael Smith Laboratories
- ¶Centre for Blood Research
| | - Georgina S Butler
- ¶Centre for Blood Research
- ‖Department of Oral Biological and Medical Sciences, Faculty of Dentistry
| | - Christopher M Overall
- From the ‡Department of Biochemistry and Molecular Biology
- ¶Centre for Blood Research
- ‖Department of Oral Biological and Medical Sciences, Faculty of Dentistry
| | - Paul Pavlidis
- §Michael Smith Laboratories;
- **Department of Psychiatry, University of British Columbia, Vancouver, British Columbia, Canada
|
14
|
Jiang B, Kloster K, Gleich DF, Gribskov M. AptRank: an adaptive PageRank model for protein function prediction on bi-relational graphs. Bioinformatics 2017; 33:1829-1836. [PMID: 28200073 DOI: 10.1093/bioinformatics/btx029] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.6] [Received: 05/23/2016] [Accepted: 02/14/2017] [Indexed: 11/15/2022] Open
Affiliation(s)
- Biaobin Jiang
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
| | - Kyle Kloster
- Department of Mathematics, Purdue University, West Lafayette, IN, USA
| | - David F Gleich
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
| | - Michael Gribskov
- Department of Biological Sciences, Purdue University, West Lafayette, IN, USA
- Department of Computer Science, Purdue University, West Lafayette, IN, USA
|
15
|
Abstract
BACKGROUND Gene Ontology (GO) is a collaborative project that maintains and develops a controlled vocabulary (or terms) to describe the molecular function, biological roles and cellular location of gene products in a hierarchical ontology. GO also provides GO annotations that associate genes with GO terms. Members of the GO Consortium independently and collaboratively annotate terms to gene products, mainly from the model organisms (or species) they are interested in. Due to experimental ethics, the research interests of biologists and resource limitations, homologous genes from different species are currently annotated with different terms. These differences can be attributed more to incomplete annotations of genes than to functional differences between them. RESULTS Semantic similarity between genes is derived from the GO hierarchy and the annotations of genes. It is positively correlated with the similarity derived from various types of biological data and has been applied to predict gene function. In this paper, we investigate whether it is possible to replenish annotations of incompletely annotated genes by using semantic similarity between genes from two species with homology. For this investigation, we utilize three representative semantic similarity metrics to compute similarity between genes from two species. Next, we determine the k nearest neighbor genes from the two species based on the chosen metric and then use terms annotated to the k neighbors of a gene to replenish annotations of that gene. We perform experiments on archived (from Jan-2014 to Jan-2016) GO annotations of four species (Human, Mouse, Danio rerio and Arabidopsis thaliana) to assess the contribution of semantic similarity between genes from different species. The experimental results demonstrate that: (1) semantic similarity between genes from homologous species contributes much more to the improved accuracy (by 53.22%) than genes from a single species alone or genes from two species with low homology; (2) GO annotations of genes from homologous species are complementary to each other. CONCLUSIONS Our study shows that semantic similarity-based interspecies gene function annotation from homologous species is more prominent than traditional intraspecies approaches. This work can promote more research on semantic similarity-based function prediction across species.
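The k-nearest-neighbor replenishment step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: Jaccard overlap of annotation sets stands in for the three semantic similarity metrics the abstract mentions (which it does not name), and the gene identifiers, GO term labels, and the `replenish` function are hypothetical.

```python
from collections import Counter


def jaccard(a, b):
    """Stand-in similarity: overlap of two term sets (not a true GO semantic metric)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def replenish(gene, annotations, k=2):
    """Rank candidate terms for `gene` using votes from its k most similar genes."""
    own = annotations[gene]
    # Score every other gene, possibly from another species, against the query gene.
    scored = sorted(
        ((jaccard(own, terms), other)
         for other, terms in annotations.items() if other != gene),
        key=lambda pair: (-pair[0], pair[1]),
    )
    votes = Counter()
    for similarity, other in scored[:k]:
        # Only terms the query gene lacks are candidates for replenishment.
        for term in annotations[other] - own:
            votes[term] += similarity
    return sorted(votes.items(), key=lambda item: (-item[1], item[0]))


# Hypothetical toy annotations for two species.
annotations = {
    "human:G1": {"GO:A", "GO:B"},
    "mouse:G1": {"GO:A", "GO:B", "GO:C"},
    "mouse:G2": {"GO:A", "GO:D"},
    "mouse:G3": {"GO:E"},
}
candidates = replenish("human:G1", annotations, k=2)
```

Here `candidates` ranks `GO:C` ahead of `GO:D`, because the gene contributing `GO:C` shares more annotations with the query gene. Weighting votes by similarity is one design choice among several; an unweighted majority vote would be an equally plausible reading of the abstract.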
Affiliation(s)
- Guoxian Yu
- College of Computer and Information Sciences, Southwest University, Chongqing, China
| | - Wei Luo
- College of Computer and Information Sciences, Southwest University, Chongqing, China
| | - Guangyuan Fu
- College of Computer and Information Sciences, Southwest University, Chongqing, China
| | - Jun Wang
- College of Computer and Information Sciences, Southwest University, Chongqing, China
|
16
|
Domeniconi G, Masseroli M, Moro G, Pinoli P. Cross-organism learning method to discover new gene functionalities. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2016; 126:20-34. [PMID: 26724853 DOI: 10.1016/j.cmpb.2015.12.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 06/04/2015] [Revised: 11/16/2015] [Accepted: 12/08/2015] [Indexed: 06/05/2023]
Abstract
BACKGROUND Knowledge of gene and protein functions is paramount for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. Analyses for biomedical knowledge discovery greatly benefit from the availability of gene and protein functional feature descriptions expressed through controlled terminologies and ontologies, i.e., of gene and protein biomedical controlled annotations. In the last years, several databases of such annotations have become available; yet, these valuable annotations are incomplete, include errors and only some of them represent highly reliable human curated information. Computational techniques able to reliably predict new gene or protein annotations with an associated likelihood value are thus paramount. METHODS Here, we propose a novel cross-organisms learning approach to reliably predict new functionalities for the genes of an organism based on the known controlled annotations of the genes of another, evolutionarily related and better studied, organism. We leverage a new representation of the annotation discovery problem and a random perturbation of the available controlled annotations to allow the application of supervised algorithms to predict with good accuracy unknown gene annotations. Taking advantage of the numerous gene annotations available for a well-studied organism, our cross-organisms learning method creates and trains better prediction models, which can then be applied to predict new gene annotations of a target organism. RESULTS We tested and compared our method with the equivalent single organism approach on different gene annotation datasets of five evolutionarily related organisms (Homo sapiens, Mus musculus, Bos taurus, Gallus gallus and Dictyostelium discoideum). 
Results show both the usefulness of the perturbation method of available annotations for better prediction model training and a great improvement of the cross-organism models with respect to the single-organism ones, without influence of the evolutionary distance between the considered organisms. The generated ranked lists of reliably predicted annotations, which describe novel gene functionalities and have an associated likelihood value, are very valuable both to complement available annotations, for better coverage in biomedical knowledge discovery analyses, and to quicken the annotation curation process, by focusing it on the prioritized novel annotations predicted.
Affiliation(s)
- Giacomo Domeniconi
- DISI, Università degli Studi di Bologna, Via Venezia 52, 47521 Cesena, Italy.
| | - Marco Masseroli
- DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milan, Italy.
| | - Gianluca Moro
- DISI, Università degli Studi di Bologna, Via Venezia 52, 47521 Cesena, Italy.
| | - Pietro Pinoli
- DEIB, Politecnico di Milano, Piazza L. Da Vinci 32, 20133 Milan, Italy.
|
17
|
Chicco D, Masseroli M. Ontology-Based Prediction and Prioritization of Gene Functional Annotations. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:248-260. [PMID: 27045825 DOI: 10.1109/tcbb.2015.2459694] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 06/05/2023]
Abstract
Genes and their protein products are essential molecular units of a living organism. The knowledge of their functions is key for the understanding of physiological and pathological biological processes, as well as in the development of new drugs and therapies. The association of a gene or protein with its functions, described by controlled terms of biomolecular terminologies or ontologies, is named gene functional annotation. Very many and valuable gene annotations expressed through terminologies and ontologies are available. Nevertheless, they might include some erroneous information, since only a subset of annotations are reviewed by curators. Furthermore, they are incomplete by definition, given the rapidly evolving pace of biomolecular knowledge. In this scenario, computational methods that are able to quicken the annotation curation process and reliably suggest new annotations are very important. Here, we first propose a computational pipeline that uses different semantic and machine learning methods to predict novel ontology-based gene functional annotations; then, we introduce a new semantic prioritization rule to categorize the predicted annotations by their likelihood of being correct. Our tests and validations proved the effectiveness of our pipeline and prioritization of predicted annotations, by selecting as most likely manifold predicted annotations that were later confirmed.
|
18
|
Yu G, Fu G, Wang J, Zhu H. Predicting Protein Function via Semantic Integration of Multiple Networks. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:220-232. [PMID: 26800544 DOI: 10.1109/tcbb.2015.2459713] [Citation(s) in RCA: 25] [Impact Index Per Article: 3.1] [Indexed: 06/05/2023]
Abstract
Determining the biological functions of proteins is one of the key challenges in the post-genomic era. The rapidly accumulating large volumes of proteomic and genomic data drive the development of computational models for automatically predicting protein function at large scale. Recent approaches focus on integrating multiple heterogeneous data sources, and they often get better results than methods that use a single data source alone. In this paper, we investigate how to integrate multiple biological data sources with biological knowledge, i.e., Gene Ontology (GO), for protein function prediction. We propose a method, called SimNet, to Semantically integrate multiple functional association Networks derived from heterogeneous data sources. SimNet first utilizes GO annotations of proteins to capture the semantic similarity between proteins and introduces a semantic kernel based on the similarity. Next, SimNet constructs a composite network, obtained as a weighted summation of individual networks, and aligns the network with the kernel to get the weights assigned to individual networks. Then, it applies a network-based classifier on the composite network to predict protein function. Experimental results on heterogeneous proteomic data sources of Yeast, Human, Mouse, and Fly show that SimNet not only achieves better (or comparable) results than other related competitive approaches, but also takes much less time. The Matlab codes of SimNet are available at https://sites.google.com/site/guoxian85/simnet.
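The idea of weighting individual networks by how well they align with a semantic kernel, then summing them into a composite network, can be illustrated with the sketch below. It is an assumption-laden simplification: the weights here are simply normalized cosine-style alignment scores, whereas the abstract does not specify how SimNet actually solves for its weights, and all matrices and function names are hypothetical.

```python
def frobenius(a, b):
    """Frobenius inner product of two equally sized square matrices."""
    return sum(x * y for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b))


def alignment(network, kernel):
    """Cosine-style alignment between a network and the target kernel."""
    denom = (frobenius(network, network) * frobenius(kernel, kernel)) ** 0.5
    return frobenius(network, kernel) / denom if denom else 0.0


def composite_network(networks, kernel):
    """Weight each network by its (non-negative) alignment, then sum them."""
    scores = [max(alignment(net, kernel), 0.0) for net in networks]
    total = sum(scores)
    # Fall back to uniform weights when no network aligns with the kernel.
    weights = ([s / total for s in scores] if total
               else [1.0 / len(networks)] * len(networks))
    size = len(kernel)
    combined = [
        [sum(w * net[i][j] for w, net in zip(weights, networks)) for j in range(size)]
        for i in range(size)
    ]
    return weights, combined


# Hypothetical semantic kernel: proteins 0 and 1 are functionally similar.
kernel = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
net_good = [[0, 1, 0], [1, 0, 0], [0, 0, 0]]   # links the similar pair only
net_mixed = [[0, 1, 1], [1, 0, 0], [1, 0, 0]]  # also links a dissimilar pair
weights, combined = composite_network([net_good, net_mixed], kernel)
```

In this toy case the network that only links the semantically similar pair receives the larger weight, so the spurious edge between proteins 0 and 2 is down-weighted in the composite network.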
|
19
|
Masseroli M, Canakoglu A, Ceri S. Integration and Querying of Genomic and Proteomic Semantic Annotations for Biomedical Knowledge Extraction. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2016; 13:209-219. [PMID: 27045824 DOI: 10.1109/tcbb.2015.2453944] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 06/05/2023]
Abstract
Understanding complex biological phenomena involves answering complex biomedical questions on multiple biomolecular information simultaneously, which are expressed through multiple genomic and proteomic semantic annotations scattered in many distributed and heterogeneous data sources; such heterogeneity and dispersion hamper the biologists' ability of asking global queries and performing global evaluations. To overcome this problem, we developed a software architecture to create and maintain a Genomic and Proteomic Knowledge Base (GPKB), which integrates several of the most relevant sources of such dispersed information (including Entrez Gene, UniProt, IntAct, Expasy Enzyme, GO, GOA, BioCyc, KEGG, Reactome, and OMIM). Our solution is general, as it uses a flexible, modular, and multilevel global data schema based on abstraction and generalization of integrated data features, and a set of automatic procedures for easing data integration and maintenance, also when the integrated data sources evolve in data content, structure, and number. These procedures also assure consistency, quality, and provenance tracking of all integrated data, and perform the semantic closure of the hierarchical relationships of the integrated biomedical ontologies. At http://www.bioinformatics.deib.polimi.it/GPKB/, a Web interface allows graphical easy composition of queries, although complex, on the knowledge base, supporting also semantic query expansion and comprehensive explorative search of the integrated data to better sustain biomedical knowledge extraction.
|
20
|
Glass K, Girvan M. Finding New Order in Biological Functions from the Network Structure of Gene Annotations. PLoS Comput Biol 2015; 11:e1004565. [PMID: 26588252 PMCID: PMC4654495 DOI: 10.1371/journal.pcbi.1004565] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Received: 01/06/2015] [Accepted: 09/23/2015] [Indexed: 11/19/2022] Open
Abstract
The Gene Ontology (GO) provides biologists with a controlled terminology that describes how genes are associated with functions and how functional terms are related to one another. These term-term relationships encode how scientists conceive the organization of biological functions, and they take the form of a directed acyclic graph (DAG). Here, we propose that the network structure of gene-term annotations made using GO can be employed to establish an alternative approach for grouping functional terms that captures intrinsic functional relationships that are not evident in the hierarchical structure established in the GO DAG. Instead of relying on an externally defined organization for biological functions, our approach connects biological functions together if they are performed by the same genes, as indicated in a compendium of gene annotation data from numerous different sources. We show that grouping terms by this alternate scheme provides a new framework with which to describe and predict the functions of experimentally identified sets of genes.
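A rough sketch of connecting functional terms when they are annotated to the same genes is given below. It assumes a simple rule that is not taken from the paper: two terms are linked when at least `min_shared` genes carry both, and groups are the connected components of the resulting term network. The gene and term names are hypothetical, and the authors' actual grouping procedure may differ substantially.

```python
from collections import defaultdict
from itertools import combinations


def shared_gene_counts(gene_to_terms):
    """Count, for every pair of terms, how many genes are annotated with both."""
    counts = defaultdict(int)
    for terms in gene_to_terms.values():
        for a, b in combinations(sorted(terms), 2):
            counts[(a, b)] += 1
    return dict(counts)


def group_terms(gene_to_terms, min_shared=2):
    """Group terms linked, directly or indirectly, by at least `min_shared` genes."""
    neighbors = defaultdict(set)
    for (a, b), n in shared_gene_counts(gene_to_terms).items():
        if n >= min_shared:
            neighbors[a].add(b)
            neighbors[b].add(a)
    all_terms = sorted({t for terms in gene_to_terms.values() for t in terms})
    seen, groups = set(), []
    for term in all_terms:
        if term in seen:
            continue
        # Depth-first search collects one connected component of the term network.
        stack, group = [term], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        groups.append(sorted(group))
    return groups


# Hypothetical gene-to-term annotations.
gene_to_terms = {
    "g1": {"T1", "T2"},
    "g2": {"T1", "T2", "T3"},
    "g3": {"T3", "T4"},
    "g4": {"T3", "T4"},
}
groups = group_terms(gene_to_terms, min_shared=2)
```

With this toy input the terms fall into two groups, `T1` with `T2` and `T3` with `T4`, regardless of where those terms sit in the GO hierarchy.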
Affiliation(s)
- Kimberly Glass
- Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute and Harvard T. H. Chan School of Public Health, Boston, Massachusetts, United States of America
- Channing Division of Network Medicine, Department of Medicine, Brigham and Women’s Hospital and Harvard Medical School, Boston, Massachusetts, United States of America
- Physics Department, University of Maryland, College Park, Maryland, United States of America
| | - Michelle Girvan
- Physics Department, University of Maryland, College Park, Maryland, United States of America
- Institute for Physical Science and Technology, University of Maryland, College Park, Maryland, United States of America
- Santa Fe Institute, Santa Fe, New Mexico, United States of America
|
21
|
Yang W, Dierking K, Schulenburg H. WormExp: a web-based application for a Caenorhabditis elegans-specific gene expression enrichment analysis. Bioinformatics 2015; 32:943-5. [DOI: 10.1093/bioinformatics/btv667] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Received: 07/15/2015] [Accepted: 11/06/2015] [Indexed: 11/13/2022] Open
Abstract
Motivation: A particular challenge of the current omics age is to make sense of the inferred differential expression of genes and proteins. The most common approach is to perform a gene ontology (GO) enrichment analysis, thereby relying on a database that has been extracted from a variety of organisms and that can therefore only yield reliable information on evolutionary conserved functions.
Results: We here present a web-based application for a taxon-specific gene set exploration and enrichment analysis, which is expected to yield novel functional insights into newly determined gene sets. The approach is based on the complete collection of curated high-throughput gene expression data sets for the model nematode Caenorhabditis elegans, including 1786 gene sets from more than 350 studies.
Availability and implementation: WormExp is available at http://wormexp.zoologie.uni-kiel.de.
Contacts: hschulenburg@zoologie.uni-kiel.de
Supplementary information: Supplementary data are available at Bioinformatics online.
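For readers unfamiliar with gene set enrichment analysis, the sketch below shows one common way to score the overlap between a query gene list and a reference gene set: an upper-tail hypergeometric test. Using this test is an assumption for illustration only; the abstract does not state which statistic WormExp applies, and the function name and numbers are hypothetical.

```python
from math import comb


def enrichment_pvalue(universe, set_size, query_size, overlap):
    """P(at least `overlap` query genes fall in the gene set) under random sampling."""
    total = comb(universe, query_size)
    upper = min(set_size, query_size)
    # Sum hypergeometric probabilities for overlaps at least as large as observed.
    favorable = sum(
        comb(set_size, k) * comb(universe - set_size, query_size - k)
        for k in range(overlap, upper + 1)
    )
    return favorable / total


# Hypothetical numbers: 20 genes overall, a 5-gene reference set,
# a 5-gene query list, and 3 genes in common.
p_value = enrichment_pvalue(universe=20, set_size=5, query_size=5, overlap=3)
```

In practice the test would be repeated for every reference gene set and the resulting p-values corrected for multiple testing.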
Affiliation(s)
- Wentao Yang
- Evolutionary Ecology and Genetics, Zoological Institute, CAU Kiel, 24118 Kiel, Germany
| | - Katja Dierking
- Evolutionary Ecology and Genetics, Zoological Institute, CAU Kiel, 24118 Kiel, Germany
| | - Hinrich Schulenburg
- Evolutionary Ecology and Genetics, Zoological Institute, CAU Kiel, 24118 Kiel, Germany
|
22
|
Jennen DGJ, van Leeuwen DM, Hendrickx DM, Gottschalk RWH, van Delft JHM, Kleinjans JCS. Bayesian Network Inference Enables Unbiased Phenotypic Anchoring of Transcriptomic Responses to Cigarette Smoke in Humans. Chem Res Toxicol 2015; 28:1936-48. [PMID: 26360787 DOI: 10.1021/acs.chemrestox.5b00145] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Indexed: 11/28/2022]
Abstract
Microarray-based transcriptomic analysis has been demonstrated to hold the opportunity to study the effects of human exposure to, e.g., chemical carcinogens at the whole genome level, thus yielding broad-ranging molecular information on possible carcinogenic effects. Since genes do not operate individually but rather through concerted interactions, analyzing and visualizing networks of genes should provide important mechanistic information, especially upon connecting them to functional parameters, such as those derived from measurements of biomarkers for exposure and carcinogenic risk. Conventional methods such as hierarchical clustering and correlation analyses are frequently used to address these complex interactions but are limited as they do not provide directional causal dependence relationships. Therefore, our aim was to apply Bayesian network inference with the purpose of phenotypic anchoring of modified gene expressions. We investigated a use case on transcriptomic responses to cigarette smoking in humans, in association with plasma cotinine levels as biomarkers of exposure and aromatic DNA-adducts in blood cells as biomarkers of carcinogenic risk. Many of the genes that appear in the Bayesian networks surrounding plasma cotinine, and to a lesser extent around aromatic DNA-adducts, hold biologically relevant functions in inducing severe adverse effects of smoking. In conclusion, this study shows that Bayesian network inference enables unbiased phenotypic anchoring of transcriptomics responses. Furthermore, in all inferred Bayesian networks several dependencies are found which point to known but also to new relationships between the expression of specific genes, cigarette smoke exposure, DNA damaging-effects, and smoking-related diseases, in particular associated with apoptosis, DNA repair, and tumor suppression, as well as with autoimmunity.
Affiliation(s)
- Danyel G J Jennen
- Department of Toxicogenomics, Maastricht University, Universiteitssingel 40, 6229 ER Maastricht, The Netherlands
| | - Danitsja M van Leeuwen
- Department of Toxicogenomics, Maastricht University, Universiteitssingel 40, 6229 ER Maastricht, The Netherlands
| | - Diana M Hendrickx
- Department of Toxicogenomics, Maastricht University, Universiteitssingel 40, 6229 ER Maastricht, The Netherlands
| | - Ralph W H Gottschalk
- Department of Toxicogenomics, Maastricht University, Universiteitssingel 40, 6229 ER Maastricht, The Netherlands
| | - Joost H M van Delft
- Department of Toxicogenomics, Maastricht University, Universiteitssingel 40, 6229 ER Maastricht, The Netherlands
| | - Jos C S Kleinjans
- Department of Toxicogenomics, Maastricht University, Universiteitssingel 40, 6229 ER Maastricht, The Netherlands
|
23
|
Yu G, Zhu H, Domeniconi C, Liu J. Predicting protein function via downward random walks on a gene ontology. BMC Bioinformatics 2015; 16:271. [PMID: 26310806 PMCID: PMC4551531 DOI: 10.1186/s12859-015-0713-y] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.0] [Received: 02/14/2015] [Accepted: 08/20/2015] [Indexed: 12/24/2022] Open
Abstract
Background High-throughput bio-techniques accumulate ever-increasing amount of genomic and proteomic data. These data are far from being functionally characterized, despite the advances in gene (or gene’s product proteins) functional annotations. Due to experimental techniques and to the research bias in biology, the regularly updated functional annotation databases, i.e., the Gene Ontology (GO), are far from being complete. Given the importance of protein functions for biological studies and drug design, proteins should be more comprehensively and precisely annotated. Results We proposed downward Random Walks (dRW) to predict missing (or new) functions of partially annotated proteins. Particularly, we apply downward random walks with restart on the GO directed acyclic graph, along with the available functions of a protein, to estimate the probability of missing functions. To further boost the prediction accuracy, we extend dRW to dRW-kNN. dRW-kNN computes the semantic similarity between proteins based on the functional annotations of proteins; it then predicts functions based on the functions estimated by dRW, together with the functions associated with the k nearest proteins. Our proposed models can predict two kinds of missing functions: (i) the ones that are missing for a protein but associated with other proteins of interest; (ii) the ones that are not available for any protein of interest, but exist in the GO hierarchy. Experimental results on the proteins of Yeast and Human show that dRW and dRW-kNN can replenish functions more accurately than other related approaches, especially for sparse functions associated with no more than 10 proteins. Conclusion The empirical study shows that the semantic similarity between GO terms and the ontology hierarchy play important roles in predicting protein function. The proposed dRW and dRW-kNN can serve as tools for replenishing functions of partially annotated proteins. 
Electronic supplementary material The online version of this article (doi:10.1186/s12859-015-0713-y) contains supplementary material, which is available to authorized users.
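The notion of a downward random walk with restart on the GO directed acyclic graph can be sketched as follows. This is a minimal illustration under several assumptions not stated in the abstract: the walker moves to child terms with uniform probability, stays in place at leaf terms, and restarts at the protein's known terms with a fixed probability. The term names and the `downward_rwr` function are hypothetical, and the dRW-kNN extension is not shown.

```python
def downward_rwr(children, seeds, restart=0.5, steps=50):
    """Estimate term scores by walking from seed terms toward their descendants."""
    terms = set(children) | {c for kids in children.values() for c in kids}
    start = {t: (1.0 / len(seeds) if t in seeds else 0.0) for t in terms}
    prob = dict(start)
    for _ in range(steps):
        # With probability `restart` the walker jumps back to a seed term.
        nxt = {t: restart * start[t] for t in terms}
        for term, mass in prob.items():
            kids = children.get(term) or [term]  # assumption: leaves self-loop
            share = (1.0 - restart) * mass / len(kids)
            for kid in kids:
                nxt[kid] += share
        prob = nxt
    return prob


# Hypothetical ontology fragment: parent term -> child terms.
children = {"root": ["A", "B"], "A": ["A1", "A2"], "B": ["B1"]}
scores = downward_rwr(children, seeds={"A"})

# Candidate missing functions: unannotated terms that received probability mass.
missing = sorted(t for t, s in scores.items() if t not in {"A"} and s > 0)
```

Only descendants of the annotated term `A` receive any score here, which reflects the downward direction of the walk: more specific terms become candidates, while ancestors and unrelated branches do not.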
Affiliation(s)
- Guoxian Yu
- College of Computer and Information Sciences, Southwest University, Beibei, Chongqing, China; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun, China.
| | - Hailong Zhu
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong.
| | | | - Jiming Liu
- Department of Computer Science, Hong Kong Baptist University, Hong Kong, Hong Kong.
|
24
|
Foulger RE, Osumi-Sutherland D, McIntosh BK, Hulo C, Masson P, Poux S, Le Mercier P, Lomax J. Representing virus-host interactions and other multi-organism processes in the Gene Ontology. BMC Microbiol 2015; 15:146. [PMID: 26215368 PMCID: PMC4517558 DOI: 10.1186/s12866-015-0481-x] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 12/17/2014] [Accepted: 07/10/2015] [Indexed: 01/25/2023] Open
Abstract
BACKGROUND The Gene Ontology project is a collaborative effort to provide descriptions of gene products in a consistent and computable language, and in a species-independent manner. The Gene Ontology is designed to be applicable to all organisms but up to now has been largely under-utilized for prokaryotes and viruses, in part because of a lack of appropriate ontology terms. METHODS To address this issue, we have developed a set of Gene Ontology classes that are applicable to microbes and their hosts, improving both coverage and quality in this area of the Gene Ontology. Describing microbial and viral gene products brings with it the additional challenge of capturing both the host and the microbe. Recognising this, we have worked closely with annotation groups to test and optimize the GO classes, and we describe here a set of annotation guidelines that allow the controlled description of two interacting organisms. CONCLUSIONS Building on the microbial resources already in existence such as ViralZone, UniProtKB keywords and MeGO, this project provides an integrated ontology to describe interactions between microbial species and their hosts, with mappings to the external resources above. Housing this information within the freely-accessible Gene Ontology project allows the classes and annotation structure to be utilized by a large community of biologists and users.
Affiliation(s)
- R E Foulger
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - D Osumi-Sutherland
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
| | - B K McIntosh
- Department of Biochemistry and Biophysics, Texas Agrilife Research, Texas A&M University, College Station, TX, 77843, USA.
| | - C Hulo
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel-Servet, 1211, Geneva 4, Switzerland.
| | - P Masson
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel-Servet, 1211, Geneva 4, Switzerland.
| | - S Poux
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel-Servet, 1211, Geneva 4, Switzerland.
| | - P Le Mercier
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, 1 Rue Michel-Servet, 1211, Geneva 4, Switzerland.
| | - J Lomax
- European Molecular Biology Laboratory, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridge, CB10 1SD, UK.
|
25
|
Masseroli M, Canakoglu A, Quigliatti M. Detection of gene annotations and protein-protein interaction associated disorders through transitive relationships between integrated annotations. BMC Genomics 2015; 16:S5. [PMID: 26046679 PMCID: PMC4460591 DOI: 10.1186/1471-2164-16-s6-s5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Indexed: 11/16/2022] Open
Abstract
Background Increasingly large amounts of heterogeneous and valuable controlled biomolecular annotations are available, but they are far from exhaustive and scattered across many databases. Several annotation integration and prediction approaches have been proposed, but these issues are still unsolved. We previously created a Genomic and Proteomic Knowledge Base (GPKB) that efficiently integrates many distributed biomolecular annotation and interaction data of several organisms, including 32,956,102 gene annotations, 273,522,470 protein annotations and 277,095 protein-protein interactions (PPIs). Results By comprehensively leveraging transitive relationships defined by the numerous association data integrated in GPKB, we developed a software procedure that effectively detects and supplements consistent biomolecular annotations not present in the integrated sources. According to some defined logic rules, it does so only when the semantic type of data and of their relationships, as well as the cardinality of the relationships, allow identifying molecular biology compliant annotations. Thanks to controlled consistency and quality enforced on data integrated in GPKB, and to the procedures used to avoid error propagation during their automatic processing, we could reliably identify many annotations, which we integrated in GPKB. They comprise 3,144 gene to pathway and 21,942 gene to biological function annotations of many organisms, and 1,027 candidate associations between 317 genetic disorders and 782 human PPIs. Overall estimated recall and precision of our approach were 90.56 % and 96.61 %, respectively. Co-functional evaluation of genes with known function showed high functional similarity between genes with newly detected and known annotation to the same pathway; considering also the newly detected gene functional annotations enhanced such functional similarity, which resembled the one existing between genes known to be annotated to the same pathway. 
Strong evidence was also found in the literature for the candidate associations detected between Cystic fibrosis disorder and the PPIs between the CFTR_HUMAN, DERL1_HUMAN, RNF5_HUMAN, AHSA1_HUMAN and GOPC_HUMAN proteins, and between the CHIP_HUMAN and HSP7C_HUMAN proteins. Conclusions Although identified gene annotations and PPI-genetic disorder candidate associations require biological validation, our approach intrinsically provides their in silico evidence based on available data. Public availability within the GPKB (http://www.bioinformatics.deib.polimi.it/GPKB/) of all identified and integrated annotations offers a valuable resource fostering new biomedical-molecular knowledge discoveries.
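The core idea of deriving new annotations through transitive relationships can be shown with a small sketch: if a gene encodes a protein and that protein is annotated to a pathway, the gene becomes a candidate for the same pathway annotation. The example below is illustrative only. It omits the logic rules on semantic types and relationship cardinality that the abstract says the actual procedure applies, and all identifiers are hypothetical.

```python
def transitive_annotations(gene_to_protein, protein_to_pathway, known):
    """Infer gene-to-pathway pairs via gene -> protein -> pathway chains."""
    inferred = set()
    for gene, proteins in gene_to_protein.items():
        for protein in proteins:
            for pathway in protein_to_pathway.get(protein, ()):
                # Keep only annotations absent from the integrated sources.
                if (gene, pathway) not in known:
                    inferred.add((gene, pathway))
    return sorted(inferred)


# Hypothetical integrated association data.
gene_to_protein = {"geneA": {"protA"}, "geneB": {"protB"}}
protein_to_pathway = {"protA": {"pathway1", "pathway2"}, "protB": {"pathway1"}}
known = {("geneA", "pathway1")}

new_annotations = transitive_annotations(gene_to_protein, protein_to_pathway, known)
```

A production procedure would also need safeguards against error propagation, for instance by restricting which relationship types may be chained, which is what the rules mentioned in the abstract are for.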
|
26
|
Abstract
Background Gene function annotations, which are associations between a gene and a term of a controlled vocabulary describing gene functional features, are of paramount importance in modern biology. Datasets of these annotations, such as the ones provided by the Gene Ontology Consortium, are used to design novel biological experiments and interpret their results. Despite their importance, these sources of information have some known issues. They are incomplete, since biological knowledge is far from definitive and evolves rapidly, and some erroneous annotations may be present. Since the curation of novel annotations is costly in both money and time, computational tools that can reliably predict likely annotations, and thus speed up the discovery of new gene annotations, are very useful. Methods We used a set of computational algorithms and weighting schemes to infer novel gene annotations from a set of known ones. We used the latent semantic analysis approach, implementing two popular algorithms (Latent Semantic Indexing and Probabilistic Latent Semantic Analysis), and propose a novel method, the Semantic IMproved Latent Semantic Analysis, which adds a clustering step on the set of considered genes. Furthermore, we propose improving these algorithms by weighting the annotations in the input set. Results We tested our methods and their weighted variants on the Gene Ontology annotation sets of the genes of three model organisms (Bos taurus, Danio rerio and Drosophila melanogaster). The methods proved able to predict novel gene annotations, and the weighting procedures led to a valuable improvement, although the results vary with the size of the input annotation set and the algorithm considered. Conclusions Of the three methods considered, the Semantic IMproved Latent Semantic Analysis provides the best results. In particular, when coupled with a proper weighting policy, it is able to predict a significant number of novel annotations, proving to be a helpful tool in supporting scientists in the curation of gene functional annotations.
27
Youngs N, Penfold-Brown D, Bonneau R, Shasha D. Negative example selection for protein function prediction: the NoGO database. PLoS Comput Biol 2014; 10:e1003644. [PMID: 24922051 PMCID: PMC4055410 DOI: 10.1371/journal.pcbi.1003644] [Citation(s) in RCA: 29] [Impact Index Per Article: 2.9] [Received: 11/24/2013] [Accepted: 04/08/2014] [Indexed: 12/28/2022] Open
Abstract
Negative examples – genes that are known not to carry out a given protein function – are rarely recorded in genome and proteome annotation databases, such as the Gene Ontology database. Negative examples are required, however, for several of the most powerful machine learning methods for integrative protein function prediction. Most protein function prediction efforts have relied on a variety of heuristics for the choice of negative examples. Determining the accuracy of methods for negative example prediction is itself a non-trivial task, given that the Open World Assumption as applied to gene annotations rules out many traditional validation metrics. We present a rigorous comparison of these heuristics, utilizing a temporal holdout, and a novel evaluation strategy for negative examples. We add to this comparison several algorithms adapted from Positive-Unlabeled learning scenarios in text-classification, which are the current state-of-the-art methods for generating negative examples in low-density annotation contexts. Lastly, we present two novel algorithms of our own construction, one based on empirical conditional probability, and the other using topic modeling applied to genes and annotations. We demonstrate that our algorithms achieve significantly fewer incorrect negative example predictions than the current state of the art, using multiple benchmarks covering multiple organisms. Our methods may be applied to generate negative examples for any type of method that deals with protein function, and to this end we provide a database of negative examples in several well-studied organisms, for general use (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).
Many machine learning methods have been applied to the task of predicting the biological function of proteins based on a variety of available data. The majority of these methods require negative examples: proteins that are known not to perform a function, in order to achieve meaningful predictions, but negative examples are often not available. In addition, past heuristic methods for negative example selection suffer from a high error rate. Here, we rigorously compare two novel algorithms against past heuristics, as well as some algorithms adapted from a similar task in text-classification. Through this comparison, performed on several different benchmarks, we demonstrate that our algorithms make significantly fewer mistakes when predicting negative examples. We also provide a database of negative examples for general use in machine learning for protein function prediction (The NoGO database, available at: bonneaulab.bio.nyu.edu/nogo.html).
Affiliation(s)
- Noah Youngs
  - Department of Computer Science, New York University, New York, New York, United States of America
- Duncan Penfold-Brown
  - Social Media and Political Participation Lab, New York University, New York, New York, United States of America
- Richard Bonneau
  - Department of Computer Science, New York University, New York, New York, United States of America
  - Department of Biology, New York University, New York, New York, United States of America
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
  - * E-mail: (RB); (DS)
- Dennis Shasha
  - Department of Computer Science, New York University, New York, New York, United States of America
  - Center for Genomics and Systems Biology, Department of Biology, New York University, New York, New York, United States of America
  - * E-mail: (RB); (DS)
28
Kuppuswamy U, Ananthasubramanian S, Wang Y, Balakrishnan N, Ganapathiraju MK. Predicting gene ontology annotations of orphan GWAS genes using protein-protein interactions. Algorithms Mol Biol 2014; 9:10. [PMID: 24708602 PMCID: PMC4124845 DOI: 10.1186/1748-7188-9-10] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 01/06/2013] [Accepted: 03/11/2014] [Indexed: 01/30/2023] Open
Abstract
Background The number of genome-wide association studies (GWAS) has increased rapidly in the past couple of years, resulting in the identification of genes associated with different diseases. The next step in translating these findings into biomedically useful information is to find out the mechanism of action of these genes. However, GWAS often implicate genes whose functions are currently unknown; for example, MYEOV, ANKLE1, TMEM45B and ORAOV1 are found to be associated with breast cancer, but their molecular function is unknown. Results We carried out Bayesian inference of Gene Ontology (GO) term annotations of genes by employing the directed acyclic graph structure of GO and the network of protein-protein interactions (PPIs). The approach is designed on the premise that two proteins that interact biophysically would be in physical proximity of each other, would possess complementary molecular functions, and would play a role in related biological processes. Predicted GO terms were ranked according to their relative association scores and the approach was evaluated quantitatively by plotting the precision versus recall values and F-scores (the harmonic mean of precision and recall) versus varying thresholds. Precisions of ~58% and ~40% for localization and functions of proteins, respectively, were determined at a threshold of ~30 (top 30 GO terms in the ranked list). Comparison with function prediction based on semantic similarity among nodes in an ontology and incorporation of those similarities in a k-nearest neighbor classifier confirmed that our results compared favorably. Conclusions This approach was applied to predict the cellular component and molecular function GO terms of all human proteins that have interacting partners possessing at least one known GO annotation. The list of predictions is available at http://severus.dbmi.pitt.edu/engo/GOPRED.html. We present the algorithm, evaluations and the results of the computational predictions, especially for genes identified in GWAS to be associated with diseases, which are of translational interest.
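The evaluation described in this abstract (precision, recall and their harmonic mean at a rank threshold) can be illustrated with a minimal Python sketch. It is not the authors' code: the function name `precision_recall_f1`, the GO identifiers in the usage note and the top-k cutoff convention are assumptions made for illustration only.

```python
def precision_recall_f1(ranked_terms, true_terms, k):
    """Score the top-k predicted GO terms against a set of known terms.

    ranked_terms: predicted terms, best first (hypothetical input format).
    true_terms:   set of terms regarded as correct for the protein.
    k:            rank threshold (for example, the top 30 terms).
    """
    top = ranked_terms[:k]
    hits = sum(1 for term in top if term in true_terms)
    precision = hits / len(top) if top else 0.0
    recall = hits / len(true_terms) if true_terms else 0.0
    # F-score is the harmonic mean of precision and recall.
    f1 = 2 * precision * recall / (precision + recall) if hits else 0.0
    return precision, recall, f1
```

For instance, calling `precision_recall_f1(["GO:1", "GO:2", "GO:3", "GO:4"], {"GO:1", "GO:3", "GO:5"}, 2)` with these made-up identifiers scores one hit among two predictions against three known terms.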
29
Glass K, Girvan M. Annotation enrichment analysis: an alternative method for evaluating the functional properties of gene sets. Sci Rep 2014; 4:4191. [PMID: 24569707 PMCID: PMC3935204 DOI: 10.1038/srep04191] [Citation(s) in RCA: 43] [Impact Index Per Article: 4.3] [Received: 04/16/2013] [Accepted: 01/28/2014] [Indexed: 12/18/2022] Open
Abstract
Gene annotation databases (compendiums maintained by the scientific community that describe the biological functions performed by individual genes) are commonly used to evaluate the functional properties of experimentally derived gene sets. Overlap statistics, such as Fisher's exact test (FET), are often employed to assess these associations, but do not account for non-uniformity in the number of genes annotated to individual functions or the number of functions associated with individual genes. We find FET is strongly biased toward over-estimating overlap significance if a gene set has an unusually high number of annotations. To correct for these biases, we develop Annotation Enrichment Analysis (AEA), which properly accounts for the non-uniformity of annotations. We show that AEA is able to identify biologically meaningful functional enrichments that are obscured by numerous false-positive enrichment scores in FET, and we therefore suggest it be used to more accurately assess the biological properties of gene sets.
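For readers unfamiliar with the baseline being criticized, the one-sided overlap test can be sketched in a few lines of Python. This illustrates only the standard hypergeometric tail behind Fisher's exact test, not AEA itself; the function and parameter names are hypothetical.

```python
from math import comb


def overlap_p_value(n_universe, n_term, n_set, n_overlap):
    """One-sided Fisher / hypergeometric p-value P(X >= n_overlap).

    n_universe: number of genes in the background.
    n_term:     genes annotated to the function of interest.
    n_set:      genes in the experimentally derived set.
    n_overlap:  genes in the set that carry the annotation.
    """
    upper = min(n_term, n_set)
    total = comb(n_universe, n_set)
    # Sum the probability of every overlap at least as large as observed.
    tail = sum(
        comb(n_term, k) * comb(n_universe - n_term, n_set - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total
```

The test treats every gene as equally likely to be drawn, which is the uniformity assumption the abstract identifies as a source of bias when genes differ widely in their number of annotations.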
Affiliation(s)
- Kimberly Glass
  - [1] Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA
  - [2] Department of Biostatistics, Harvard School of Public Health, Boston, MA, USA
  - [3] Department of Physics, University of Maryland, College Park, MD, USA
- Michelle Girvan
  - [1] Department of Physics, University of Maryland, College Park, MD, USA
  - [2] Institute for Physical Science and Technology, University of Maryland, College Park, MD, USA
  - [3] Santa Fe Institute, Santa Fe, NM
30
Suárez-Obando F, Camacho Sánchez J. [Standards in Medical Informatics: Fundamentals and Applications]. REVISTA COLOMBIANA DE PSIQUIATRIA 2013; 42:295-302. [PMID: 26572951 DOI: 10.1016/s0034-7450(13)70023-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/05/2012] [Accepted: 01/23/2013] [Indexed: 06/05/2023]
Abstract
The use of computers in medical practice has enabled novel forms of communication to be developed in health care. The optimization of communication processes is achieved through the use of standards to harmonize the exchange of information and provide a common language for all those involved. This article describes the concept of a standard applied to medical informatics and its importance in the development of various applications, such as computational representation of medical knowledge, disease classification and coding systems, medical literature searches and integration of biological and clinical sciences.
Affiliation(s)
- Fernando Suárez-Obando
  - Instituto de Genética Humana, Facultad de Medicina, Pontificia Universidad Javeriana, Bogotá, D.C., Colombia; Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States.
- Jhon Camacho Sánchez
  - Department of Biomedical Informatics, University of Pittsburgh, Pittsburgh, Pennsylvania, United States; Departamento de Epidemiología y Bioestadística, Facultad de Medicina, Pontificia Universidad Javeriana, Bogotá, D.C., Colombia
31
Genome-wide gene expression profiling of stress response in a spinal cord clip compression injury model. BMC Genomics 2013; 14:583. [PMID: 23984903 PMCID: PMC3846681 DOI: 10.1186/1471-2164-14-583] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.5] [Received: 03/27/2013] [Accepted: 08/13/2013] [Indexed: 12/23/2022] Open
Abstract
Background The aneurysm clip impact-compression model of spinal cord injury (SCI) is a standard injury model in animals that closely mimics the primary mechanism of most human injuries: acute impact and persisting compression. Its histo-pathological and behavioural outcomes are extensively similar to human SCI. To understand the distinct molecular events underlying this injury model we analyzed global mRNA abundance changes during the acute, subacute and chronic stages of a moderate to severe injury to the rat spinal cord. Results Time-series expression analyses resulted in clustering of the majority of deregulated transcripts into eight statistically significant expression profiles. Systematic application of Gene Ontology (GO) enrichment pathway analysis allowed inference of biological processes participating in SCI pathology. Temporal analysis identified events specific to and common between acute, subacute and chronic time-points. Processes common to all phases of injury include blood coagulation, cellular extravasation, leukocyte cell-cell adhesion, the integrin-mediated signaling pathway, cytokine production and secretion, neutrophil chemotaxis, phagocytosis, response to hypoxia and reactive oxygen species, angiogenesis, apoptosis, inflammatory processes and ossification. Importantly, various elements of adaptive and induced innate immune responses span, not only the acute and subacute phases, but also persist throughout the chronic phase of SCI. Induced innate responses, such as Toll-like receptor signaling, are more active during the acute phase but persist throughout the chronic phase. However, adaptive immune response processes such as B and T cell activation, proliferation, and migration, T cell differentiation, B and T cell receptor-mediated signaling, and B cell- and immunoglobulin-mediated immune response become more significant during the chronic phase. 
Conclusions This analysis showed that, surprisingly, the diverse series of molecular events that occur in the acute and subacute stages persist into the chronic stage of SCI. The strong agreement between our results and previous findings suggests that our analytical approach will be useful in revealing other biological processes and genes contributing to SCI pathology.
32
Hodgins KA, Lai Z, Nurkowski K, Huang J, Rieseberg LH. The molecular basis of invasiveness: differences in gene expression of native and introduced common ragweed (Ambrosia artemisiifolia) in stressful and benign environments. Mol Ecol 2013; 22:2496-510. [PMID: 23294156 DOI: 10.1111/mec.12179] [Citation(s) in RCA: 48] [Impact Index Per Article: 4.4] [Received: 09/19/2012] [Revised: 11/14/2012] [Accepted: 11/21/2012] [Indexed: 11/28/2022]
Abstract
Although the evolutionary and ecological processes that contribute to plant invasion have been the focus of much research, investigation into the molecular basis of invasion is just beginning. Common ragweed (Ambrosia artemisiifolia) is an annual weed native to North America and has been introduced to Europe where it has become invasive. Using a custom-designed NimbleGen oligoarray, we examined differences in gene expression between five native and six introduced populations of common ragweed in three different environments (control, light stress and nutrient stress), as well as two different time points. We identified candidate genes that may contribute to invasiveness in common ragweed based on differences in expression between native and introduced populations from Europe. Specifically, we found 180 genes where range explained a significant proportion of the variation in gene expression and a further 103 genes with a significant range by treatment interaction. Several of these genes are potentially involved in the metabolism of secondary compounds, stress response and the detoxification of xenobiotics. Previously, we found more rapid growth and greater reproductive success in introduced populations, particularly in benign and competitive (light stress) environments, and many of these candidate genes potentially underlie these growth differences. We also found expression differences among populations within each range, reflecting either local adaptation or neutral processes, although no associations with climate or latitude were identified. These data provide a first step in identifying genes that are involved with introduction success in an aggressive annual weed.
Affiliation(s)
- Kathryn A Hodgins
  - Department of Botany and Biodiversity Research Centre, University of British Columbia, Vancouver, BC, Canada.
33
Zhang XF, Dai DQ. A framework for incorporating functional interrelationships into protein function prediction algorithms. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2012; 9:740-753. [PMID: 22084148 DOI: 10.1109/tcbb.2011.148] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.3] [Indexed: 05/31/2023]
Abstract
The functional annotation of proteins is one of the most important tasks in the post-genomic era. Although many computational approaches have been developed in recent years to predict protein function, most of these traditional algorithms do not take interrelationships among functional terms into account, such as the tendency of different GO terms to co-annotate the same proteins. In this study, we propose a new functional similarity measure in the form of a Jaccard coefficient to quantify these interrelationships and also develop a framework for incorporating GO term similarity into the protein function prediction process. The experimental results of cross-validation on S. cerevisiae and Homo sapiens data sets demonstrate that our method is able to improve the performance of protein function prediction. In addition, we find that small terms associated with only a few proteins benefit more than large ones when functional interrelationships are considered. We also compare our similarity measure with two other widely used measures, and the results indicate that, when incorporated into function prediction algorithms, our proposed measure is more effective. Experimental results also show that our algorithms outperform two previous competing algorithms, which also take functional interrelationships into account, in prediction accuracy. Finally, we show that our method is robust to the current incompleteness of annotations in the database. These results give new insights into the importance of functional interrelationships in protein function prediction.
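A Jaccard coefficient between two GO terms can be computed from the sets of proteins each term annotates. The Python sketch below shows only that basic quantity; the exact definition used in the paper may differ, and the names `jaccard` and `term_similarity` as well as the dictionary input format are assumptions made for illustration.

```python
def jaccard(a, b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def term_similarity(annotations):
    """Pairwise similarity of terms from their annotated proteins.

    annotations: dict mapping a term ID to an iterable of protein IDs
    (a hypothetical input format). Returns {(term_s, term_t): score}
    for every unordered pair, with term_s < term_t.
    """
    terms = sorted(annotations)
    return {
        (s, t): jaccard(annotations[s], annotations[t])
        for i, s in enumerate(terms)
        for t in terms[i + 1:]
    }
```

Two terms that share most of their annotated proteins score close to 1, while terms with disjoint protein sets score 0.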
Affiliation(s)
- Xiao-Fei Zhang
  - Center for Computer Vision and Department of Mathematics, Sun Yat-Sen University, Guangzhou 510275, China.
34
Abbaraju NV, Boutaghou MN, Townley IK, Zhang Q, Wang G, Cole RB, Rees BB. Analysis of tissue proteomes of the Gulf killifish, Fundulus grandis, by 2D electrophoresis and MALDI-TOF/TOF mass spectrometry. Integr Comp Biol 2012; 52:626-35. [PMID: 22537935 DOI: 10.1093/icb/ics063] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.4] [Indexed: 01/02/2023] Open
Abstract
The Gulf killifish, Fundulus grandis, is a small teleost fish that inhabits marshes of the Gulf of Mexico and demonstrates high tolerance of environmental variation, making it an excellent subject for the study of physiological and molecular adaptations to environmental stress. In the present study, two-dimensional (2D) gel electrophoresis and matrix-assisted laser desorption/ionization time-of-flight tandem mass spectrometry were used to resolve and identify proteins from five tissues: skeletal muscle, liver, brain, heart, and gill. Of 864 protein features excised from 2D gels, 424 proteins were identified, corresponding to a 49% identification rate. For any given tissue, several protein features were identified as the same protein, resulting in a total of 254 nonredundant proteins. These nonredundant proteins were categorized into a total of 11 molecular functions, including catalytic activity, structural molecule, binding, and transport. In all tissues, catalytic activity and binding were the most highly represented molecular functions. Comparing across the tissues, proteome coverage was lowest in skeletal muscle, due to a combination of a low number of gel spots excised for analysis and a high redundancy of identifications among these spots. Nevertheless, the identification of a substantial number of proteins with high statistical confidence from other tissues suggests that F. grandis may serve as a model fish for future studies of environmental proteomics and ultimately help to elucidate proteomic responses of fish and other vertebrates to environmental stress.
Affiliation(s)
- Naga V Abbaraju
  - Department of Chemistry, University of New Orleans, New Orleans, LA 70148, USA.
35
Iacucci E, Zingg HH, Perkins TJ. Methods for Determining the Statistical Significance of Enrichment or Depletion of Gene Ontology Classifications under Weighted Membership. Front Genet 2012; 3:24. [PMID: 22375144 PMCID: PMC3284693 DOI: 10.3389/fgene.2012.00024] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/04/2011] [Accepted: 02/06/2012] [Indexed: 11/20/2022] Open
Abstract
High-throughput molecular biology studies, such as microarray assays of gene expression, two-hybrid experiments for detecting protein interactions, or ChIP-Seq experiments for transcription factor binding, often result in an “interesting” set of genes – say, genes that are co-expressed or bound by the same factor. One way of understanding the biological meaning of such a set is to consider what processes or functions, as defined in an ontology, are over-represented (enriched) or under-represented (depleted) among genes in the set. Usually, the significance of enrichment or depletion scores is based on simple statistical models and on the membership of genes in different classifications. We consider the more general problem of computing p-values for arbitrary integer additive statistics, or weighted membership functions. Such membership functions can be used to represent, for example, prior knowledge on the role of certain genes or classifications, differential importance of different classifications or genes to the experimenter, hierarchical relationships between classifications, or different degrees of interestingness or evidence for specific genes. We describe a generic dynamic programming algorithm that can compute exact p-values for arbitrary integer additive statistics. We also describe several optimizations for important special cases, which can provide orders-of-magnitude speed up in the computations. We apply our methods to datasets describing oxidative phosphorylation and parturition and compare p-values based on computations of several different statistics for measuring enrichment. We find major differences between p-values resulting from these statistics, and that some statistics recover “gold standard” annotations of the data better than others. Our work establishes a theoretical and algorithmic basis for far richer notions of enrichment or depletion of gene sets with respect to gene ontologies than has previously been available.
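The idea of an exact p-value for an integer additive statistic can be illustrated with a small dynamic program. The Python sketch below is not the authors' algorithm: it assumes non-negative integer weights, a null model in which a gene set of fixed size is drawn uniformly at random, and an upper-tail (enrichment) test; the function name and parameters are hypothetical.

```python
from math import comb


def exact_p_value(weights, set_size, observed):
    """P(sum of weights in a random size-`set_size` subset >= observed).

    weights:  one non-negative integer weight per gene in the universe.
    set_size: number of genes in the set of interest.
    observed: the statistic computed for the real gene set.
    """
    max_sum = sum(sorted(weights, reverse=True)[:set_size])
    # counts[j][s] = number of size-j subsets whose weights sum to s.
    counts = [[0] * (max_sum + 1) for _ in range(set_size + 1)]
    counts[0][0] = 1
    for w in weights:
        # Descend over j so each gene is added to a subset at most once.
        for j in range(set_size, 0, -1):
            row, prev = counts[j], counts[j - 1]
            for s in range(w, max_sum + 1):
                row[s] += prev[s - w]
    total = comb(len(weights), set_size)
    return sum(counts[set_size][max(observed, 0):]) / total
```

With 0/1 weights this reduces to the familiar hypergeometric tail; graded weights let the same recursion handle weighted membership at a cost proportional to the number of genes times the set size times the largest attainable sum.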
Affiliation(s)
- Ernesto Iacucci
  - Department of Electrical Engineering ESAT-SCD, Katholieke Universiteit Leuven, Leuven, Belgium
36
Glass K, Ott E, Losert W, Girvan M. Implications of functional similarity for gene regulatory interactions. J R Soc Interface 2012; 9:1625-36. [PMID: 22298814 DOI: 10.1098/rsif.2011.0585] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Indexed: 11/12/2022] Open
Abstract
If one gene regulates another, those two genes are likely to be involved in many of the same biological functions. Conversely, shared biological function may be suggestive of the existence and nature of a regulatory interaction. With this in mind, we develop a measure of functional similarity between genes based on annotations made to the Gene Ontology in which the magnitude of their functional relationship is also indicative of a regulatory relationship. In contrast to other measures that have previously been used to quantify the functional similarity between genes, our measure scales the strength of any shared functional annotation by the frequency of that function's appearance across the entire set of annotations. We apply our method to both Escherichia coli and Saccharomyces cerevisiae gene annotations and find that the strength of our scaled similarity measure is more predictive of known regulatory interactions than previously published measures of functional similarity. In addition, we observe that the strength of the scaled similarity measure is correlated with the structural importance of links in the known regulatory network. By contrast, other measures of functional similarity are not indicative of any structural importance in the regulatory network. We therefore conclude that adequately adjusting for the frequency of shared biological functions is important in the construction of a functional similarity measure aimed at elucidating the existence and nature of regulatory interactions. We also compare the performance of the scaled similarity with a high-throughput method for determining regulatory interactions from gene expression data and observe that the ontology-based approach identifies a different subset of regulatory interactions compared with the gene expression approach. 
We show that combining predictions from the scaled similarity with those from the reconstruction algorithm leads to a significant improvement in the accuracy of the reconstructed network.
Affiliation(s)
- Kimberly Glass
  - Department of Biostatistics and Computational Biology, Dana-Farber Cancer Institute, Boston, MA, USA.
37
Genomic Annotation Prediction Based on Integrated Information. COMPUTATIONAL INTELLIGENCE METHODS FOR BIOINFORMATICS AND BIOSTATISTICS 2012. [DOI: 10.1007/978-3-642-35686-5_20] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 01/11/2023]
38
Quantification of protein group coherence and pathway assignment using functional association. BMC Bioinformatics 2011; 12:373. [PMID: 21929787 PMCID: PMC3189934 DOI: 10.1186/1471-2105-12-373] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Received: 02/16/2011] [Accepted: 09/19/2011] [Indexed: 11/11/2022] Open
Abstract
Background Genomics and proteomics experiments produce a large amount of data that are awaiting functional elucidation. An important step in analyzing such data is to identify functional units, which consist of proteins that play coherent roles to carry out the function. Importantly, functional coherence is not identical with functional similarity. For example, proteins in the same pathway may not share the same Gene Ontology (GO) terms, but they work in a coordinated fashion so that the aimed function can be performed. Thus, simply applying existing functional similarity measures might not be the best solution to identify functional units in omics data. Results We have designed two scores for quantifying the functional coherence by considering association of GO terms observed in two biological contexts, co-occurrences in protein annotations and co-mentions in literature in the PubMed database. The counted co-occurrences of GO terms were normalized in a similar fashion as the statistical amino acid contact potential is computed in the protein structure prediction field. We demonstrate that the developed scores can identify functionally coherent protein sets, i.e. proteins in the same pathways, co-localized proteins, and protein complexes, with statistically significant score values showing a better accuracy than existing functional similarity scores. The scores are also capable of detecting protein pairs that interact with each other. It is further shown that the functional coherence scores can accurately assign proteins to their respective pathways. Conclusion We have developed two scores which quantify the functional coherence of sets of proteins. The scores reflect the actual associations of GO terms observed either in protein annotations or in literature. 
It has been shown that they have the ability to accurately distinguish biologically relevant groups of proteins from random ones as well as a good discriminative power for detecting interacting pairs of proteins. The scores were further successfully applied for assigning proteins to pathways.
39
Hester SD, Johnstone AF, Boyes WK, Bushnell PJ, Shafer TJ. Acute toluene exposure alters expression of genes in the central nervous system associated with synaptic structure and function. Neurotoxicol Teratol 2011; 33:521-9. [DOI: 10.1016/j.ntt.2011.07.008] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.4] [Received: 04/19/2011] [Revised: 07/07/2011] [Accepted: 07/20/2011] [Indexed: 10/17/2022]
40
Deo RC, MacRae CA. The zebrafish: scalable in vivo modeling for systems biology. WILEY INTERDISCIPLINARY REVIEWS-SYSTEMS BIOLOGY AND MEDICINE 2010; 3:335-46. [PMID: 20882534 DOI: 10.1002/wsbm.117] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.7] [Indexed: 01/23/2023]
Abstract
The zebrafish offers a scalable vertebrate model for many areas of biologic investigation. There is substantial conservation of genetic and genomic features and, at a higher order, conservation of intermolecular networks, as well as physiologic systems and phenotypes. We highlight recent work demonstrating the extent of this homology, and efforts to develop high-throughput phenotyping strategies suited to genetic or chemical screening on a scale compatible with in vivo validation for systems biology. We discuss the implications of these approaches for functional annotation of the genome, elucidation of multicellular processes in vivo, and mechanistic exploration of hypotheses generated by a broad range of 'unbiased' 'omic technologies such as expression profiling and genome-wide association. Finally, we outline potential strategies for the application of the zebrafish to the systematic study of phenotypic architecture, disease heterogeneity and drug responses.
Affiliation(s)
- Rahul C Deo
  - Cardiology Division, Massachusetts General Hospital and Harvard Medical School, Boston, MA, USA
41
Yu H, Huang J, Qiao N, Green CD, Han JDJ. Evaluating diabetes and hypertension disease causality using mouse phenotypes. BMC SYSTEMS BIOLOGY 2010; 4:97. [PMID: 20642857 PMCID: PMC2917432 DOI: 10.1186/1752-0509-4-97] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Received: 03/03/2010] [Accepted: 07/20/2010] [Indexed: 01/11/2023]
Abstract
Background Genome-wide association studies (GWAS) have found hundreds of single nucleotide polymorphisms (SNPs) associated with common diseases. However, it is largely unknown what genes linked with the SNPs actually implicate disease causality. A definitive proof for disease causality can be demonstration of disease-like phenotypes through genetic perturbation of the genes or alleles, which is obviously a daunting task for complex diseases where only mammalian models can be used. Results Here we tapped the rich resource of mouse phenotype data and developed a method to quantify the probability that a gene perturbation causes the phenotypes of a disease. Using type II diabetes (T2D) and hypertension (HT) as study cases, we found that the genes, when perturbed, having high probability to cause T2D and HT phenotypes tend to be hubs in the interactome networks and are enriched for signaling pathways regulating metabolism but not metabolic pathways, even though the genes in these metabolic pathways are often the most significantly changed in expression levels in these diseases. Conclusions Compared to human genetic disease-based predictions, our mouse phenotype based predictors greatly increased the coverage while keeping a similarly high specificity. The disease phenotype probabilities given by our approach can be used to evaluate the likelihood of disease causality of disease-associated genes and genes surrounding disease-associated SNPs.
Affiliation(s)
- Hong Yu
- Chinese Academy of Sciences Key Laboratory of Molecular Developmental Biology, Center for Molecular Systems Biology, Institute of Genetics and Developmental Biology, Chinese Academy of Sciences, Lincui East Road, Beijing 100101, China
42
Bogdanov P, Singh AK. Molecular function prediction using neighborhood features. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010; 7:208-217. [PMID: 20431141 DOI: 10.1109/tcbb.2009.81] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.5] [Indexed: 05/29/2023]
Abstract
The recent advent of high-throughput methods has generated large amounts of gene interaction data. This has allowed the construction of genomewide networks. A significant number of genes in such networks remain uncharacterized and predicting the molecular function of these genes remains a major challenge. A number of existing techniques assume that genes with similar functions are topologically close in the network. Our hypothesis is that genes with similar functions observe similar annotation patterns in their neighborhood, regardless of the distance between them in the interaction network. We thus predict molecular functions of uncharacterized genes by comparing their functional neighborhoods to genes of known function. We propose a two-phase approach. First, we extract functional neighborhood features of a gene using Random Walks with Restarts. We then employ a KNN classifier to predict the function of uncharacterized genes based on the computed neighborhood features. We perform leave-one-out validation experiments on two S. cerevisiae interaction networks and show significant improvements over previous techniques. Our technique provides a natural control of the trade-off between accuracy and coverage of prediction. We further propose and evaluate prediction in sparse genomes by exploiting features from well-annotated genomes.
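The two-phase idea in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function names, the restart probability of 0.3, the unweighted adjacency-list graph, and the toy annotations are all assumptions made for the example. A random walk with restart gives each node a steady-state visiting probability relative to a seed gene, and summing those probabilities per annotation term yields a neighborhood feature vector that a nearest-neighbor classifier could then compare.

```python
def random_walk_with_restart(adjacency, seed, restart=0.3, tol=1e-12, max_iter=10000):
    """Steady-state visiting probabilities of a walk that restarts at `seed`.

    adjacency: dict mapping each node to a list of its neighbours
    (an unweighted graph; this representation is an assumption of the sketch).
    """
    nodes = sorted(adjacency)
    prob = {node: 0.0 for node in nodes}
    prob[seed] = 1.0
    for _ in range(max_iter):
        nxt = {node: 0.0 for node in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if not neighbours:
                # Isolated node: send its walking mass back to the seed.
                nxt[seed] += (1.0 - restart) * prob[node]
                continue
            share = (1.0 - restart) * prob[node] / len(neighbours)
            for nb in neighbours:
                nxt[nb] += share
        # The restart step returns a fixed fraction of the total mass to the seed.
        nxt[seed] += restart
        change = sum(abs(nxt[n] - prob[n]) for n in nodes)
        prob = nxt
        if change < tol:
            break
    return prob


def neighbourhood_features(prob, annotations, terms):
    """One feature per term: total visiting probability of nodes carrying it."""
    return [
        sum(p for node, p in prob.items() if term in annotations.get(node, ()))
        for term in terms
    ]


# Hypothetical toy network: a path a - b - c - d.
graph = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
annotations = {"b": {"kinase"}, "d": {"kinase", "transport"}}
visit = random_walk_with_restart(graph, "a")
features = neighbourhood_features(visit, annotations, ["kinase", "transport"])
```

Feature vectors built this way for genes of known function could be compared with the vector of an uncharacterized gene using any distance measure; the choice of distance and of k is left open here.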
Affiliation(s)
- Petko Bogdanov
- Department of Computer Science, University of California at Santa Barbara, Santa Barbara, CA 93106-5110, USA
43
Holmans P. Statistical methods for pathway analysis of genome-wide data for association with complex genetic traits. Advances in Genetics 2010; 72:141-79. [PMID: 21029852 DOI: 10.1016/b978-0-12-380862-2.00007-2] [Citation(s) in RCA: 78] [Impact Index Per Article: 5.6] [Indexed: 12/12/2022]
Abstract
A number of statistical methods have been developed to test for associations between pathways (collections of genes related biologically) and complex genetic traits. Pathway analysis methods were originally developed for analyzing gene expression data, but recently methods have been developed to perform pathway analysis on genome-wide association study (GWAS) data. The purpose of this review is to give an overview of these methods, enabling the reader to gain an understanding of what pathway analysis involves, and to select the method most suited to their purposes. This review describes the various types of statistical methods for pathway analysis, detailing the strengths and weaknesses of each. Factors influencing the power of pathway analyses, such as gene coverage and choice of pathways to analyze, are discussed, as well as various unresolved statistical issues. Finally, a list of computer programs for performing pathway analysis on genome-wide association data is provided.
Affiliation(s)
- Peter Holmans
- Biostatistics and Bioinformatics Unit, MRC Centre for Neuropsychiatric Genetics and Genomics, Department of Psychological Medicine and Neurology, Cardiff University School of Medicine, Heath Park, Cardiff, United Kingdom
44
Done B, Khatri P, Done A, Draghici S. Predicting novel human gene ontology annotations using semantic analysis. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2010; 7:91-9. [PMID: 20150671 PMCID: PMC3712327 DOI: 10.1109/tcbb.2008.29] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Indexed: 05/16/2023]
Abstract
The correct interpretation of many molecular biology experiments depends in an essential way on the accuracy and consistency of the existing annotation databases. Such databases are meant to act as repositories for our biological knowledge as we acquire and refine it. Hence, by definition, they are incomplete at any given time. In this paper, we describe a technique that improves our previous method for predicting novel GO annotations by extracting implicit semantic relationships between genes and functions. In this work, we use a vector space model and a number of weighting schemes in addition to our previous latent semantic indexing approach. The technique described here is able to take into consideration the hierarchical structure of the Gene Ontology (GO) and can weight differently GO terms situated at different depths. The prediction abilities of 15 different weighting schemes are compared and evaluated. Nine such schemes were previously used in other problem domains, while six of them are introduced in this paper. The best weighting scheme was a novel scheme, n2tn. Out of the top 50 functional annotations predicted using this weighting scheme, we found support in the literature for 84 percent of them, while 6 percent of the predictions were contradicted by the existing literature. For the remaining 10 percent, we did not find any relevant publications to confirm or contradict the predictions. The n2tn weighting scheme also outperformed the simple binary scheme used in our previous approach.
45
What we can learn about Escherichia coli through application of Gene Ontology. Trends Microbiol 2009; 17:269-78. [PMID: 19576778 DOI: 10.1016/j.tim.2009.04.004] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Received: 11/12/2008] [Revised: 03/04/2009] [Accepted: 04/08/2009] [Indexed: 11/21/2022]
Abstract
How we classify the genes, products and complexes that are present or absent in genomes, transcriptomes, proteomes and other datasets helps us place biological objects into subsystems with common functions, see how molecular functions are used to implement biological processes and compare the biology of different species and strains. Gene Ontology (GO) is one of the most successful systems for classifying biological function. Although GO is widely used for eukaryotic genomics, it has not yet been widely used for bacterial systems. The potential applications of GO are currently limited by the need to improve the annotation of bacterial genomes with GO and to improve how prokaryotic biology is represented in the ontology. Here, we discuss why GO should be adopted by microbiologists, and describe recent efforts to build and maintain high-quality GO annotation for Escherichia coli as a model system.
46
Pandey G, Myers CL, Kumar V. Incorporating functional inter-relationships into protein function prediction algorithms. BMC Bioinformatics 2009; 10:142. [PMID: 19435516 PMCID: PMC2693438 DOI: 10.1186/1471-2105-10-142] [Citation(s) in RCA: 56] [Impact Index Per Article: 3.7] [Received: 10/15/2008] [Accepted: 05/12/2009] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Functional classification schemes (e.g. the Gene Ontology) that serve as the basis for annotation efforts in several organisms are often the source of gold standard information for computational efforts at supervised protein function prediction. While successful function prediction algorithms have been developed, few previous efforts have utilized more than the protein-to-functional class label information provided by such knowledge bases. For instance, the Gene Ontology not only captures protein annotations to a set of functional classes, but it also arranges these classes in a DAG-based hierarchy that captures rich inter-relationships between different classes. These inter-relationships present both opportunities, such as the potential for additional training examples for small classes from larger related classes, and challenges, such as a harder-to-learn distinction between similar GO terms, for standard classification-based approaches. RESULTS We propose a method to enhance the performance of classification-based protein function prediction algorithms by using these inter-relationships between the functional classes that make up functional classification schemes. Using a standard measure for evaluating the semantic similarity between nodes in an ontology, we quantify and incorporate these inter-relationships into the k-nearest neighbor classifier. We present experiments on several large genomic data sets, each of which is used for the modeling and prediction of over a hundred classes from the GO Biological Process ontology. The results show that this incorporation produces more accurate predictions for a large number of the functional classes considered, and also that the classes that benefit most from this approach are those containing the fewest members.
In addition, we show how our proposed framework can be used for integrating information from the entire GO hierarchy for improving the accuracy of predictions made over a set of base classes. Finally, we provide qualitative and quantitative evidence that this incorporation of functional inter-relationships enables the discovery of interesting biology in the form of novel functional annotations for several yeast proteins, such as Sna4, Rtn1 and Lin1. CONCLUSION We implemented and evaluated a methodology for incorporating interrelationships between functional classes into a standard classification-based protein function prediction algorithm. Our results show that this incorporation can help improve the accuracy of such algorithms, and help uncover novel biology in the form of previously unknown functional annotations. The complete source code, a sample data set and the additional files for this paper are available free of charge for non-commercial use at http://www.cs.umn.edu/vk/gaurav/functionalsimilarity/.
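One way to picture the incorporation of class inter-relationships into a k-nearest neighbor classifier is sketched below. It is an illustration only and should not be read as the authors' algorithm: the scoring rule (each neighbor contributes the best similarity between the target class and any of its own labels), the squared Euclidean distance, the function names, and the similarity table are all assumptions of this example, and the abstract does not say which semantic similarity measure was used.

```python
def class_similarity(sim, a, b):
    """Look up a symmetric class-to-class similarity in [0, 1]."""
    if a == b:
        return 1.0
    return sim.get((a, b), sim.get((b, a), 0.0))


def knn_class_scores(query, training, sim, k=3):
    """Score every class for `query` using its k nearest training examples.

    training: list of (feature_vector, set_of_class_labels) pairs.
    sim: dict mapping (class_a, class_b) to a similarity value; in practice
         this would come from an ontology-based measure (assumed given here).
    """
    def squared_distance(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    neighbours = sorted(training, key=lambda item: squared_distance(query, item[0]))[:k]
    classes = {label for _, labels in training for label in labels}
    scores = {}
    for target in classes:
        total = 0.0
        for _, labels in neighbours:
            # A neighbour labelled with a related class still lends partial support.
            total += max((class_similarity(sim, target, lab) for lab in labels), default=0.0)
        scores[target] = total / k
    return scores


# Hypothetical data: classes "A" and "B" are related, "C" is unrelated.
training = [((0.0, 0.0), {"A"}), ((0.0, 1.0), {"B"}), ((5.0, 5.0), {"C"})]
similarity = {("A", "B"): 0.5}
scores = knn_class_scores((0.0, 0.1), training, similarity, k=2)
```

With a plain kNN vote, a small class receives support only from neighbors carrying exactly that label; in this sketch, neighbors from related classes add fractional support, which matches the abstract's point that small classes stand to gain the most.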
Affiliation(s)
- Gaurav Pandey
- Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN, USA.
47
Fontana P, Cestaro A, Velasco R, Formentin E, Toppo S. Rapid annotation of anonymous sequences from genome projects using semantic similarities and a weighting scheme in gene ontology. PLoS One 2009; 4:e4619. [PMID: 19247487 PMCID: PMC2645684 DOI: 10.1371/journal.pone.0004619] [Citation(s) in RCA: 29] [Impact Index Per Article: 1.9] [Received: 11/09/2008] [Accepted: 01/09/2009] [Indexed: 11/22/2022] Open
Abstract
Background Large-scale sequencing projects have now become routine lab practice and this has led to the development of a new generation of tools involving function prediction methods, bringing the latter back to the fore. The advent of Gene Ontology, with its structured vocabulary and paradigm, has provided computational biologists with an appropriate means for this task. Methodology We present here a novel method called ARGOT (Annotation Retrieval of Gene Ontology Terms) that is able to quickly process thousands of sequences for functional inference. The tool exploits for the first time an integrated approach which combines clustering of GO terms, based on their semantic similarities, with a weighting scheme which assesses retrieved hits sharing a certain number of biological features with the sequence to be annotated. These hits may be obtained by different methods and in this work we have based ARGOT processing on BLAST results. Conclusions The extensive benchmark involved 10,000 protein sequences, the complete S. cerevisiae genome and a small subset of proteins for purposes of comparison with other available tools. The algorithm was shown to outperform existing methods and to be suitable for function prediction of single proteins due to its high degree of sensitivity, specificity and coverage.
Affiliation(s)
- Paolo Fontana
- FEM-IASMA Research Center, San Michele all'Adige (TN), Italy
- Stefano Toppo
- Department of Biological Chemistry, University of Padova, Padova, Italy
48
Sam LT, Mendonça EA, Li J, Blake J, Friedman C, Lussier YA. PhenoGO: an integrated resource for the multiscale mining of clinical and biological data. BMC Bioinformatics 2009; 10 Suppl 2:S8. [PMID: 19208196 PMCID: PMC2646241 DOI: 10.1186/1471-2105-10-s2-s8] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.8] [Indexed: 11/18/2022] Open
Abstract
The evolving complexity of genome-scale experiments has increasingly centralized the role of a highly computable, accurate, and comprehensive resource spanning multiple biological scales and viewpoints. To provide a resource to meet this need, we have significantly extended the PhenoGO database with gene-disease specific annotations and included an additional ten species. This computationally derived resource is primarily intended to provide phenotypic context (cell type, tissue, organ, and disease) for mining existing associations between gene products and GO terms specified in the Gene Ontology databases. Automated natural language processing (BioMedLEE) and computational ontology (PhenOS) methods were used to derive these relationships from the literature, expanding the database with information from ten additional species to include over 600,000 phenotypic contexts spanning eleven species from five GO annotation databases. A comprehensive evaluation of the mappings (n = 300) found precision (positive predictive value) at 85% and recall (sensitivity) at 76%. Phenotypes are encoded in general purpose ontologies such as Cell Ontology and the Unified Medical Language System, and in specialized ontologies such as the Mouse Anatomy and the Mammalian Phenotype Ontology. A web portal has also been developed, allowing for advanced filtering and querying of the database as well as download of the entire dataset.
Affiliation(s)
- Lee T Sam
- Center for Biomedical Informatics, Department of Medicine, The University of Chicago, Chicago, IL, USA.
49
Taher L, Ovcharenko I. Variable locus length in the human genome leads to ascertainment bias in functional inference for non-coding elements. Bioinformatics 2009; 25:578-84. [PMID: 19168912 PMCID: PMC2647827 DOI: 10.1093/bioinformatics/btp043] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Indexed: 12/31/2022]
Abstract
MOTIVATION Several functional gene annotation databases have been developed in recent years, and are widely used to infer the biological function of gene sets by scrutinizing the attributes that appear over- and underrepresented. However, this strategy is not directly applicable to the study of non-coding DNA, as the non-coding sequence span varies greatly among different gene loci in the human genome and longer loci have a higher likelihood of being selected purely by chance. Therefore, conclusions involving the function of non-coding elements that are drawn based on the annotation of neighboring genes are often biased. We assessed the systematic bias in several particular Gene Ontology (GO) categories using the standard hypergeometric test, by randomly sampling non-coding elements from the human genome and inferring their function based on the functional annotation of the closest genes. While no category is expected to be significantly over- or underrepresented for a random selection of elements, categories such as 'cell adhesion', 'nervous system development' and 'transcription factor activities' appeared to be systematically overrepresented, while others, such as 'olfactory receptor activity', appeared underrepresented. RESULTS Our results suggest that functional inference for non-coding elements using gene annotation databases requires a special correction. We introduce a set of correction coefficients for the probabilities of the GO categories that accounts for the variability in the length of the non-coding DNA across different loci and effectively eliminates the ascertainment bias from the functional characterization of non-coding elements. Our approach can be easily generalized to any other gene annotation database.
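For readers unfamiliar with the standard hypergeometric test mentioned in this abstract, a minimal sketch of the over-representation p-value follows. The function name and the toy numbers are assumptions of this example; the length-based correction coefficients proposed by the authors are not reproduced here because the abstract does not give their form.

```python
from math import comb


def hypergeom_enrichment_pvalue(k, population, annotated, sample):
    """P(X >= k) for a hypergeometric variable X.

    population: total number of genes considered
    annotated:  genes in the population carrying the GO category
    sample:     number of genes drawn (e.g. genes nearest the selected elements)
    k:          genes in the sample carrying the GO category
    """
    upper = min(annotated, sample)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, sample - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, sample)


# Toy example: 10 genes, 4 annotated, 3 drawn, 2 of the drawn genes annotated.
p_value = hypergeom_enrichment_pvalue(2, 10, 4, 3)  # equals 40/120
```

The test assumes every gene is equally likely to be drawn. The abstract's argument is that this assumption fails when genes are picked as the nearest neighbors of non-coding elements, because genes in longer loci are selected more often.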
Affiliation(s)
- Leila Taher
- Computational Biology Branch, National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, 8600 Rockville Pike, Bethesda, MD 20894, USA
50
Sidhu AS, Bellgard MI, Dillon TS. Classification of Information About Proteins. Bioinformatics 2009. [DOI: 10.1007/978-0-387-92738-1_12] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/30/2022] Open