1
Arumugam K, Shanker RR. Text Mining and Machine Learning Protocol for Extracting Human-Related Protein Phosphorylation Information from PubMed. Methods Mol Biol 2022; 2496:159-177. [PMID: 35713864 DOI: 10.1007/978-1-0716-2305-3_9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/15/2023]
Abstract
In modern health care research, protein phosphorylation has gained enormous attention from researchers across the globe, and automated approaches are required to process the huge volume of data on proteins and their modifications at the cellular level. The data generated at the cellular level are unique as well as arbitrary, and the accumulation of a massive volume of information is inevitable. Biological research has revealed that the huge array of cellular communication aided by protein phosphorylation and similar mechanisms carries different and diverse meanings. This has led to the collection of a huge volume of data to understand the biological functions of human evolution, especially for combating diseases in a better way. Text mining, an automated approach to mining information from unstructured data, finds application in extracting protein phosphorylation information from biomedical literature databases such as PubMed. This chapter outlines a recent text mining protocol that applies natural language parsing (NLP) for named entity recognition and text processing, and support vector machines (SVM), a machine learning algorithm, for classifying the processed text related to human protein phosphorylation. We discuss the evaluation of the text mining system produced by the protocol on three corpora, namely, the human Protein Phosphorylation (hPP) corpus, the Integrated Protein Literature Information and Knowledge corpus (iProLink), and the Phosphorylation Literature corpus (PLC). We also present a basic understanding of the chemistry and biology that drive the protein phosphorylation process in the human body. We believe that this basic understanding will be useful for advancing the existing text mining systems for extracting protein phosphorylation information from PubMed.
Affiliation(s)
- Krishnamurthy Arumugam
  - Department of Management Studies, Coimbatore Institute of Engineering and Technology, Coimbatore, Tamilnadu, India
- Raja Ravi Shanker
  - International Business Unit, Alembic Pharmaceuticals Limited, Vadodara, Gujarat, India
2
Savage SR, Zhang B. Using phosphoproteomics data to understand cellular signaling: a comprehensive guide to bioinformatics resources. Clin Proteomics 2020; 17:27. [PMID: 32676006 PMCID: PMC7353784 DOI: 10.1186/s12014-020-09290-x] [Citation(s) in RCA: 40] [Impact Index Per Article: 8.0] [Received: 12/01/2019] [Accepted: 07/04/2020] [Indexed: 12/19/2022]
Abstract
Mass spectrometry-based phosphoproteomics is becoming an essential methodology for the study of global cellular signaling. Numerous bioinformatics resources are available to facilitate the translation of phosphopeptide identification and quantification results into novel biological and clinical insights, a critical step in phosphoproteomics data analysis. These resources include knowledge bases of kinases and phosphatases, phosphorylation sites, kinase inhibitors, and sequence variants affecting kinase function, and bioinformatics tools that can predict phosphorylation sites in addition to the kinase that phosphorylates them, infer kinase activity, and predict the effect of mutations on kinase signaling. However, these resources exist in silos and it is challenging to select among multiple resources with similar functions. Therefore, we put together a comprehensive collection of resources related to phosphoproteomics data interpretation, compared the use of tools with similar functions, and assessed the usability from the standpoint of typical biologists or clinicians. Overall, tools could be improved by standardization of enzyme names, flexibility of data input and output format, consistent maintenance, and detailed manuals.
Affiliation(s)
- Sara R. Savage
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN USA
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX USA
- Bing Zhang
  - Lester and Sue Smith Breast Center, Baylor College of Medicine, Houston, TX USA
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX USA
3
Huang H, Arighi CN, Ross KE, Ren J, Li G, Chen SC, Wang Q, Cowart J, Vijay-Shanker K, Wu CH. iPTMnet: an integrated resource for protein post-translational modification network discovery. Nucleic Acids Res 2019; 46:D542-D550. [PMID: 29145615 PMCID: PMC5753337 DOI: 10.1093/nar/gkx1104] [Citation(s) in RCA: 109] [Impact Index Per Article: 18.2] [Received: 08/15/2017] [Accepted: 10/24/2017] [Indexed: 12/19/2022]
Abstract
Protein post-translational modifications (PTMs) play a pivotal role in numerous biological processes by modulating regulation of protein function. We have developed iPTMnet (http://proteininformationresource.org/iPTMnet) for PTM knowledge discovery, employing an integrative bioinformatics approach that combines text mining, data mining, and ontological representation to capture rich PTM information, including PTM enzyme-substrate-site relationships, PTM-specific protein-protein interactions (PPIs) and PTM conservation across species. iPTMnet encompasses data from (i) our PTM-focused text mining tools, RLIMS-P and eFIP, which extract phosphorylation information from full-scale mining of PubMed abstracts and full-length articles; (ii) a set of curated databases with experimentally observed PTMs; and (iii) the Protein Ontology, which organizes proteins and PTM proteoforms, enabling their representation, annotation and comparison within and across species. Presently covering eight major PTM types (phosphorylation, ubiquitination, acetylation, methylation, glycosylation, S-nitrosylation, sumoylation and myristoylation), the iPTMnet knowledgebase contains more than 654 500 unique PTM sites in over 62 100 proteins, along with more than 1200 PTM enzymes and over 24 300 PTM enzyme-substrate-site relations. The website supports online search, browsing, retrieval and visual analysis for scientific queries. Several examples, including functional interpretation of phosphoproteomic data, demonstrate iPTMnet as a gateway for visual exploration and systematic analysis of PTM networks and conservation, thereby enabling PTM discovery and hypothesis generation.
Affiliation(s)
- Hongzhan Huang
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
- Cecilia N Arighi
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
- Karen E Ross
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20057, USA
- Jia Ren
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
- Gang Li
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
- Sheng-Chih Chen
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
- Qinghua Wang
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
- Julie Cowart
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
- K Vijay-Shanker
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
- Cathy H Wu
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE 19711, USA
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC 20057, USA
4
Riedel MC, Salo T, Hays J, Turner MD, Sutherland MT, Turner JA, Laird AR. Automated, Efficient, and Accelerated Knowledge Modeling of the Cognitive Neuroimaging Literature Using the ATHENA Toolkit. Front Neurosci 2019; 13:494. [PMID: 31156374 PMCID: PMC6530419 DOI: 10.3389/fnins.2019.00494] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 01/15/2019] [Accepted: 04/29/2019] [Indexed: 11/13/2022]
Abstract
Neuroimaging research is growing rapidly, providing expansive resources for synthesizing data. However, navigating these dense resources is complicated by the volume of research articles and variety of experimental designs implemented across studies. The advent of machine learning algorithms and text-mining techniques has advanced automated labeling of published articles in biomedical research to alleviate such obstacles. As of yet, a comprehensive examination of document features and classifier techniques for annotating neuroimaging articles has yet to be undertaken. Here, we evaluated which combination of corpus (abstract-only or full-article text), features (bag-of-words or Cognitive Atlas terms), and classifier (Bernoulli naïve Bayes, k-nearest neighbors, logistic regression, or support vector classifier) resulted in the highest predictive performance in annotating a selection of 2,633 manually annotated neuroimaging articles. We found that, when utilizing full article text, data-driven features derived from the text performed the best, whereas if article abstracts were used for annotation, features derived from the Cognitive Atlas performed better. Additionally, we observed that when features were derived from article text, anatomical terms appeared to be the most frequently utilized for classification purposes and that cognitive concepts can be identified based on similar representations of these anatomical terms. Optimizing parameters for the automated classification of neuroimaging articles may result in a larger proportion of the neuroimaging literature being annotated with labels supporting the meta-analysis of psychological constructs.
Affiliation(s)
- Michael C. Riedel
  - Department of Physics, Florida International University, Miami, FL, United States
- Taylor Salo
  - Department of Psychology, Florida International University, Miami, FL, United States
- Jason Hays
  - Department of Psychology, Florida International University, Miami, FL, United States
- Matthew D. Turner
  - Psychology and Neuroscience, Georgia State University, Atlanta, GA, United States
- Matthew T. Sutherland
  - Department of Psychology, Florida International University, Miami, FL, United States
- Jessica A. Turner
  - Psychology and Neuroscience, Georgia State University, Atlanta, GA, United States
- Angela R. Laird
  - Department of Physics, Florida International University, Miami, FL, United States
5
Sun D, Wang M, Li A. MPTM: A tool for mining protein post-translational modifications from literature. J Bioinform Comput Biol 2017; 15:1740005. [PMID: 28982288 DOI: 10.1142/s0219720017400054] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 11/18/2022]
Abstract
Due to the importance of post-translational modifications (PTMs) in human health and diseases, PTMs are regularly reported in the biomedical literature. However, the continuing and rapid pace of expansion of this literature brings a huge challenge for researchers and database curators. Therefore, there is a pressing need to aid them in identifying relevant PTM information more efficiently by using a text mining system. So far, only a few web servers are available for mining information of a very limited number of PTMs, which are based on simple pattern matching or pre-defined rules. In our work, in order to help researchers and database curators easily find and retrieve PTM information from available text, we have developed a text mining tool called MPTM, which extracts and organizes valuable knowledge about 11 common PTMs from abstracts in PubMed by using relations extracted from dependency parse trees and a heuristic algorithm. It is the first web server that provides literature mining service for hydroxylation, myristoylation and GPI-anchor. The tool is also used to find new publications on PTMs from PubMed and uncovers potential PTM information by large-scale text analysis. MPTM analyzes text sentences to identify protein names including substrates and protein-interacting enzymes, and automatically associates them with the UniProtKB protein entry. To facilitate further investigation, it also retrieves PTM-related information, such as human diseases, Gene Ontology terms and organisms from the input text and related databases. In addition, an online database (MPTMDB) with extracted PTM information and a local MPTM Lite package are provided on the MPTM website. MPTM is freely available online at http://bioinformatics.ustc.edu.cn/mptm/ and the source codes are hosted on GitHub: https://github.com/USTC-HILAB/MPTM .
Affiliation(s)
- Dongdong Sun
  - School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China
- Minghui Wang
  - School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China
- Ao Li
  - School of Information Science and Technology, University of Science and Technology of China, Hefei, Anhui 230027, P. R. China
6
Wang Q, Ross KE, Huang H, Ren J, Li G, Vijay-Shanker K, Wu CH, Arighi CN. Analysis of Protein Phosphorylation and Its Functional Impact on Protein-Protein Interactions via Text Mining of the Scientific Literature. Methods Mol Biol 2017; 1558:213-232. [PMID: 28150240 PMCID: PMC5446092 DOI: 10.1007/978-1-4939-6783-4_10] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Indexed: 04/12/2023]
Abstract
Post-translational modifications (PTMs) are one of the main contributors to the diversity of proteoforms in the proteomic landscape. In particular, protein phosphorylation represents an essential regulatory mechanism that plays a role in many biological processes. Protein kinases, the enzymes catalyzing this reaction, are key participants in metabolic and signaling pathways. Their activation or inactivation dictate downstream events: what substrates are modified and their subsequent impact (e.g., activation state, localization, protein-protein interactions (PPIs)). The biomedical literature continues to be the main source of evidence for experimental information about protein phosphorylation. Automatic methods to bring together phosphorylation events and phosphorylation-dependent PPIs can help to summarize the current knowledge and to expose hidden connections. In this chapter, we demonstrate two text mining tools, RLIMS-P and eFIP, for the retrieval and extraction of kinase-substrate-site data and phosphorylation-dependent PPIs from the literature. These tools offer several advantages over a literature search in PubMed as their results are specific for phosphorylation. RLIMS-P and eFIP results can be sorted, organized, and viewed in multiple ways to answer relevant biological questions, and the protein mentions are linked to UniProt identifiers.
Affiliation(s)
- Qinghua Wang
  - Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Karen E Ross
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20057, USA
- Hongzhan Huang
  - Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Jia Ren
  - Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
- Gang Li
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
- K Vijay-Shanker
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Cathy H Wu
  - Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
  - Department of Biochemistry and Molecular & Cellular Biology, Georgetown University Medical Center, Washington, DC, 20057, USA
- Cecilia N Arighi
  - Center for Bioinformatics and Computational Biology, Delaware Biotechnology Institute, University of Delaware, 15 Innovation Way, Suite 205, Newark, DE, 19711, USA
  - Department of Computer & Information Sciences, University of Delaware, Newark, DE, 19711, USA
7
Kinoshita E, Kinoshita-Kikuta E, Kubota Y, Takekawa M, Koike T. A Phos-tag SDS-PAGE method that effectively uses phosphoproteomic data for profiling the phosphorylation dynamics of MEK1. Proteomics 2016; 16:1825-36. [DOI: 10.1002/pmic.201500494] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.1] [Received: 12/08/2015] [Revised: 04/14/2016] [Accepted: 05/02/2016] [Indexed: 12/11/2022]
Affiliation(s)
- Eiji Kinoshita
  - Department of Functional Molecular Science; Institute of Biomedical and Health Sciences; Hiroshima University; Japan
- Emiko Kinoshita-Kikuta
  - Department of Functional Molecular Science; Institute of Biomedical and Health Sciences; Hiroshima University; Japan
- Yuji Kubota
  - Division of Cell Signaling and Molecular Medicine; Institute of Medical Science; The University of Tokyo; Japan
- Mutsuhiro Takekawa
  - Division of Cell Signaling and Molecular Medicine; Institute of Medical Science; The University of Tokyo; Japan
- Tohru Koike
  - Department of Functional Molecular Science; Institute of Biomedical and Health Sciences; Hiroshima University; Japan
8
Matos S, Campos D, Pinho R, Silva RM, Mort M, Cooper DN, Oliveira JL. Mining clinical attributes of genomic variants through assisted literature curation in Egas. Database: The Journal of Biological Databases and Curation 2016; 2016:baw096. [PMID: 27278817 PMCID: PMC4897594 DOI: 10.1093/database/baw096] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 12/04/2015] [Accepted: 05/15/2016] [Indexed: 01/08/2023]
Abstract
The veritable deluge of biological data over recent years has led to the establishment of a considerable number of knowledge resources that compile curated information extracted from the literature and store it in structured form, facilitating its use and exploitation. In this article, we focus on the curation of inherited genetic variants and associated clinical attributes, such as zygosity, penetrance or inheritance mode, and describe the use of Egas for this task. Egas is a web-based platform for text-mining assisted literature curation that focuses on usability through modern design solutions and simple user interactions. Egas offers a flexible and customizable tool that allows defining the concept types and relations of interest for a given annotation task, as well as the ontologies used for normalizing each concept type. Further, annotations may be performed on raw documents or on the results of automated concept identification and relation extraction tools. Users can inspect, correct or remove automatic text-mining results, manually add new annotations, and export the results to standard formats. Egas is compatible with the most recent versions of Google Chrome, Mozilla Firefox, Internet Explorer and Safari and is available for use at https://demo.bmd-software.com/egas/. Database URL: https://demo.bmd-software.com/egas/
Affiliation(s)
- Sérgio Matos
  - IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Renato Pinho
  - IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
- Raquel M Silva
  - IEETA/DETI, University of Aveiro, Aveiro, 3810-193, Portugal
  - Department of Medical Sciences, iBiMED, University of Aveiro, Aveiro, 3810-193, Portugal
- David N Cooper
  - Institute of Medical Genetics, Cardiff University, Heath Park, Cardiff, UK
9
Soliman M, Nasraoui O, Cooper NGF. Building a glaucoma interaction network using a text mining approach. BioData Min 2016; 9:17. [PMID: 27152122 PMCID: PMC4857381 DOI: 10.1186/s13040-016-0096-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 10/09/2015] [Accepted: 04/23/2016] [Indexed: 11/21/2022]
Abstract
Background: The volume of biomedical literature and its underlying knowledge base is rapidly expanding, making it beyond the ability of a single human being to read through all the literature. Several automated methods have been developed to help make sense of this dilemma. The present study reports on the results of a text mining approach to extract gene interactions from the data warehouse of published experimental results which are then used to benchmark an interaction network associated with glaucoma. To the best of our knowledge, there is, as yet, no glaucoma interaction network derived solely from text mining approaches. The presence of such a network could provide a useful summative knowledge base to complement other forms of clinical information related to this disease.
Results: A glaucoma corpus was constructed from PubMed Central and a text mining approach was applied to extract genes and their relations from this corpus. The extracted relations between genes were checked using reference interaction databases and classified generally as known or new relations. The extracted genes and relations were then used to construct a glaucoma interaction network. Analysis of the resulting network indicated that it bears the characteristics of a small world interaction network. Our analysis showed the presence of seven glaucoma linked genes that defined the network modularity. A web-based system for browsing and visualizing the extracted glaucoma related interaction networks is made available at http://neurogene.spd.louisville.edu/GlaucomaINViewer/Form1.aspx.
Conclusions: This study has reported the first version of a glaucoma interaction network using a text mining approach. The power of such an approach is in its ability to cover a wide range of glaucoma related studies published over many years. Hence, a bigger picture of the disease can be established. To the best of our knowledge, this is the first glaucoma interaction network to summarize the known literature. The major findings were a set of relations that could not be found in existing interaction databases and that were found to be new, in addition to a smaller subnetwork consisting of interconnected clusters of seven glaucoma genes. Future improvements can be applied towards obtaining a better version of this network.
Electronic supplementary material: The online version of this article (doi:10.1186/s13040-016-0096-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Maha Soliman
  - Department of Anatomical Sciences and Neurobiology, University of Louisville, School of Medicine, Louisville, KY USA
- Olfa Nasraoui
  - Knowledge Discovery & Web Mining Lab, Department of Computer Engineering & Computer Science, University of Louisville, J.B Speed School of Engineering, Louisville, KY USA
- Nigel G F Cooper
  - Department of Anatomical Sciences and Neurobiology, University of Louisville, School of Medicine, Louisville, KY USA
10
Gupta S, Ross KE, Tudor CO, Wu CH, Schmidt CJ, Vijay-Shanker K. miRiaD: A Text Mining Tool for Detecting Associations of microRNAs with Diseases. J Biomed Semantics 2016; 7:9. [PMID: 27216254 PMCID: PMC4877743 DOI: 10.1186/s13326-015-0044-y] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 03/03/2015] [Accepted: 12/21/2015] [Indexed: 12/31/2022]
Abstract
Background: MicroRNAs are increasingly being appreciated as critical players in human diseases, and questions concerning the role of microRNAs arise in many areas of biomedical research. There are several manually curated databases of microRNA-disease associations gathered from the biomedical literature; however, it is difficult for curators of these databases to keep up with the explosion of publications in the microRNA-disease field. Moreover, automated literature mining tools that assist manual curation of microRNA-disease associations currently capture only one microRNA property (expression) in the context of one disease (cancer). Thus, there is a clear need to develop more sophisticated automated literature mining tools that capture a variety of microRNA properties and relations in the context of multiple diseases to provide researchers with fast access to the most recent published information and to streamline and accelerate manual curation.
Methods: We have developed miRiaD (microRNAs in association with Disease), a text-mining tool that automatically extracts associations between microRNAs and diseases from the literature. These associations are often not directly linked, and the intermediate relations are often highly informative for the biomedical researcher. Thus, miRiaD extracts the miR-disease pairs together with an explanation for their association. We also developed a procedure that assigns scores to sentences, marking their informativeness, based on the microRNA-disease relation observed within the sentence.
Results: miRiaD was applied to the entire Medline corpus, identifying 8301 PMIDs with miR-disease associations. These abstracts and the miR-disease associations are available for browsing at http://biotm.cis.udel.edu/miRiaD. We evaluated the recall and precision of miRiaD with respect to information of high interest to public microRNA-disease database curators (expression and target gene associations), obtaining a recall of 88.46–90.78. When we expanded the evaluation to include sentences with a wide range of microRNA-disease information that may be of interest to biomedical researchers, miRiaD also performed very well with an F-score of 89.4. The informativeness ranking of sentences was evaluated in terms of nDCG (0.977) and correlation metrics (0.678-0.727) when compared to an annotator’s ranked list.
Conclusions: miRiaD, a high performance system that can capture a wide variety of microRNA-disease related information, extends beyond the scope of existing microRNA-disease resources. It can be incorporated into manual curation pipelines and serve as a resource for biomedical researchers interested in the role of microRNAs in disease. In our ongoing work we are developing an improved miRiaD web interface that will facilitate complex queries about microRNA-disease relationships, such as “In what diseases does microRNA regulation of apoptosis play a role?” or “Is there overlap in the sets of genes targeted by microRNAs in different types of dementia?”
Electronic supplementary material: The online version of this article (doi:10.1186/s13326-015-0044-y) contains supplementary material, which is available to authorized users.
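The miRiaD abstract above reports recall, precision, an F-score, and nDCG without defining them. As a reading aid, the following is a minimal Python sketch of the standard textbook definitions. It is not taken from the miRiaD paper: the function names, the choice of the balanced F1 form, and the log2-discounted variant of DCG are all assumptions, and the abstract does not say which nDCG variant the authors used.

```python
import math


def f_score(precision, recall, beta=1.0):
    """F-beta score: weighted harmonic mean of precision and recall.

    beta=1.0 gives the balanced F1 score (an assumption here; the
    abstract only says "F-score").
    """
    if precision == 0 and recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def dcg(relevances):
    """Discounted cumulative gain with a log2 rank discount.

    `relevances` lists graded relevance scores in the order the system
    ranked the items (rank 1 first).
    """
    return sum(rel / math.log2(rank + 1)
               for rank, rel in enumerate(relevances, start=1))


def ndcg(relevances):
    """DCG normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg(sorted(relevances, reverse=True))
    return dcg(relevances) / ideal if ideal > 0 else 0.0


# Hypothetical inputs for illustration only, not data from the paper.
print(f_score(0.9, 0.8))
print(ndcg([3, 2, 3, 0, 1]))
```

A ranking that already places the most informative sentences first yields an nDCG of 1.0, and values approach 1.0 as the system ranking approaches the annotator's ideal ordering.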
Affiliation(s)
- Samir Gupta
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
- Karen E Ross
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Catalina O Tudor
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Cathy H Wu
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
  - Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, 19711, USA
- Carl J Schmidt
  - Department of Food and Animal Sciences, University of Delaware, Newark, DE, 19711, USA
- K Vijay-Shanker
  - Department of Computer and Information Sciences, University of Delaware, Newark, DE, 19711, USA
11
Bioinformatics Knowledge Map for Analysis of Beta-Catenin Function in Cancer. PLoS One 2015; 10:e0141773. [PMID: 26509276 PMCID: PMC4624812 DOI: 10.1371/journal.pone.0141773] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.8] [Received: 06/24/2015] [Accepted: 10/13/2015] [Indexed: 01/26/2023]
Abstract
Given the wealth of bioinformatics resources and the growing complexity of biological information, it is valuable to integrate data from disparate sources to gain insight into the role of genes/proteins in health and disease. We have developed a bioinformatics framework that combines literature mining with information from biomedical ontologies and curated databases to create knowledge "maps" of genes/proteins of interest. We applied this approach to the study of beta-catenin, a cell adhesion molecule and transcriptional regulator implicated in cancer. The knowledge map includes post-translational modifications (PTMs), protein-protein interactions, disease-associated mutations, and transcription factors co-activated by beta-catenin and their targets and captures the major processes in which beta-catenin is known to participate. Using the map, we generated testable hypotheses about beta-catenin biology in normal and cancer cells. By focusing on proteins participating in multiple relation types, we identified proteins that may participate in feedback loops regulating beta-catenin transcriptional activity. By combining multiple network relations with PTM proteoform-specific functional information, we proposed a mechanism to explain the observation that the cyclin dependent kinase CDK5 positively regulates beta-catenin co-activator activity. Finally, by overlaying cancer-associated mutation data with sequence features, we observed mutation patterns in several beta-catenin PTM sites and PTM enzyme binding sites that varied by tissue type, suggesting multiple mechanisms by which beta-catenin mutations can contribute to cancer. The approach described, which captures rich information for molecular species from genes and proteins to PTM proteoforms, is extensible to other proteins and their involvement in disease.
|
12
|
Predicting CK2 beta-dependent substrates using linear patterns. Biochem Biophys Rep 2015; 4:20-27. [PMID: 29124183 PMCID: PMC5668876 DOI: 10.1016/j.bbrep.2015.08.011] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 06/02/2015] [Revised: 08/14/2015] [Accepted: 08/17/2015] [Indexed: 12/13/2022] Open
Abstract
CK2 is a constitutively active Ser/Thr protein kinase deregulated in cancer and other pathologies, responsible for about 20% of the human phosphoproteome. The holoenzyme is a complex composed of two catalytic (α or α´) and two regulatory (β) subunits, with individual subunits also coexisting in the cell. In the holoenzyme, CK2β is a substrate-dependent modulator of kinase activity. Therefore, a comprehensive characterization of CK2 cellular function should first address which substrates are phosphorylated exclusively when CK2β is present (class-III or beta-dependent substrates). However, current experimental constraints limit this classification to a few substrates. Here, we took advantage of motif-based prediction and designed four linear patterns for predicting class-III behavior in sets of experimentally determined CK2 substrates. Integrating high-throughput substrate prediction, functional classification and network analysis, our results suggest that beta-dependent phosphorylation might exert particular regulatory roles in viral infection and biological processes/pathways like apoptosis, DNA repair and RNA metabolism. They also indicate that human beta-dependent substrates are mainly nuclear, with a few of them shuttling between nuclear and cytoplasmic compartments. The designed linear patterns assist the prediction of CK2 beta-dependent substrates. A high-throughput prediction of CK2 beta-dependent substrates was performed in several organisms including human, mouse and rat. The functional classification indicated a role of CK2 beta-dependent regulation in viral infection, apoptosis, DNA repair and RNA metabolism. The functional classification indicated that human CK2 beta-dependent substrates are mainly nuclear, with a number of them also found in the cytoplasm.
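As a rough illustration of what a motif-based "linear pattern" scan looks like, the sketch below searches a protein sequence for the commonly cited minimal CK2 consensus S/T-x-x-D/E (an acidic residue three positions after the phosphoacceptor). This is an assumption made for illustration only: the four class-III patterns designed in the paper are not given in the abstract and are not reproduced here, and the function name and example sequence are hypothetical.

```python
import re

# Minimal CK2 consensus: Ser/Thr, any two residues, then Asp/Glu at +3.
# The lookahead lets overlapping candidate sites all be reported.
# NOTE: this is the generic consensus, not the paper's class-III patterns.
CK2_MINIMAL = re.compile(r"(?=([ST]..[DE]))")


def find_candidate_sites(sequence: str) -> list[tuple[int, str]]:
    """Return (1-based position of the S/T residue, matched 4-mer) pairs."""
    seq = sequence.upper()
    return [(m.start() + 1, m.group(1)) for m in CK2_MINIMAL.finditer(seq)]


# Hypothetical toy sequence, not a real protein.
sites = find_candidate_sites("MASDDEEKTAAE")
print(sites)
```

A pattern hit of this kind only flags a candidate site; it says nothing by itself about whether phosphorylation depends on CK2β.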
|
13
|
Ding R, Arighi CN, Lee JY, Wu CH, Vijay-Shanker K. pGenN, a gene normalization tool for plant genes and proteins in scientific literature. PLoS One 2015; 10:e0135305. [PMID: 26258475 PMCID: PMC4530884 DOI: 10.1371/journal.pone.0135305] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.2] [Received: 04/15/2015] [Accepted: 07/20/2015] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Automatically detecting gene/protein names in the literature and connecting them to database records, also known as gene normalization, provides a means to structure the information buried in free-text literature. Gene normalization is critical for improving the coverage of annotation in the databases, and is an essential component of many text mining systems and database curation pipelines. METHODS In this manuscript, we describe a gene normalization system specifically tailored for plant species, called pGenN (pivot-based Gene Normalization). The system consists of three steps: dictionary-based gene mention detection, species assignment, and intra-species normalization. We have developed new heuristics to improve each of these phases. RESULTS We evaluated the performance of pGenN on an in-house expertly annotated corpus consisting of 104 plant-relevant abstracts. Our system achieved an F-value of 88.9% (Precision 90.9% and Recall 87.2%) on this corpus, outperforming state-of-the-art systems presented in BioCreative III. We have processed over 440,000 plant-related Medline abstracts using pGenN. The gene normalization results are stored in a local database for direct query from the pGenN web interface (proteininformationresource.org/pgenn/). The annotated literature corpus is also publicly available through the PIR text mining portal (proteininformationresource.org/iprolink/).
Affiliation(s)
- Ruoyao Ding
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
| | - Cecilia N. Arighi
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
| | - Jung-Youn Lee
- Department of Plant and Soil Sciences, University of Delaware, Newark, Delaware, United States of America
| | - Cathy H. Wu
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
- Center for Bioinformatics and Computational Biology, University of Delaware, Newark, Delaware, United States of America
| | - K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, Newark, Delaware, United States of America
| |
|
14
|
Tudor CO, Ross KE, Li G, Vijay-Shanker K, Wu CH, Arighi CN. Construction of phosphorylation interaction networks by text mining of full-length articles using the eFIP system. Database (Oxford) 2015; 2015:bav020. [PMID: 25833953 PMCID: PMC4381107 DOI: 10.1093/database/bav020] [Citation(s) in RCA: 21] [Impact Index Per Article: 2.1] [Received: 11/15/2014] [Revised: 02/17/2015] [Accepted: 02/18/2015] [Indexed: 12/11/2022]
Abstract
Protein phosphorylation is a reversible post-translational modification where a protein kinase adds a phosphate group to a protein, potentially regulating its function, localization and/or activity. Phosphorylation can affect protein-protein interactions (PPIs), abolishing interaction with previous binding partners or enabling new interactions. Extracting phosphorylation information coupled with PPI information from the scientific literature will facilitate the creation of phosphorylation interaction networks of kinases, substrates and interacting partners, toward knowledge discovery of functional outcomes of protein phosphorylation. Increasingly, PPI databases are interested in capturing the phosphorylation state of interacting partners. We have previously developed the eFIP (Extracting Functional Impact of Phosphorylation) text mining system, which identifies phosphorylated proteins and phosphorylation-dependent PPIs. In this work, we present several enhancements for the eFIP system: (i) text mining for full-length articles from the PubMed Central open-access collection; (ii) the integration of the RLIMS-P 2.0 system for the extraction of phosphorylation events with kinase, substrate and site information; (iii) the extension of the PPI module with new trigger words/phrases describing interactions and (iv) the addition of the iSimp tool for sentence simplification to aid in the matching of syntactic patterns. We enhance the website functionality to: (i) support searches based on protein roles (kinases, substrates, interacting partners) or using keywords; (ii) link protein entities to their corresponding UniProt identifiers if mapped and (iii) support visual exploration of phosphorylation interaction networks using Cytoscape. The evaluation of eFIP on full-length articles achieved 92.4% precision, 76.5% recall and 83.7% F-measure on 100 article sections. 
To demonstrate eFIP for knowledge extraction and discovery, we constructed phosphorylation-dependent interaction networks involving 14-3-3 proteins identified from cancer-related versus diabetes-related articles. Comparison of the phosphorylation interaction network of kinases, phosphoproteins and interactants obtained from eFIP searches, along with enrichment analysis of the protein set, revealed several shared interactions, highlighting common pathways discussed in the context of both diseases.
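The evaluation figures quoted above can be cross-checked with the standard balanced F-measure, the harmonic mean of precision and recall. The sketch below assumes the abstract reports that balanced (beta = 1) form; the function name is illustrative and not taken from the eFIP system.

```python
def f_measure(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score; beta=1 gives the harmonic mean of precision and recall."""
    if precision < 0 or recall < 0:
        raise ValueError("precision and recall must be non-negative")
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # convention when both precision and recall are zero
    return (1 + beta**2) * precision * recall / denominator


# Precision and recall as reported for eFIP on 100 article sections.
efip_f1 = f_measure(0.924, 0.765)
print(f"{efip_f1:.3f}")  # prints 0.837, matching the reported 83.7%
```

Recomputing from rounded percentages can differ from a published F-value in the last digit, because the published value is usually derived from raw counts.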
Affiliation(s)
- Catalina O Tudor
- Department of Computer and Information Sciences and Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - Karen E Ross
- Department of Computer and Information Sciences and Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - Gang Li
- Department of Computer and Information Sciences and Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - K Vijay-Shanker
- Department of Computer and Information Sciences and Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - Cathy H Wu
- Department of Computer and Information Sciences and Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| | - Cecilia N Arighi
- Department of Computer and Information Sciences and Center for Bioinformatics and Computational Biology, University of Delaware, Newark, DE, USA
| |
|
15
|
Khare R, Burger JD, Aberdeen JS, Tresner-Kirsch DW, Corrales TJ, Hirchman L, Lu Z. Scaling drug indication curation through crowdsourcing. Database (Oxford) 2015; 2015:bav016. [PMID: 25797061 PMCID: PMC4369375 DOI: 10.1093/database/bav016] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.1] [Received: 12/06/2014] [Revised: 02/04/2015] [Accepted: 02/09/2015] [Indexed: 01/24/2023]
Abstract
Motivated by the high cost of human curation of biological databases, there is an increasing interest in using computational approaches to assist human curators and accelerate the manual curation process. Towards the goal of cataloging drug indications from FDA drug labels, we recently developed LabeledIn, a human-curated drug indication resource for 250 clinical drugs. Its development required over 40 h of human effort across 20 weeks, despite using well-defined annotation guidelines. In this study, we aim to investigate the feasibility of scaling drug indication annotation through a crowdsourcing technique where an unknown network of workers can be recruited through the technical environment of Amazon Mechanical Turk (MTurk). To translate the expert-curation task of cataloging indications into human intelligence tasks (HITs) suitable for the average workers on MTurk, we first simplify the complex task such that each HIT only involves a worker making a binary judgment of whether a highlighted disease, in context of a given drug label, is an indication. In addition, this study is novel in the crowdsourcing interface design where the annotation guidelines are encoded into user options. For evaluation, we assess the ability of our proposed method to achieve high-quality annotations in a time-efficient and cost-effective manner. We posted over 3000 HITs drawn from 706 drug labels on MTurk. Within 8 h of posting, we collected 18 775 judgments from 74 workers, and achieved an aggregated accuracy of 96% on 450 control HITs (where gold-standard answers are known), at a cost of $1.75 per drug label. On the basis of these results, we conclude that our crowdsourcing approach not only results in significant cost and time saving, but also leads to accuracy comparable to that of domain experts.
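One way to picture how many binary worker judgments become a single answer per HIT, and how accuracy on control HITs is then scored, is a simple majority vote. The abstract does not state which aggregation rule the study used, so majority voting, the tie-breaking rule, all names, and the toy data below are assumptions for illustration only.

```python
from collections import Counter, defaultdict


def aggregate_by_majority(judgments):
    """Collapse (hit_id, worker_id, is_indication) triples into one label per HIT.

    Ties resolve to False ("not an indication"), a conservative assumption.
    """
    votes = defaultdict(Counter)
    for hit_id, _worker_id, is_indication in judgments:
        votes[hit_id][bool(is_indication)] += 1
    return {hit: count[True] > count[False] for hit, count in votes.items()}


def accuracy_on_controls(aggregated, gold):
    """Fraction of control HITs whose aggregated label matches the gold label."""
    scored = [hit for hit in gold if hit in aggregated]
    if not scored:
        raise ValueError("no control HITs were judged")
    return sum(aggregated[hit] == gold[hit] for hit in scored) / len(scored)


# Toy data: three HITs, each judged by several hypothetical workers.
judgments = [
    ("hit-1", "w1", True), ("hit-1", "w2", True), ("hit-1", "w3", False),
    ("hit-2", "w1", False), ("hit-2", "w2", False), ("hit-2", "w4", True),
    ("hit-3", "w2", True), ("hit-3", "w3", False),
]
labels = aggregate_by_majority(judgments)
print(labels)
print(accuracy_on_controls(labels, {"hit-1": True, "hit-2": True}))
```

Reducing each HIT to a yes/no question, as the study describes, is what makes a vote-style aggregation like this possible at all.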
Affiliation(s)
- Ritu Khare
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA
| | - John D Burger
- The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA
| | - John S Aberdeen
- The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA
| | | | - Theodore J Corrales
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA, Montgomery Blair High School, 57 University Blvd E., Silver Spring, MD 20901, USA
| | - Lynette Hirchman
- The MITRE Corporation, 202 Burlington Road, Bedford, MA 01730, USA
| | - Zhiyong Lu
- National Center for Biotechnology Information (NCBI), 8600 Rockville Pike, Bethesda, MD 20894, USA.
| |
|
16
|
Torii M, Arighi CN, Li G, Wang Q, Wu CH, Vijay-Shanker K. RLIMS-P 2.0: A Generalizable Rule-Based Information Extraction System for Literature Mining of Protein Phosphorylation Information. IEEE/ACM Trans Comput Biol Bioinform 2015; 12:17-29. [PMID: 26357075 PMCID: PMC4568560 DOI: 10.1109/tcbb.2014.2372765] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Indexed: 05/05/2023]
Abstract
We introduce RLIMS-P version 2.0, an enhanced rule-based information extraction (IE) system for mining kinase, substrate, and phosphorylation site information from scientific literature. Consisting of natural language processing and IE modules, the system has integrated several new features, including the capability of processing full-text articles and generalizability towards different post-translational modifications (PTMs). To evaluate the system, sets of abstracts and full-text articles, containing a variety of textual expressions, were annotated. On the abstract corpus, the system achieved F-scores of 0.91, 0.92, and 0.95 for kinases, substrates, and sites, respectively. The corresponding scores on the full-text corpus were 0.88, 0.91, and 0.92. It was additionally evaluated on the corpus of the 2013 BioNLP-ST GE task, and achieved an F-score of 0.87 for the phosphorylation core task, improving upon the results previously reported on the corpus. Full-scale processing of all abstracts in MEDLINE and all articles in the PubMed Central Open Access Subset has demonstrated scalability for mining rich information in literature, enabling its adoption for biocuration and for knowledge discovery. The new system is generalizable and it will be adapted to tackle other major PTM types. The RLIMS-P 2.0 system is available online (http://proteininformationresource.org/rlimsp/) and the developed corpora are available from iProLINK (http://proteininformationresource.org/iprolink/).
Affiliation(s)
- Manabu Torii
- Medical Informatics Group, Kaiser Permanente Southern California, 11975 El Camino Real, San Diego, CA 92130
| | - Cecilia N. Arighi
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711
| | - Gang Li
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 1971
| | - Qinghua Wang
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711
| | - Cathy H. Wu
- Center for Bioinformatics & Computational Biology, University of Delaware, 15 Innovation Way, Newark, DE 19711
| | - K. Vijay-Shanker
- Department of Computer and Information Sciences, University of Delaware, 101 Smith Hall, Newark, DE 19716
| |
|
17
|
Chatr-Aryamontri A, Breitkreutz BJ, Oughtred R, Boucher L, Heinicke S, Chen D, Stark C, Breitkreutz A, Kolas N, O'Donnell L, Reguly T, Nixon J, Ramage L, Winter A, Sellam A, Chang C, Hirschman J, Theesfeld C, Rust J, Livstone MS, Dolinski K, Tyers M. The BioGRID interaction database: 2015 update. Nucleic Acids Res 2014; 43:D470-8. [PMID: 25428363 PMCID: PMC4383984 DOI: 10.1093/nar/gku1204] [Citation(s) in RCA: 648] [Impact Index Per Article: 58.9] [Indexed: 12/18/2022] Open
Abstract
The Biological General Repository for Interaction Datasets (BioGRID: http://thebiogrid.org) is an open access database that houses genetic and protein interactions curated from the primary biomedical literature for all major model organism species and humans. As of September 2014, the BioGRID contains 749 912 interactions as drawn from 43 149 publications that represent 30 model organisms. This interaction count represents a 50% increase compared to our previous 2013 BioGRID update. BioGRID data are freely distributed through partner model organism databases and meta-databases and are directly downloadable in a variety of formats. In addition to general curation of the published literature for the major model species, BioGRID undertakes themed curation projects in areas of particular relevance for biomedical sciences, such as the ubiquitin-proteasome system and various human disease-associated interaction networks. BioGRID curation is coordinated through an Interaction Management System (IMS) that facilitates the compilation of interaction records through structured evidence codes, phenotype ontologies, and gene annotation. The BioGRID architecture has been improved in order to support a broader range of interaction and post-translational modification types, to allow the representation of more complex multi-gene/protein interactions, to account for cellular phenotypes through structured ontologies, to expedite curation through semi-automated text-mining approaches, and to enhance curation quality control.
Affiliation(s)
- Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Quebec H3C 3J7, Canada
| | - Bobby-Joe Breitkreutz
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Lorrie Boucher
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Sven Heinicke
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Daici Chen
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Quebec H3C 3J7, Canada
| | - Chris Stark
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Ashton Breitkreutz
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Nadine Kolas
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Lara O'Donnell
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Teresa Reguly
- The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada
| | - Julie Nixon
- School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JR, UK
| | - Lindsay Ramage
- School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JR, UK
| | - Andrew Winter
- School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JR, UK
| | - Adnane Sellam
- Centre Hospitalier de l'Université Laval (CHUL), Québec, Québec G1V 4G2, Canada
| | - Christie Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jodi Hirschman
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Chandra Theesfeld
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Michael S Livstone
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, Quebec H3C 3J7, Canada; The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario M5G 1X5, Canada; School of Biological Sciences, University of Edinburgh, Edinburgh EH9 3JR, UK
| |
|