1
Queralt-Rosinach N, Stupp GS, Li TS, Mayers M, Hoatlin ME, Might M, Good BM, Su AI. Structured reviews for data and knowledge-driven research. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2021; 2020:5818923. [PMID: 32283553 PMCID: PMC7153956 DOI: 10.1093/database/baaa015] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 08/12/2019] [Revised: 01/21/2020] [Accepted: 02/07/2020] [Indexed: 12/25/2022]
Abstract
Hypothesis generation is a critical step in research and a cornerstone in the rare disease field. Research is most efficient when those hypotheses are based on the entirety of knowledge known to date. Systematic review articles are commonly used in biomedicine to summarize existing knowledge and contextualize experimental data. But the information contained within review articles is typically only expressed as free-text, which is difficult to use computationally. Researchers struggle to navigate, collect and remix prior knowledge as it is scattered in several silos without seamless integration and access. This lack of a structured information framework hinders research by both experimental and computational scientists. To better organize knowledge and data, we built a structured review article that is specifically focused on NGLY1 Deficiency, an ultra-rare genetic disease first reported in 2012. We represented this structured review as a knowledge graph and then stored this knowledge graph in a Neo4j database to simplify dissemination, querying and visualization of the network. Relative to free-text, this structured review better promotes the principles of findability, accessibility, interoperability and reusability (FAIR). In collaboration with domain experts in NGLY1 Deficiency, we demonstrate how this resource can improve the efficiency and comprehensiveness of hypothesis generation. We also developed a read–write interface that allows domain experts to contribute FAIR structured knowledge to this community resource. In contrast to traditional free-text review articles, this structured review exists as a living knowledge graph that is curated by humans and accessible to computational analyses. Finally, we have generalized this workflow into modular and repurposable components that can be applied to other domain areas. This NGLY1 Deficiency-focused network is publicly available at http://ngly1graph.org/. 
Availability and implementation: Database URL: http://ngly1graph.org/. Network data files are at https://github.com/SuLab/ngly1-graph and source code is at https://github.com/SuLab/bioknowledge-reviewer. Contact: asu@scripps.edu
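A knowledge graph of the kind described in this abstract can be pictured as a set of subject-predicate-object triples. The following is a minimal Python sketch, not the authors' implementation: the node identifiers, edge labels and the `build_index` and `find_paths` helpers are hypothetical and are not taken from the NGLY1 graph or its Neo4j schema. It shows how a multi-hop path between two entities, the raw material for a mechanistic hypothesis, might be enumerated.

```python
from collections import defaultdict

# Illustrative triples only: these identifiers and edge labels are
# assumptions made for this sketch, not content of the NGLY1 graph.
TRIPLES = [
    ("GENE:A", "associated_with", "DISEASE:X"),
    ("GENE:A", "interacts_with", "GENE:B"),
    ("GENE:B", "involved_in", "PATHWAY:P"),
    ("DISEASE:X", "has_phenotype", "PHENOTYPE:Q"),
]


def build_index(triples):
    """Map each subject to its outgoing (predicate, object) edges."""
    index = defaultdict(list)
    for subject, predicate, obj in triples:
        index[subject].append((predicate, obj))
    return index


def find_paths(index, start, end, max_hops=3):
    """Enumerate directed simple paths from start to end (at most max_hops edges)."""
    results = []

    def walk(node, path, seen):
        if node == end and path:
            results.append(path)
            return
        if len(path) == max_hops:
            return
        # .get avoids inserting empty entries into the defaultdict
        for predicate, obj in index.get(node, []):
            if obj not in seen:
                walk(obj, path + [(node, predicate, obj)], seen | {obj})

    walk(start, [], {start})
    return results


if __name__ == "__main__":
    for path in find_paths(build_index(TRIPLES), "GENE:A", "PATHWAY:P"):
        print(" -> ".join(f"{s} [{p}] {o}" for s, p, o in path))
```

In a graph database such as Neo4j, the same question would normally be expressed as a path-pattern query rather than hand-written traversal code; the sketch only illustrates the underlying data model.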
Affiliation(s)
- Núria Queralt-Rosinach
  - Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
- Gregory S Stupp
  - Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
- Tong Shu Li
  - Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
- Michael Mayers
  - Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
- Maureen E Hoatlin
  - Department of Biochemistry and Molecular Biology, Oregon Health and Science University, 3181 SW Sam Jackson Parkway, Portland, OR 97239, USA
- Matthew Might
  - Department of Medicine, Hugh Kaul Precision Medicine Institute, University of Alabama at Birmingham, 510 20th St S, Birmingham, AL 35210, USA
- Benjamin M Good
  - Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
- Andrew I Su
  - Department of Integrative Structural and Computational Biology, Scripps Research, 10550 N Torrey Pines Rd. La Jolla, CA 92037, USA
2
Coley CW, Eyke NS, Jensen KF. Autonomous Discovery in the Chemical Sciences Part I: Progress. Angew Chem Int Ed Engl 2020; 59:22858-22893. [DOI: 10.1002/anie.201909987] [Citation(s) in RCA: 100] [Impact Index Per Article: 20.0] [Received: 08/06/2019] [Indexed: 01/05/2023]
Affiliation(s)
- Connor W. Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Natalie S. Eyke
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Klavs F. Jensen
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
3
Coley CW, Eyke NS, Jensen KF. Autonome Entdeckung in den chemischen Wissenschaften, Teil I: Fortschritt [Autonomous Discovery in the Chemical Sciences, Part I: Progress]. Angew Chem Int Ed Engl 2020. [DOI: 10.1002/ange.201909987] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Indexed: 12/14/2022]
Affiliation(s)
- Connor W. Coley
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Natalie S. Eyke
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
- Klavs F. Jensen
  - Department of Chemical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA
4
Bouadjenek MR, Zobel J, Verspoor K. Automated assessment of biological database assertions using the scientific literature. BMC Bioinformatics 2019; 20:216. [PMID: 31035936 PMCID: PMC6489365 DOI: 10.1186/s12859-019-2801-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 12/14/2018] [Accepted: 04/09/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND The large biological databases such as GenBank contain vast numbers of records, the content of which is substantively based on external resources, including published literature. Manual curation is used to establish whether the literature and the records are indeed consistent. We explore in this paper an automated method for assessing the consistency of biological assertions, to assist biocurators, which we call BARC, Biocuration tool for Assessment of Relation Consistency. In this method a biological assertion is represented as a relation between two objects (for example, a gene and a disease); we then use our novel set-based relevance algorithm SaBRA to retrieve pertinent literature, and apply a classifier to estimate the likelihood that this relation (assertion) is correct. RESULTS Our experiments on assessing gene-disease relations and protein-protein interactions using the PubMed Central collection show that BARC can be effective at assisting curators to perform data cleansing. Specifically, the results obtained showed that BARC substantially outperforms the best baselines, with an improvement of F-measure of 3.5% and 13%, respectively, on gene-disease relations and protein-protein interactions. We have additionally carried out a feature analysis that showed that all feature types are informative, as are all fields of the documents. CONCLUSIONS BARC provides a clear benefit for the biocuration community, as there are no prior automated tools for identifying inconsistent assertions in large-scale biological databases.
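The results above are reported as F-measure improvements. As a reminder of what that metric combines, here is a minimal Python sketch that computes precision, recall and the balanced F-measure (F1) from confusion counts. The counts in the usage example are made up for illustration and are not results from the paper; the abstract also does not state which F-measure weighting was used, so the balanced form is an assumption.

```python
def f_measure(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F_beta) from true/false positive and false negative counts.

    beta=1.0 gives the balanced F1 score; this default is an assumption,
    since the abstract does not specify the weighting.
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f


if __name__ == "__main__":
    # Hypothetical counts for a classifier that flags assertions as correct.
    p, r, f = f_measure(tp=8, fp=2, fn=2)
    print(f"precision={p:.2f} recall={r:.2f} F1={f:.2f}")
```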
Affiliation(s)
- Mohamed Reda Bouadjenek
  - Department of Mechanical & Industrial Engineering, University of Toronto, Toronto, M5S 3G8 Canada
- Justin Zobel
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
- Karin Verspoor
  - School of Computing and Information Systems, University of Melbourne, Melbourne, 3010 Australia
5
Abstract
Computational manipulation of knowledge is an important, and often under-appreciated, aspect of biomedical Data Science. The first Data Science initiative from the US National Institutes of Health was entitled "Big Data to Knowledge (BD2K)." The main emphasis of the more than $200M allocated to that program has been on "Big Data;" the "Knowledge" component has largely been the implicit assumption that the work will lead to new biomedical knowledge. However, there is long-standing and highly productive work in computational knowledge representation and reasoning, and computational processing of knowledge has a role in the world of Data Science. Knowledge-based biomedical Data Science involves the design and implementation of computer systems that act as if they knew about biomedicine. There are many ways in which a computational approach might act as if it knew something: for example, it might be able to answer a natural language question about a biomedical topic, or pass an exam; it might be able to use existing biomedical knowledge to rank or evaluate hypotheses; it might explain or interpret data in light of prior knowledge, either in a Bayesian or other sort of framework. These are all examples of automated reasoning that act on computational representations of knowledge. After a brief survey of existing approaches to knowledge-based data science, this position paper argues that such research is ripe for expansion, and expanded application.
Affiliation(s)
- Lawrence E Hunter
  - Computational Bioscience, University of Colorado School of Medicine, Aurora, CO 80045, USA; ORCID: https://orcid.org/0000-0003-1455-3370
6
Golestan Hashemi FS, Razi Ismail M, Rafii Yusop M, Golestan Hashemi MS, Nadimi Shahraki MH, Rastegari H, Miah G, Aslani F. Intelligent mining of large-scale bio-data: Bioinformatics applications. BIOTECHNOL BIOTEC EQ 2017. [DOI: 10.1080/13102818.2017.1364977] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Indexed: 01/05/2023]
Affiliation(s)
- Farahnaz Sadat Golestan Hashemi
  - Plant Genetics, AgroBioChem Department, Gembloux Agro-Bio Tech, University of Liege, Liege, Belgium
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Razi Ismail
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mohd Rafii Yusop
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Mahboobe Sadat Golestan Hashemi
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
  - Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Mohammad Hossein Nadimi Shahraki
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
  - Big Data Research Center, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Hamid Rastegari
  - Department of Software Engineering, Faculty of Computer Engineering, Najafabad Branch, Islamic Azad University, Isfahan, Iran
- Gous Miah
  - Laboratory of Food Crops, Institute of Tropical Agriculture and Food Security, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
- Farzad Aslani
  - Department of Crop Science, Faculty of Agriculture, Universiti Putra Malaysia, Serdang, Selangor, Malaysia
7
Kao DP, Stevens LM, Hinterberg MA, Görg C. Phenotype-Specific Association of Single-Nucleotide Polymorphisms with Heart Failure and Preserved Ejection Fraction: a Genome-Wide Association Analysis of the Cardiovascular Health Study. J Cardiovasc Transl Res 2017; 10:285-294. [PMID: 28105587 DOI: 10.1007/s12265-017-9729-1] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Received: 10/03/2016] [Accepted: 01/04/2017] [Indexed: 12/29/2022]
Abstract
Little is known about genetics of heart failure with preserved ejection fraction (HFpEF) in part because of the many comorbidities in this population. To identify single-nucleotide polymorphisms (SNPs) associated with HFpEF, we analyzed phenotypic and genotypic data from the Cardiovascular Health Study, which profiled patients using a 50,000 SNP array. Results were explored using novel SNP- and gene-centric tools. We performed analyses to determine whether some SNPs were relevant only in certain phenotypes. Among 3804 patients, 7 clinical factors and 9 SNPs were significantly associated with HFpEF; the most notable of which was rs6996224, a SNP associated with transforming growth factor-beta receptor 3. Most SNPs were associated with HFpEF only in the absence of a clinical predictor. Significant SNPs represented genes involved in myocyte proliferation, transforming growth factor-beta/erbB signaling, and extracellular matrix formation. These findings suggest that genetic factors may be more important in some phenotypes than others.
Affiliation(s)
- David P Kao
  - University of Colorado School of Medicine, 12700 E 19th Ave Campus Box B-139, Aurora, CO, 80045, USA
- Laura M Stevens
  - University of Colorado School of Medicine, 12700 E 19th Ave Campus Box B-139, Aurora, CO, 80045, USA
- Michael A Hinterberg
  - University of Colorado School of Medicine, 12700 E 19th Ave Campus Box B-139, Aurora, CO, 80045, USA
- Carsten Görg
  - University of Colorado School of Medicine, 12700 E 19th Ave Campus Box B-139, Aurora, CO, 80045, USA
8
Lagani V, Karozou AD, Gomez-Cabrero D, Silberberg G, Tsamardinos I. A comparative evaluation of data-merging and meta-analysis methods for reconstructing gene-gene interactions. BMC Bioinformatics 2016; 17 Suppl 5:194. [PMID: 27294826 PMCID: PMC4905611 DOI: 10.1186/s12859-016-1038-1] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.3] [Indexed: 12/16/2022]
Abstract
BACKGROUND We address the problem of integratively analyzing multiple gene expression, microarray datasets in order to reconstruct gene-gene interaction networks. Integrating multiple datasets is generally believed to provide increased statistical power and to lead to a better characterization of the system under study. However, the presence of systematic variation across different studies makes network reverse-engineering tasks particularly challenging. We contrast two approaches that have been frequently used in the literature for addressing systematic biases: meta-analysis methods, which first calculate opportune statistics on single datasets and successively summarize them, and data-merging methods, which directly analyze the pooled data after removing eventual biases. This comparative evaluation is performed on both synthetic and real data, the latter consisting of two manually curated microarray compendia comprising several E. coli and Yeast studies, respectively. Furthermore, the reconstruction of the regulatory network of the transcription factor Ikaros in human Peripheral Blood Mononuclear Cells (PBMCs) is presented as a case-study. RESULTS The meta-analysis and data-merging methods included in our experimentations provided comparable performances on both synthetic and real data. Furthermore, both approaches outperformed (a) the naïve solution of merging data together ignoring possible biases, and (b) the results that are expected when only one dataset out of the available ones is analyzed in isolation. Using correlation statistics proved to be more effective than using p-values for correctly ranking candidate interactions. The results from the PBMC case-study indicate that the findings of the present study generalize to different types of network reconstruction algorithms. CONCLUSIONS Ignoring the systematic variations that differentiate heterogeneous studies can produce results that are statistically indistinguishable from random guessing. 
Meta-analysis and data-merging methods have proved equally effective in addressing this issue, and thus researchers may safely select the approach that best suits their specific application.
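The contrast drawn in this abstract can be illustrated with a toy example. The sketch below is a simplified Python illustration under stated assumptions, not the methods evaluated in the paper: the meta-analysis route is represented by a Fisher z-transform average of per-dataset Pearson correlations (weights n - 3), and the data-merging route by per-dataset mean-centering before pooling. The two small datasets are invented, with the second shifted by a constant to mimic a systematic batch effect.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def meta_correlation(datasets):
    """Meta-analysis sketch: average Fisher z-scores, weighted by n - 3 (needs n > 3)."""
    num = sum((len(xs) - 3) * math.atanh(pearson(xs, ys)) for xs, ys in datasets)
    den = sum(len(xs) - 3 for xs, _ in datasets)
    return math.tanh(num / den)


def merged_correlation(datasets, center=True):
    """Data-merging sketch: pool samples, optionally mean-centering each dataset first."""
    pooled_x, pooled_y = [], []
    for xs, ys in datasets:
        mx = sum(xs) / len(xs) if center else 0.0
        my = sum(ys) / len(ys) if center else 0.0
        pooled_x += [x - mx for x in xs]
        pooled_y += [y - my for y in ys]
    return pearson(pooled_x, pooled_y)


if __name__ == "__main__":
    # Hypothetical data: study 2 repeats study 1 with a constant offset.
    study1 = ([1, 2, 3, 4], [1, 3, 2, 4])
    study2 = ([11, 12, 13, 14], [-9, -7, -8, -6])
    data = [study1, study2]
    print("meta-analysis:", round(meta_correlation(data), 3))
    print("merged (centered):", round(merged_correlation(data), 3))
    print("merged (naive):", round(merged_correlation(data, center=False), 3))
```

In this toy case both bias-aware routes recover the within-study correlation, while naive pooling is dominated by the offset between studies and even flips the sign.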
Affiliation(s)
- Vincenzo Lagani
  - Institute of Computer Science, Foundation for Research and Technology – Hellas, Heraklion, Greece
  - Computer Science Department, University of Crete, Heraklion, Greece
- Argyro D. Karozou
  - Institute of Computer Science, Foundation for Research and Technology – Hellas, Heraklion, Greece
- David Gomez-Cabrero
  - Unit of Computational Medicine, Department of Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
  - Center for Molecular Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
  - Unit of Clinical Epidemiology, Department of Medicine, Karolinska University Hospital, L8, 17176 Stockholm, Sweden
  - Science for Life Laboratory, 17121 Solna, Sweden
- Gilad Silberberg
  - Unit of Computational Medicine, Department of Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
  - Center for Molecular Medicine, Karolinska Institutet, 171 77 Stockholm, Sweden
  - Unit of Clinical Epidemiology, Department of Medicine, Karolinska University Hospital, L8, 17176 Stockholm, Sweden
  - Science for Life Laboratory, 17121 Solna, Sweden
- Ioannis Tsamardinos
  - Institute of Computer Science, Foundation for Research and Technology – Hellas, Heraklion, Greece
  - Computer Science Department, University of Crete, Heraklion, Greece
9
Tripathi S, Flobak Å, Chawla K, Baudot A, Bruland T, Thommesen L, Kuiper M, Lægreid A. The gastrin and cholecystokinin receptors mediated signaling network: a scaffold for data analysis and new hypotheses on regulatory mechanisms. BMC SYSTEMS BIOLOGY 2015. [PMID: 26205660 PMCID: PMC4513977 DOI: 10.1186/s12918-015-0181-z] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Indexed: 12/11/2022]
Abstract
Background The gastrointestinal peptide hormones cholecystokinin and gastrin exert their biological functions via cholecystokinin receptors CCK1R and CCK2R respectively. Gastrin, a central regulator of gastric acid secretion, is involved in growth and differentiation of gastric and colonic mucosa, and there is evidence that it is pro-carcinogenic. Cholecystokinin is implicated in digestion, appetite control and body weight regulation, and may play a role in several digestive disorders. Results We performed a detailed analysis of the literature reporting experimental evidence on signaling pathways triggered by CCK1R and CCK2R, in order to create a comprehensive map of gastrin and cholecystokinin-mediated intracellular signaling cascades. The resulting signaling map captures 413 reactions involving 530 molecular species, and incorporates the currently available knowledge into one integrated signaling network. The decomposition of the signaling map into sub-networks revealed 18 modules that represent higher-level structures of the signaling map. These modules allow a more compact mapping of intracellular signaling reactions to known cell behavioral outcomes such as proliferation, migration and apoptosis. The integration of large-scale protein-protein interaction data to this literature-based signaling map in combination with topological analyses allowed us to identify 70 proteins able to increase the compactness of the map. These proteins represent experimentally testable hypotheses for gaining new knowledge on gastrin- and cholecystokinin receptor signaling. The CCKR map is freely available both in a downloadable, machine-readable SBML-compatible format and as a web resource through PAYAO (http://sblab.celldesigner.org:18080/Payao11/bin/). 
Conclusion We have demonstrated how a literature-based CCKR signaling map together with its protein interaction extensions can be analyzed to generate new hypotheses on molecular mechanisms involved in gastrin- and cholecystokinin-mediated regulation of cellular processes. Electronic supplementary material The online version of this article (doi:10.1186/s12918-015-0181-z) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sushil Tripathi
  - Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), N-7489, Trondheim, Norway.
- Åsmund Flobak
  - Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), N-7489, Trondheim, Norway.
- Konika Chawla
  - Department of Biology, Norwegian University of Science and Technology (NTNU), N-7491, Trondheim, Norway.
- Anaïs Baudot
  - I2M, Marseilles Institute of Mathematics CNRS - AMU, Case 907, 13288, Marseille, Cedex 9, France.
- Torunn Bruland
  - Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), N-7489, Trondheim, Norway.
- Liv Thommesen
  - Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), N-7489, Trondheim, Norway.
  - Department of Technology, Sør-Trøndelag University College, N-7004, Trondheim, Norway.
- Martin Kuiper
  - Department of Biology, Norwegian University of Science and Technology (NTNU), N-7491, Trondheim, Norway.
- Astrid Lægreid
  - Department of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), N-7489, Trondheim, Norway.
  - Institute of Cancer Research and Molecular Medicine, Norwegian University of Science and Technology (NTNU), N-7489, Trondheim, Norway.
10
Vehlow C, Kao DP, Bristow MR, Hunter LE, Weiskopf D, Görg C. Visual analysis of biological data-knowledge networks. BMC Bioinformatics 2015; 16:135. [PMID: 25925016 PMCID: PMC4456720 DOI: 10.1186/s12859-015-0550-z] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.7] [Received: 11/07/2014] [Accepted: 03/25/2015] [Indexed: 11/10/2022]
Abstract
BACKGROUND The interpretation of the results from genome-scale experiments is a challenging and important problem in contemporary biomedical research. Biological networks that integrate experimental results with existing knowledge from biomedical databases and published literature can provide a rich resource and powerful basis for hypothesizing about mechanistic explanations for observed gene-phenotype relationships. However, the size and density of such networks often impede their efficient exploration and understanding. RESULTS We introduce a visual analytics approach that integrates interactive filtering of dense networks based on degree-of-interest functions with attribute-based layouts of the resulting subnetworks. The comparison of multiple subnetworks representing different analysis facets is facilitated through an interactive super-network that integrates brushing-and-linking techniques for highlighting components across networks. An implementation is freely available as a Cytoscape app. CONCLUSIONS We demonstrate the utility of our approach through two case studies using a dataset that combines clinical data with high-throughput data for studying the effect of β-blocker treatment on heart failure patients. Furthermore, we discuss our team-based iterative design and development process as well as the limitations and generalizability of our approach.
Affiliation(s)
- Corinna Vehlow
  - VISUS, University of Stuttgart, Allmandring 19, Stuttgart, Germany.
- David P Kao
  - School of Medicine, University of Colorado, E 17th Pl, Aurora, CO, USA.
- Michael R Bristow
  - School of Medicine, University of Colorado, E 17th Pl, Aurora, CO, USA.
- Lawrence E Hunter
  - School of Medicine, University of Colorado, E 17th Pl, Aurora, CO, USA.
- Daniel Weiskopf
  - VISUS, University of Stuttgart, Allmandring 19, Stuttgart, Germany.
- Carsten Görg
  - School of Medicine, University of Colorado, E 17th Pl, Aurora, CO, USA.
11
Application of text mining in the biomedical domain. Methods 2015; 74:97-106. [PMID: 25641519 DOI: 10.1016/j.ymeth.2015.01.015] [Citation(s) in RCA: 80] [Impact Index Per Article: 8.0] [Received: 05/08/2014] [Revised: 01/21/2015] [Accepted: 01/23/2015] [Indexed: 12/12/2022]
Abstract
In recent years the amount of experimental data produced in biomedical research and the number of papers published in this field have grown rapidly. To keep up to date with developments in their field of interest and to interpret the outcome of experiments in light of all available literature, researchers increasingly turn to automated literature mining. As a consequence, text mining tools have evolved considerably in number and quality and can now be used to address a variety of research questions, ranging from de novo drug target discovery to enhanced biological interpretation of the results of high-throughput experiments. In this paper we introduce the most important techniques used for text mining and give an overview of the text mining tools currently in use and the types of problems to which they are typically applied.
12
Hristovski D, Dinevski D, Kastrin A, Rindflesch TC. Biomedical question answering using semantic relations. BMC Bioinformatics 2015; 16:6. [PMID: 25592675 PMCID: PMC4307891 DOI: 10.1186/s12859-014-0365-3] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.6] [Received: 03/31/2014] [Accepted: 10/29/2014] [Indexed: 11/30/2022]
Abstract
BACKGROUND The proliferation of the scientific literature in the field of biomedicine makes it difficult to keep abreast of current knowledge, even for domain experts. While general Web search engines and specialized information retrieval (IR) systems have made important strides in recent decades, the problem of accurate knowledge extraction from the biomedical literature is far from solved. Classical IR systems usually return a list of documents that have to be read by the user to extract relevant information. This tedious and time-consuming work can be lessened with automatic Question Answering (QA) systems, which aim to provide users with direct and precise answers to their questions. In this work we propose a novel methodology for QA based on semantic relations extracted from the biomedical literature. RESULTS We extracted semantic relations with the SemRep natural language processing system from 122,421,765 sentences, which came from 21,014,382 MEDLINE citations (i.e., the complete MEDLINE distribution up to the end of 2012). A total of 58,879,300 semantic relation instances were extracted and organized in a relational database. The QA process is implemented as a search in this database, which is accessed through a Web-based application, called SemBT (available at http://sembt.mf.uni-lj.si ). We conducted an extensive evaluation of the proposed methodology in order to estimate the accuracy of extracting a particular semantic relation from a particular sentence. Evaluation was performed by 80 domain experts. In total 7,510 semantic relation instances belonging to 2,675 distinct relations were evaluated 12,083 times. The instances were evaluated as correct 8,228 times (68%). CONCLUSIONS In this work we propose an innovative methodology for biomedical QA. The system is implemented as a Web-based application that is able to provide precise answers to a wide range of questions. A typical question is answered within a few seconds. 
The tool has some extensions that make it especially useful for interpretation of DNA microarray results.
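The abstract describes question answering as a search over semantic relations held in a relational database. A minimal Python sketch of that idea, using the standard-library `sqlite3` module, is given below. The table layout, column names, the `build_db` and `answer` helpers and the sample rows are all assumptions made for illustration; they do not reflect the actual SemBT schema or any extracted SemRep data.

```python
import sqlite3


def build_db(rows):
    """Create an in-memory table of (subject, predicate, object, sentence_id) rows."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE relation ("
        "subject TEXT, predicate TEXT, object TEXT, sentence_id INTEGER)"
    )
    conn.executemany("INSERT INTO relation VALUES (?, ?, ?, ?)", rows)
    return conn


def answer(conn, predicate, obj):
    """Answer 'what <predicate> <obj>?' by ranking subjects by supporting sentences."""
    cur = conn.execute(
        "SELECT subject, COUNT(*) AS n FROM relation "
        "WHERE predicate = ? AND object = ? "
        "GROUP BY subject ORDER BY n DESC, subject",
        (predicate, obj),
    )
    return cur.fetchall()


if __name__ == "__main__":
    # Placeholder relation instances, one per source sentence.
    rows = [
        ("drug_a", "TREATS", "disease_x", 1),
        ("drug_a", "TREATS", "disease_x", 2),
        ("drug_b", "TREATS", "disease_x", 3),
        ("drug_b", "CAUSES", "disease_y", 4),
    ]
    conn = build_db(rows)
    print(answer(conn, "TREATS", "disease_x"))
```

Ranking answers by the number of supporting sentences is one simple design choice; the reported evaluation figure (8,228 correct judgments out of 12,083, about 68%) is a reminder that individual extracted relations can be wrong, so some form of aggregation or evidence display is useful.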
Affiliation(s)
- Dimitar Hristovski
  - Institute for Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana, Vrazov trg 2, SI-1104, Ljubljana, Slovenia.
- Dejan Dinevski
  - Faculty of Medicine, University of Maribor, Slomškov trg 15, SI-2000, Maribor, Slovenia.
- Andrej Kastrin
  - Faculty of Information Studies, Ulica talcev 3, SI-8000, Novo mesto, Slovenia.
- Thomas C Rindflesch
  - Lister Hill National Center for Biomedical Communications, U.S. National Library of Medicine, 8600 Rockville Pike, Bethesda, MD, 20894, USA.
13
Gil Y, Greaves M, Hendler J, Hirsh H. Artificial Intelligence. Amplify scientific discovery with artificial intelligence. Science 2014; 346:171-2. [PMID: 25301606 DOI: 10.1126/science.1259439] [Citation(s) in RCA: 46] [Impact Index Per Article: 4.2] [Indexed: 11/02/2022]
Affiliation(s)
- Yolanda Gil
  - Information Sciences Institute, University of Southern California, Marina del Rey, CA 90292, USA
- Mark Greaves
  - Pacific Northwest National Laboratory, Richland, WA 99354, USA
- James Hendler
  - Information Technology and Web Science, Rensselaer Polytechnic Institute, Troy, NY 12203, USA
- Haym Hirsh
  - Cornell University, Ithaca, NY 14850, USA
14
Chaudhri VK, Elenius D, Goldenkranz A, Gong A, Martone ME, Webb W, Yorke-Smith N. Comparative analysis of knowledge representation and reasoning requirements across a range of life sciences textbooks. J Biomed Semantics 2014; 5:51. [PMID: 25785183 PMCID: PMC4362633 DOI: 10.1186/2041-1480-5-51] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 05/23/2014] [Accepted: 11/26/2014] [Indexed: 11/29/2022]
Abstract
Background Using knowledge representation for biomedical projects is now commonplace. In previous work, we represented the knowledge found in a college-level biology textbook in a fashion useful for answering questions. We showed that embedding the knowledge representation and question-answering abilities in an electronic textbook helped to engage student interest and improve learning. A natural question that arises from this success, and this paper’s primary focus, is whether a similar approach is applicable across a range of life science textbooks. To answer that question, we considered four different textbooks, ranging from a below-introductory college biology text to an advanced, graduate-level neuroscience textbook. For these textbooks, we investigated the following questions: (1) To what extent is knowledge shared between the different textbooks? (2) To what extent can the same upper ontology be used to represent the knowledge found in different textbooks? (3) To what extent can the questions of interest for a range of textbooks be answered by using the same reasoning mechanisms? Results Our existing modeling and reasoning methods apply especially well both to a textbook that is comparable in level to the text studied in our previous work (i.e., an introductory-level text) and to a textbook at a lower level, suggesting potential for a high degree of portability. Even for the overlapping knowledge found across the textbooks, the level of detail covered in each textbook was different, which requires that the representations be customized for each textbook. We also found that for advanced textbooks, representing models and scientific reasoning processes was particularly important. Conclusions With some additional work, our representation methodology would be applicable to a range of textbooks. The requirements for knowledge representation are common across textbooks, suggesting that a shared semantic infrastructure for the life sciences is feasible. Because our representation overlaps heavily with those already being used for biomedical ontologies, this work suggests a natural pathway to include such representations as part of the life sciences curriculum at different grade levels.
Affiliation(s)
- William Webb
- Foothill Community College, Los Altos Hills, CA USA
- Neil Yorke-Smith
- American University of Beirut, Beirut, Lebanon; University of Cambridge, Cambridge, UK
|
15
|
Bosio M, Salembier P, Bellot P, Oliveras-Vergès A. Hierarchical clustering combining numerical and biological similarities for gene expression data classification. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2013; 2013:584-7. [PMID: 24109754 DOI: 10.1109/embc.2013.6609567] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
High throughput data analysis is a challenging problem due to the vast amount of available data. A major concern is to develop algorithms that provide accurate numerical predictions and biologically relevant results. A wide variety of tools exist in the literature using biological knowledge to evaluate analysis results. Only recently, some works have included biological knowledge inside the analysis process improving the prediction results.
|
16
|
Bohra R, Klepacki J, Klawitter J, Klawitter J, Thurman J, Christians U. Proteomics and metabolomics in renal transplantation-quo vadis? Transpl Int 2013; 26:225-41. [PMID: 23350848 PMCID: PMC4006577 DOI: 10.1111/tri.12003] [Citation(s) in RCA: 36] [Impact Index Per Article: 3.0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/15/2012] [Revised: 05/07/2012] [Accepted: 10/07/2012] [Indexed: 12/13/2022]
Abstract
The improvement of long-term transplant organ and patient survival remains a critical challenge following kidney transplantation. Proteomics and biochemical profiling (metabolomics) may allow for the detection of early changes in cell signal transduction regulation and biochemistry with high sensitivity and specificity. Hence, these analytical strategies hold the promise to detect and monitor disease processes and drug effects before histopathological and pathophysiological changes occur. In addition, they will identify enriched populations and enable individualized drug therapy. However, proteomics and metabolomics have not yet lived up to such high expectations. Renal transplant patients are highly complex, making it difficult to establish cause-effect relationships between surrogate markers and disease processes. Appropriate study design, adequate sample handling, storage and processing, quality and reproducibility of bioanalytical multi-analyte assays, data analysis and interpretation, mechanistic verification, and clinical qualification (=establishment of sensitivity and specificity in adequately powered prospective clinical trials) are important factors for the success of molecular marker discovery and development in renal transplantation. However, a newly developed and appropriately qualified molecular marker can only be successful if it is realistic that it can be implemented in a clinical setting. The development of combinatorial markers with supporting software tools is an attractive goal.
Affiliation(s)
- Rahul Bohra
- iC42 Clinical Research & Development, Department of Anesthesiology, University of Colorado Denver, Aurora, Colorado, USA
- Jacek Klepacki
- iC42 Clinical Research & Development, Department of Anesthesiology, University of Colorado Denver, Aurora, Colorado, USA
- Jelena Klawitter
- iC42 Clinical Research & Development, Department of Anesthesiology, University of Colorado Denver, Aurora, Colorado, USA
- Renal Medicine, University of Colorado Denver, Aurora, USA
- Jost Klawitter
- iC42 Clinical Research & Development, Department of Anesthesiology, University of Colorado Denver, Aurora, Colorado, USA
- Joshua Thurman
- Renal Medicine, University of Colorado Denver, Aurora, USA
- Uwe Christians
- iC42 Clinical Research & Development, Department of Anesthesiology, University of Colorado Denver, Aurora, Colorado, USA
|
17
|
Blaschke C, Valencia A. The Functional Genomics Network in the evolution of biological text mining over the past decade. N Biotechnol 2012. [PMID: 23202358 DOI: 10.1016/j.nbt.2012.11.020] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
Different programs of the European Science Foundation (ESF) have contributed significantly to connecting researchers in Europe and beyond through several initiatives. This support was particularly relevant for the development of areas related to extracting information from papers (text mining) because it supported the field in its early phases, long before it was recognized by the community. We review the historical development of text mining research and how it was introduced into bioinformatics. Specific applications in (functional) genomics are described, such as its integration in genome annotation pipelines and its support for the analysis of high-throughput genomics experimental data, and we highlight the activities of method evaluation and benchmarking, for which the ESF programme support was instrumental.
Affiliation(s)
- Christian Blaschke
- Spanish National Cancer Research Centre, C/Melchor Fernández Almagro, 3, E-28029 Madrid, Spain.
|
18
|
Rebholz-Schuhmann D, Oellrich A, Hoehndorf R. Text-mining solutions for biomedical research: enabling integrative biology. Nat Rev Genet 2012; 13:829-39. [DOI: 10.1038/nrg3337] [Citation(s) in RCA: 139] [Impact Index Per Article: 10.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
|
19
|
Abstract
OBJECTIVE This study investigated the utility of applying advanced computational techniques to large-scale genome-based data to identify novel genes that govern murine pancreatic development. METHODS An expression data set for mouse pancreatic development was complemented with a high-throughput data analyzer to identify and prioritize novel genes. Quantitative real-time polymerase chain reaction, in situ hybridization, and immunohistochemistry were used to validate selected genes. RESULTS Four new genes whose roles in the development of murine pancreas have not previously been established were identified: cystathionine β-synthase (Cbs), Meis homeobox 1, growth factor independent 1, and aldehyde dehydrogenase 18 family, member A1. Their temporal expression during development was documented. Cbs was localized in the cytoplasm of the tip cells of the epithelial chords of the undifferentiated progenitor cells at E12.5 and was coexpressed with the pancreatic and duodenal homeobox 1 and pancreas-specific transcription factor, 1a-positive cells. In the adult pancreas, Cbs was localized primarily within the acinar compartment. CONCLUSIONS In silico analysis of high-throughput microarray data in combination with background knowledge about genes provides an additional reliable method of identifying novel genes. To our knowledge, the expression and localization of Cbs have not been previously documented during mouse pancreatic development.
|
20
|
Bada M, Eckert M, Evans D, Garcia K, Shipley K, Sitnikov D, Baumgartner WA, Cohen KB, Verspoor K, Blake JA, Hunter LE. Concept annotation in the CRAFT corpus. BMC Bioinformatics 2012; 13:161. [PMID: 22776079 PMCID: PMC3476437 DOI: 10.1186/1471-2105-13-161] [Citation(s) in RCA: 120] [Impact Index Per Article: 9.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/17/2011] [Accepted: 06/08/2012] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Manually annotated corpora are critical for the training and evaluation of automated methods to identify concepts in biomedical text. RESULTS This paper presents the concept annotations of the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-length, open-access biomedical journal articles that have been annotated both semantically and syntactically to serve as a research resource for the biomedical natural-language-processing (NLP) community. CRAFT identifies all mentions of nearly all concepts from nine prominent biomedical ontologies and terminologies: the Cell Type Ontology, the Chemical Entities of Biological Interest ontology, the NCBI Taxonomy, the Protein Ontology, the Sequence Ontology, the entries of the Entrez Gene database, and the three subontologies of the Gene Ontology. The first public release includes the annotations for 67 of the 97 articles, reserving two sets of 15 articles for future text-mining competitions (after which these too will be released). Concept annotations were created based on a single set of guidelines, which has enabled us to achieve consistently high interannotator agreement. CONCLUSIONS As the initial 67-article release contains more than 560,000 tokens (and the full set more than 790,000 tokens), our corpus is among the largest gold-standard annotated biomedical corpora. Unlike most others, the journal articles that comprise the corpus are drawn from diverse biomedical disciplines and are marked up in their entirety. Additionally, with a concept-annotation count of nearly 100,000 in the 67-article subset (and more than 140,000 in the full collection), the scale of conceptual markup is also among the largest of comparable corpora. The concept annotations of the CRAFT Corpus have the potential to significantly advance biomedical text mining by providing a high-quality gold standard for NLP systems. 
The corpus, annotation guidelines, and other associated resources are freely available at http://bionlp-corpora.sourceforge.net/CRAFT/index.shtml.
Affiliation(s)
- Michael Bada
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Miriam Eckert
- Department of Linguistics, University of Colorado Boulder, Boulder, CO, USA
- Donald Evans
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Kristin Garcia
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Krista Shipley
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Dmitry Sitnikov
- Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME, USA
- William A Baumgartner
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- K Bretonnel Cohen
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Karin Verspoor
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
- Victoria Research Lab, National ICT Australia, Melbourne, VIC, 3010, Australia
- Judith A Blake
- Mouse Genome Informatics, The Jackson Laboratory, Bar Harbor, ME, USA
- Lawrence E Hunter
- Department of Pharmacology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
|
21
|
Hahn U, Cohen KB, Garten Y, Shah NH. Mining the pharmacogenomics literature--a survey of the state of the art. Brief Bioinform 2012; 13:460-94. [PMID: 22833496 PMCID: PMC3404399 DOI: 10.1093/bib/bbs018] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2011] [Accepted: 03/23/2012] [Indexed: 01/05/2023] Open
Abstract
This article surveys efforts on text mining of the pharmacogenomics literature, mainly from the period 2008 to 2011. Pharmacogenomics (or pharmacogenetics) is the field that studies how human genetic variation impacts drug response. Therefore, publications span the intersection of research in genotypes, phenotypes and pharmacology, a topic that has increasingly become a focus of active research in recent years. This survey covers efforts dealing with the automatic recognition of relevant named entities (e.g. genes, gene variants and proteins, diseases and other pathological phenomena, drugs and other chemicals relevant for medical treatment), as well as various forms of relations between them. A wide range of text genres is considered, such as scientific publications (abstracts, as well as full texts), patent texts and clinical narratives. We also discuss infrastructure and resources needed for advanced text analytics, e.g. document corpora annotated with corresponding semantic metadata (gold standards and training data), biomedical terminologies and ontologies providing domain-specific background knowledge at different levels of formality and specificity, software architectures for building complex and scalable text analytics pipelines and Web services grounded to them, as well as comprehensive ways to disseminate and interact with the typically huge amounts of semiformal knowledge structures extracted by text mining tools. Finally, we consider some of the novel applications that have already been developed in the field of pharmacogenomic text mining and point out perspectives for future research.
Affiliation(s)
- Udo Hahn
- Jena University Language and Information Engineering (JULIE) Lab, Friedrich-Schiller-Universität Jena, Jena, Germany.
|
22
|
Galligan JJ, Smathers RL, Fritz KS, Epperson LE, Hunter LE, Petersen DR. Protein carbonylation in a murine model for early alcoholic liver disease. Chem Res Toxicol 2012; 25:1012-21. [PMID: 22502949 DOI: 10.1021/tx300002q] [Citation(s) in RCA: 246] [Impact Index Per Article: 18.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/29/2022]
Abstract
Hepatic oxidative stress and subsequent lipid peroxidation are well-recognized consequences of sustained ethanol consumption. The covalent adduction of nucleophilic amino acid side-chains by lipid electrophiles is significantly increased in patients with alcoholic liver disease (ALD); a global assessment of in vivo protein targets and the consequences of these modifications, however, has not been conducted. In this article, we describe the identification of novel protein targets for covalent adduction in a 6-week murine model for ALD. Ethanol-fed mice displayed a 2-fold increase in hepatic TBARS, while immunohistochemical analysis for the reactive aldehydes 4-hydroxynonenal (4-HNE), 4-oxononenal (4-ONE), acrolein (ACR), and malondialdehyde (MDA) revealed a marked increase in the staining of modified proteins in the ethanol-treated mice. Increased protein carbonyl content was confirmed utilizing subcellular fractionation of liver homogenates followed by biotin-tagging through hydrazide chemistry, where approximately a 2-fold increase in modified proteins was observed in microsomal and cytosolic fractions. To determine targets of protein carbonylation, a secondary hydrazide method coupled to a highly sensitive 2-dimensional liquid chromatography tandem mass spectrometry (2D LC-MS/MS or MuDPIT) technique was utilized. Our results have identified 414 protein targets for modification by reactive aldehydes in ALD. The presence of novel in vivo sites of protein modification by 4-HNE (2), 4-ONE (4) and ACR (2) was also confirmed in our data set. While the precise impact of protein carbonylation in ALD remains unknown, a bioinformatic analysis of the data set has revealed key pathways associated with disease progression, including fatty acid metabolism, drug metabolism, oxidative phosphorylation, and the TCA cycle. These data suggest a major role for aldehyde adduction in the pathogenesis of ALD.
Affiliation(s)
- James J Galligan
- Department of Pharmacology, School of Medicine, University of Colorado-Denver, Aurora, CO 80045, USA
|
23
|
Sartor MA, Ade A, Wright Z, States D, Omenn GS, Athey B, Karnovsky A. Metab2MeSH: annotating compounds with medical subject headings. Bioinformatics 2012; 28:1408-10. [PMID: 22492643 DOI: 10.1093/bioinformatics/bts156] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/30/2023]
Abstract
SUMMARY Progress in high-throughput genomic technologies has led to the development of a variety of resources that link genes to functional information contained in the biomedical literature. However, tools that link small molecules to normal and diseased physiology, and to published data relevant to biologists and clinical investigators, are still lacking. With metabolomics rapidly emerging as a new omics field, the task of annotating small molecule metabolites becomes highly relevant. Our tool Metab2MeSH uses a statistical approach to reliably and automatically annotate compounds with concepts defined in Medical Subject Headings (MeSH), the National Library of Medicine's controlled vocabulary for biomedical concepts. These annotations provide links from compounds to biomedical literature and complement existing resources such as PubChem and the Human Metabolome Database.
Affiliation(s)
- Maureen A Sartor
- National Center for Integrative Biomedical Informatics, University of Michigan, Ann Arbor, MI 48109, USA
|
24
|
Cohen KB, Verspoor K, Johnson HL, Roeder C, Ogren PV, Baumgartner WA, White E, Tipney H, Hunter L. HIGH-PRECISION BIOLOGICAL EVENT EXTRACTION: EFFECTS OF SYSTEM AND OF DATA. Comput Intell 2011; 27:681-701. [PMID: 25937701 DOI: 10.1111/j.1467-8640.2011.00405.x] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/26/2022]
Abstract
We approached the problems of event detection, argument identification, and negation and speculation detection in the BioNLP'09 information extraction challenge through concept recognition and analysis. Our methodology involved using the OpenDMAP semantic parser with manually written rules. The original OpenDMAP system was updated for this challenge with a broad ontology defined for the events of interest, new linguistic patterns for those events, and specialized coordination handling. We achieved state-of-the-art precision for two of the three tasks, scoring the highest of 24 teams at precision of 71.81 on Task 1 and the highest of 6 teams at precision of 70.97 on Task 2. We provide a detailed analysis of the training data and show that a number of trigger words were ambiguous as to event type, even when their arguments are constrained by semantic class. The data is also shown to have a number of missing annotations. Analysis of a sampling of the comparatively small number of false positives returned by our system shows that major causes of this type of error were failing to recognize second themes in two-theme events, failing to recognize events when they were the arguments to other events, failure to recognize nontheme arguments, and sentence segmentation errors. We show that specifically handling coordination had a small but important impact on the overall performance of the system. The OpenDMAP system and the rule set are available at http://bionlp.sourceforge.net.
Affiliation(s)
- K Bretonnel Cohen
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Karin Verspoor
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Helen L Johnson
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Chris Roeder
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Philip V Ogren
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- William A Baumgartner
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Elizabeth White
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Hannah Tipney
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
- Lawrence Hunter
- Center for Computational Pharmacology, University of Colorado Denver School of Medicine, Aurora, CO, USA
|
25
|
Sintchenko V, Coiera EW. Translational web robots for pathogen genome analysis. MICROBIAL INFORMATICS AND EXPERIMENTATION 2011; 1:10. [PMID: 22587672 PMCID: PMC3372293 DOI: 10.1186/2042-5783-1-10] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/20/2011] [Accepted: 10/31/2011] [Indexed: 11/10/2022]
Affiliation(s)
- Vitali Sintchenko
- Centre for Infectious Diseases and Microbiology-Public Health, Institute of Clinical Pathology and Medical Research, Westmead Hospital, Sydney, New South Wales, 2145 Australia.
|
26
|
Tsuruoka Y, Miwa M, Hamamoto K, Tsujii J, Ananiadou S. Discovering and visualizing indirect associations between biomedical concepts. Bioinformatics 2011; 27:i111-9. [PMID: 21685059 PMCID: PMC3117364 DOI: 10.1093/bioinformatics/btr214] [Citation(s) in RCA: 55] [Impact Index Per Article: 3.9] [Reference Citation Analysis] [Abstract] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/22/2022] Open
Abstract
Motivation: Discovering useful associations between biomedical concepts has been one of the main goals in biomedical text-mining, and understanding their biomedical contexts is crucial in the discovery process. Hence, we need a text-mining system that helps users explore various types of (possibly hidden) associations in an easy and comprehensible manner. Results: This article describes FACTA+, a real-time text-mining system for finding and visualizing indirect associations between biomedical concepts from MEDLINE abstracts. The system can be used as a text search engine like PubMed with additional features to help users discover and visualize indirect associations between important biomedical concepts such as genes, diseases and chemical compounds. FACTA+ inherits all functionality from its predecessor, FACTA, and extends it by incorporating three new features: (i) detecting biomolecular events in text using a machine learning model, (ii) discovering hidden associations using co-occurrence statistics between concepts, and (iii) visualizing associations to improve the interpretability of the output. To the best of our knowledge, FACTA+ is the first real-time web application that offers the functionality of finding concepts involving biomolecular events and visualizing indirect associations of concepts with both their categories and importance. Availability: FACTA+ is available as a web application at http://refine1-nactem.mc.man.ac.uk/facta/, and its visualizer is available at http://refine1-nactem.mc.man.ac.uk/facta-visualizer/. Contact:tsuruoka@jaist.ac.jp
Affiliation(s)
- Yoshimasa Tsuruoka
- School of Information Science, Japan Advanced Institute of Science and Technology (JAIST), Nomi, Japan.
|
27
|
Cohen KB, Christiansen T, Hunter LE. Parenthetically speaking: classifying the contents of parentheses for text mining. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:267-272. [PMID: 22195078 PMCID: PMC3243264] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
The contents of parentheses in biomedical text have many potential uses in text mining applications. However, making use of them requires the ability to determine what class of contents they are. A system that automatically classifies parenthesized text into one of 20 categories is presented and evaluated here. It performs at a micro-averaged accuracy of 68% and a macro-averaged accuracy of 60% on an annotated corpus. The application is available as a Java class and as a Perl module.
Affiliation(s)
- K Bretonnel Cohen
- Computational Bioscience Program, University of Colorado School of Medicine, Aurora, CO, USA
|
28
|
Acquaah-Mensah GK, Taylor RC, Bhave SV. PACAP interactions in the mouse brain: implications for behavioral and other disorders. Gene 2011; 491:224-31. [PMID: 22001548 DOI: 10.1016/j.gene.2011.09.017] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/27/2011] [Revised: 09/02/2011] [Accepted: 09/09/2011] [Indexed: 12/24/2022]
Abstract
As an activator of adenylate cyclase, the neuropeptide Pituitary Adenylate Cyclase Activating Peptide (PACAP) impacts levels of cyclic AMP, a key second messenger available in brain cells. PACAP is involved in certain adult behaviors. To elucidate PACAP interactions, a compendium of microarrays representing mRNA expression in the adult mouse whole brain was pooled from the Phenogen database for analysis. A regulatory network was computed based on mutual information between gene pairs using gene expression data across the compendium. Clusters among genes directly linked to PACAP, and probable interactions between corresponding proteins were computed. Database "experts" affirmed some of the inferred relationships. The findings suggest ADCY7 is probably the adenylate cyclase isoform most relevant to PACAP's action. They also support intervening roles for kinases including GSK3B, PI 3-kinase, SGK3 and AMPK. Other high-confidence interactions are hypothesized for future testing. This new information has implications for certain behavioral and other disorders.
Affiliation(s)
- George K Acquaah-Mensah
- Department of Pharmaceutical Sciences, Massachusetts College of Pharmacy and Health Sciences, Worcester, MA 01608, USA.
|
29
|
|
30
|
Evans JA, Rzhetsky A. Advancing science through mining libraries, ontologies, and communities. J Biol Chem 2011; 286:23659-66. [PMID: 21566119 PMCID: PMC3129146 DOI: 10.1074/jbc.r110.176370] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/04/2022] Open
Abstract
Life scientists today cannot hope to read everything relevant to their research. Emerging text-mining tools can help by identifying topics and distilling statements from books and articles with increased accuracy. Researchers often organize these statements into ontologies, consistent systems of reality claims. Like scientific thinking and interchange, however, text-mined information (even when accurately captured) is complex, redundant, sometimes incoherent, and often contradictory: it is rooted in a mixture of only partially consistent ontologies. We review work that models scientific reason and suggest how computational reasoning across ontologies and the broader distribution of textual statements can assess the certainty of statements and the process by which statements become certain. With the emergence of digitized data regarding networks of scientific authorship, institutions, and resources, we explore the possibility of accounting for social dependences and cultural biases in reasoning models. Computational reasoning is starting to fill out ontologies and flag internal inconsistencies in several areas of bioscience. In the not too distant future, scientists may be able to use statements and rich models of the processes that produced them to identify underexplored areas, resurrect forgotten findings and ideas, deconvolute the spaghetti of underlying ontologies, and synthesize novel knowledge and hypotheses.
Affiliation(s)
- James A Evans
- Department of Sociology, University of Chicago, Chicago, Illinois 60637, USA.
|
31
|
Jelier R, Goeman JJ, Hettne KM, Schuemie MJ, den Dunnen JT, 't Hoen PAC. Literature-aided interpretation of gene expression data with the weighted global test. Brief Bioinform 2010; 12:518-29. [PMID: 21183478 DOI: 10.1093/bib/bbq082] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/11/2022] Open
Abstract
Most methods for the interpretation of gene expression profiling experiments rely on the categorization of genes, as provided by the Gene Ontology (GO) and pathway databases. Due to the manual curation process, such databases are never up-to-date and tend to be limited in focus and coverage. Automated literature mining tools provide an attractive, alternative approach. We review how they can be employed for the interpretation of gene expression profiling experiments. We illustrate that their comprehensive scope aids the interpretation of data from domains poorly covered by GO or alternative databases, and allows for the linking of gene expression with diseases, drugs, tissues and other types of concepts. A framework for proper statistical evaluation of the associations between gene expression values and literature concepts was lacking and is now implemented in a weighted extension of global test. The weights are the literature association scores and reflect the importance of a gene for the concept of interest. In a direct comparison with classical GO-based gene sets, we show that use of literature-based associations results in the identification of much more specific GO categories. We demonstrate the possibilities for linking of gene expression data to patient survival in breast cancer and the action and metabolism of drugs. Coupling with online literature mining tools ensures transparency and allows further study of the identified associations. Literature mining tools are therefore powerful additions to the toolbox for the interpretation of high-throughput genomics data.
Affiliation(s)
- Rob Jelier
- Department of Human Genetics, Leiden University Medical Center, Leiden, The Netherlands
|
32
|
Thomas R, de la Torre L, Chang X, Mehrotra S. Validation and characterization of DNA microarray gene expression data distribution and associated moments. BMC Bioinformatics 2010; 11:576. [PMID: 21092329 PMCID: PMC3002903 DOI: 10.1186/1471-2105-11-576] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2010] [Accepted: 11/24/2010] [Indexed: 02/03/2023] Open
Abstract
Background The data from DNA microarrays are increasingly being used to understand the effects of different conditions, exposures or diseases on the modulation of the expression of various genes in a biological system. This knowledge is then further used to generate molecular mechanistic hypotheses for an organism when it is exposed to different conditions. Several different methods have been proposed to analyze these data under different distributional assumptions on gene expression. However, the empirical validation of these assumptions is lacking. Results Best fit hypotheses tests, moment-ratio diagrams and relationships between the different moments of the distribution of the gene expression were used to characterize the observed distributions. The data are obtained from the publicly available gene expression database, Gene Expression Omnibus (GEO), to characterize the empirical distributions of gene expressions obtained under varying experimental situations, each of which provides a relatively large number of samples for hypothesis testing. All data were obtained from either of two microarray platforms - the commercial Affymetrix mouse 430.2 platform and a non-commercial Rosetta/Merck one. The data from each platform were preprocessed in the same manner. Conclusions The null hypotheses for goodness of fit for all considered univariate theoretical probability distributions (including the Normal distribution) are rejected for more than 50% of probe sets on the Affymetrix microarray platform at a 95% confidence level, suggesting that under the tested conditions an a priori assumption of any of these distributions across all probe sets is not valid. The pattern of null hypotheses rejection was different for the data from the Rosetta/Merck platform, with only around 20% of the probe sets failing the logistic distribution goodness-of-fit test. 
We find that there are statistically significant (at 95% confidence level based on the F-test for the fitted linear model) relationships between the mean and the logarithm of the coefficient of variation of the distributions of the logarithm of gene expressions. An additional novel statistically significant quadratic relationship between the skewness and kurtosis is identified. Data from both microarray platforms fail to identify with any one of the chosen theoretical probability distributions from an analysis of the l-moment ratio diagram.
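The moment relationships described in this abstract can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the data are synthetic draws (not GEO data), the helper names `moments` and `polyfit` are hypothetical, and plain least squares stands in for whatever fitting procedure the authors used. It computes per-probe-set moments, then fits a line to mean versus log coefficient of variation and a quadratic to skewness versus kurtosis.

```python
import math
import random

def moments(values):
    """Return mean, log coefficient of variation, skewness and (non-excess) kurtosis."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    m4 = sum((v - mean) ** 4 for v in values) / n
    sd = math.sqrt(m2)
    return mean, math.log(sd / mean), m3 / sd ** 3, m4 / m2 ** 2

def polyfit(xs, ys, degree):
    """Least-squares polynomial coefficients, lowest order first (normal equations)."""
    k = degree + 1
    # Augmented matrix [X^T X | X^T y] for the normal equations.
    a = [[sum(x ** (i + j) for x in xs) for j in range(k)]
         + [sum(y * x ** i for x, y in zip(xs, ys))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(k):
            if r != col:
                f = a[r][col] / a[col][col]
                a[r] = [u - f * v for u, v in zip(a[r], a[col])]
    return [a[i][k] / a[i][i] for i in range(k)]

# Synthetic stand-in for log expression values: 200 probe sets x 60 samples.
random.seed(0)
probe_sets = [[random.gammavariate(4.0, 1.0) + 4.0 for _ in range(60)]
              for _ in range(200)]

means, log_cvs, skews, kurts = zip(*(moments(p) for p in probe_sets))
line_coeffs = polyfit(means, log_cvs, 1)   # intercept, slope
quad_coeffs = polyfit(skews, kurts, 2)     # constant, linear, quadratic terms
```

A quadratic form is a natural candidate because, for any distribution, kurtosis is bounded below by the squared skewness plus one.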
Affiliation(s)
- Reuben Thomas
- Department of Industrial Engineering and Management Sciences, Northwestern University, Evanston, IL, USA.
|
33
|
Bada M, Hunter L. Desiderata for ontologies to be used in semantic annotation of biomedical documents. J Biomed Inform 2010; 44:94-101. [PMID: 20971216 DOI: 10.1016/j.jbi.2010.10.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Received: 11/22/2009] [Revised: 10/03/2010] [Accepted: 10/09/2010] [Indexed: 11/20/2022]
Abstract
A wealth of knowledge valuable to the translational research scientist is contained within the vast biomedical literature, but this knowledge is typically in the form of natural language. Sophisticated natural-language-processing systems are needed to translate text into unambiguous formal representations grounded in high-quality consensus ontologies, and these systems in turn rely on gold-standard corpora of annotated documents for training and testing. To this end, we are constructing the Colorado Richly Annotated Full-Text (CRAFT) Corpus, a collection of 97 full-text biomedical journal articles that are being manually annotated with the entire sets of terms from select vocabularies, predominantly from the Open Biomedical Ontologies (OBO) library. Our efforts in building this corpus have illuminated infelicities of these ontologies with respect to the semantic annotation of biomedical documents, and we propose desiderata whose implementation could substantially improve their utility in this task; these include the integration of overlapping terms across OBOs, the resolution of OBO-specific ambiguities, the integration of the BFO with the OBOs and the use of mid-level ontologies, the inclusion of noncanonical instances, and the expansion of relations and realizable entities.
Affiliation(s)
- Michael Bada
- Department of Pharmacology, University of Colorado Denver, MS 8303, RC-1 South, 12801 East 17th Avenue, L18-6400, P.O. Box 6511, Aurora, CO 80045, USA.
|
34
|
Kann MG. Advances in translational bioinformatics: computational approaches for the hunting of disease genes. Brief Bioinform 2010; 11:96-110. [PMID: 20007728 PMCID: PMC2810112 DOI: 10.1093/bib/bbp048] [Citation(s) in RCA: 53] [Impact Index Per Article: 3.5] [Received: 08/11/2009] [Revised: 09/15/2009] [Indexed: 12/29/2022] Open
Abstract
Over 100 years ago, William Bateson provided, through his observations of the transmission of alkaptonuria in first cousin offspring, evidence of the application of Mendelian genetics to certain human traits and diseases. His work was corroborated by Archibald Garrod (Garrod AE. The incidence of alkaptonuria: a study in chemical individuality. Lancet 1902;ii:1616-20) and William Farabee (Farabee WC. Inheritance of digital malformations in man. In: Papers of the Peabody Museum of American Archaeology and Ethnology. Cambridge, Mass: Harvard University, 1905; 65-78), who recorded the familial tendencies of inheritance of malformations of human hands and feet. These were the pioneers of the hunt for disease genes that would continue through the century and result in the discovery of hundreds of genes that can be associated with different diseases. Despite many ground-breaking discoveries during the last century, we are far from having a complete understanding of the intricate network of molecular processes involved in diseases, and we are still searching for the cures for most complex diseases. In the last few years, new genome sequencing and other high-throughput experimental techniques have generated vast amounts of molecular and clinical data that contain crucial information with the potential of leading to the next major biomedical discoveries. The need to mine, visualize and integrate these data has motivated the development of several informatics approaches that can broadly be grouped in the research area of 'translational bioinformatics'. This review highlights the latest advances in the field of translational bioinformatics, focusing on the advances of computational techniques to search for and classify disease genes.
Affiliation(s)
- Maricel G Kann
- University of Maryland, Baltimore County, 1000 Hilltop Circle, Baltimore, MD 21250, USA.
|
35
|
|
36
|
Feng W, Leach SM, Tipney H, Phang T, Geraci M, Spritz RA, Hunter LE, Williams T. Spatial and temporal analysis of gene expression during growth and fusion of the mouse facial prominences. PLoS One 2009; 4:e8066. [PMID: 20016822 PMCID: PMC2789411 DOI: 10.1371/journal.pone.0008066] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.2] [Received: 08/19/2009] [Accepted: 10/25/2009] [Indexed: 11/19/2022] Open
Abstract
Orofacial malformations resulting from genetic and/or environmental causes are frequent human birth defects yet their etiology is often unclear because of insufficient information concerning the molecular, cellular and morphogenetic processes responsible for normal facial development. We have, therefore, derived a comprehensive expression dataset for mouse orofacial development, interrogating three distinct regions – the mandibular, maxillary and frontonasal prominences. To capture the dynamic changes in the transcriptome during face formation, we sampled five time points between E10.5–E12.5, spanning the developmental period from establishment of the prominences to their fusion to form the mature facial platform. Seven independent biological replicates were used for each sample ensuring robustness and quality of the dataset. Here, we provide a general overview of the dataset, characterizing aspects of gene expression changes at both the spatial and temporal level. Considerable coordinate regulation occurs across the three prominences during this period of facial growth and morphogenesis, with a switch from expression of genes involved in cell proliferation to those associated with differentiation. An accompanying shift in the expression of polycomb and trithorax genes presumably maintains appropriate patterns of gene expression in precursor or differentiated cells, respectively. Superimposed on the many coordinated changes are prominence-specific differences in the expression of genes encoding transcription factors, extracellular matrix components, and signaling molecules. Thus, the elaboration of each prominence will be driven by particular combinations of transcription factors coupled with specific cell:cell and cell:matrix interactions. The dataset also reveals several prominence-specific genes not previously associated with orofacial development, a subset of which we externally validate. 
Several of these latter genes are components of bidirectional transcription units that likely share cis-acting sequences with well-characterized genes. Overall, our studies provide a valuable resource for probing orofacial development and a robust dataset for bioinformatic analysis of spatial and temporal gene expression changes during embryogenesis.
Affiliation(s)
- Weiguo Feng
- Department of Craniofacial Biology, University of Colorado Denver, Aurora, Colorado, United States of America
- Sonia M. Leach
- Department of Pharmacology, University of Colorado Denver, Aurora, Colorado, United States of America
- Hannah Tipney
- Department of Pharmacology, University of Colorado Denver, Aurora, Colorado, United States of America
- Tzulip Phang
- Department of Pharmacology, University of Colorado Denver, Aurora, Colorado, United States of America
- Mark Geraci
- Department of Medicine, University of Colorado Denver, Aurora, Colorado, United States of America
- Richard A. Spritz
- Human Medical Genetics Program, University of Colorado Denver, Aurora, Colorado, United States of America
- Lawrence E. Hunter
- Department of Pharmacology, University of Colorado Denver, Aurora, Colorado, United States of America
- Trevor Williams
- Department of Craniofacial Biology, University of Colorado Denver, Aurora, Colorado, United States of America
- Department of Cell and Developmental Biology, University of Colorado Denver, Aurora, Colorado, United States of America
|
37
|
Iossifov I, Rodriguez-Esteban R, Mayzus I, Millen KJ, Rzhetsky A. Looking at cerebellar malformations through text-mined interactomes of mice and humans. PLoS Comput Biol 2009; 5:e1000559. [PMID: 19893633 PMCID: PMC2767227 DOI: 10.1371/journal.pcbi.1000559] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 04/16/2009] [Accepted: 10/07/2009] [Indexed: 12/11/2022] Open
Abstract
We have generated and made publicly available two very large networks of molecular interactions: 49,493 mouse-specific and 52,518 human-specific interactions. These networks were generated through automated analysis of 368,331 full-text research articles and 8,039,972 article abstracts from the PubMed database, using the GeneWays system. Our networks cover a wide spectrum of molecular interactions, such as bind, phosphorylate, glycosylate, and activate; 207 of these interaction types occur more than 1,000 times in our unfiltered, multi-species data set. Because mouse and human genes are linked through an orthological relationship, human and mouse networks are amenable to straightforward, joint computational analysis. Using our newly generated networks and known associations between mouse genes and cerebellar malformation phenotypes, we predicted a number of new associations between genes and five cerebellar phenotypes (small cerebellum, absent cerebellum, cerebellar degeneration, abnormal foliation, and abnormal vermis). Using a battery of statistical tests, we showed that genes that are associated with cerebellar phenotypes tend to form compact network clusters. Further, we observed that cerebellar malformation phenotypes tend to be associated with highly connected genes. This tendency was stronger for developmental phenotypes and weaker for cerebellar degeneration. We described and made publicly available the largest existing set of text-mined statements; we also presented its application to an important biological problem. We have extracted and purified two large molecular networks, one for humans and one for mouse. We characterized the data sets, described the methods we used to generate them, and presented a novel biological application of the networks to study the etiology of five cerebellum phenotypes. We demonstrated quantitatively that the development-related malformations differ in their system-level properties from degeneration-related genes. 
We showed that there is a high degree of overlap among the genes implicated in the developmental malformations, that these genes have a strong tendency to be highly connected within the molecular network, and that they also tend to be clustered together, forming a compact molecular network neighborhood. In contrast, the genes involved in malformations due to degeneration do not have a high degree of connectivity, are not strongly clustered in the network, and do not overlap significantly with the development related genes. In addition, taking into account the above-mentioned system-level properties and the gene-specific network interactions, we made highly confident predictions about novel genes that are likely also involved in the etiology of the analyzed phenotypes.
Affiliation(s)
- Ivan Iossifov
- Cold Spring Harbor Laboratory, Cold Spring Harbor, New York, United States of America
- Raul Rodriguez-Esteban
- Biotherapeutics and Integrative Biology, Boehringer Ingelheim, Ridgefield, Connecticut, United States of America
- Ilya Mayzus
- Center for Computational Biology and Bioinformatics, Columbia University, New York, New York, United States of America
- Kathleen J. Millen
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Andrey Rzhetsky
- Department of Human Genetics, University of Chicago, Chicago, Illinois, United States of America
- Department of Medicine, Institute for Genomics and Systems Biology, Computation Institute, University of Chicago, Chicago, Illinois, United States of America
|
38
|
Tipney HJ, Schuyler RP, Hunter L. Consistent visualizations of changing knowledge. SUMMIT ON TRANSLATIONAL BIOINFORMATICS 2009; 2009:129-32. [PMID: 21347184 PMCID: PMC3041575] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/05/2022]
Abstract
Networks are increasingly used in biology to represent complex data in uncomplicated symbolic form. However, as biological knowledge is continually evolving, so must the networks that represent it. Capturing and presenting this type of knowledge change over time is particularly challenging due to the intimate manner in which researchers customize the networks they come into contact with. The effective visualization of this knowledge is important as it creates insight into complex systems and stimulates hypothesis generation and biological discovery. Here we highlight how the retention of user customizations and the collection and visualization of knowledge-associated provenance support effective and productive network exploration. We also present an extension of the Hanalyzer system, ReOrient, which supports network exploration and analysis in the presence of knowledge change.
|
39
|
Tipney HJ, Leach SM, Feng W, Spritz R, Williams T, Hunter L. Leveraging existing biological knowledge in the identification of candidate genes for facial dysmorphology. BMC Bioinformatics 2009; 10 Suppl 2:S12. [PMID: 19208187 PMCID: PMC2646237 DOI: 10.1186/1471-2105-10-s2-s12] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 12/31/2022] Open
Abstract
Background: In response to the frequently overwhelming output of high-throughput microarray experiments, we propose a methodology to facilitate interpretation of biological data in the context of existing knowledge. Through the probabilistic integration of explicit and implicit data sources, a functional interaction network can be constructed. Each edge connecting two proteins is weighted by a confidence value capturing the strength and reliability of support for that interaction given the combined data sources. The resulting network is examined in conjunction with expression data to identify groups of genes with significant temporal or tissue-specific patterns. In contrast to unstructured gene lists, these networks often represent coherent functional groupings.
Results: By linking from shared functional categorizations to primary biological resources, we apply this method to craniofacial microarray data, generating biologically testable hypotheses and identifying candidate genes for craniofacial development.
Conclusion: The novel methodology presented here illustrates how the effective integration of pre-existing biological knowledge and high-throughput experimental data drives biological discovery and hypothesis generation.
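The edge weighting described in this abstract can be sketched as follows. The abstract does not state the combination formula, so the noisy-OR rule used here is an assumption, and the source names, reliability values and gene identifiers are all hypothetical. The sketch only shows how several independent sources of support might be merged into one confidence value per edge.

```python
def noisy_or(reliabilities):
    """Probability that at least one supporting source is correct,
    assuming the sources are independent (an assumption of this sketch)."""
    p_all_wrong = 1.0
    for r in reliabilities:
        p_all_wrong *= 1.0 - r
    return 1.0 - p_all_wrong

# Hypothetical reliability assigned to each data source.
source_reliability = {
    "ppi_database": 0.9,
    "coexpression": 0.4,
    "literature_cooccurrence": 0.3,
}

# Hypothetical edges and the sources that support each one.
edge_support = {
    ("GeneA", "GeneB"): ["ppi_database", "coexpression"],
    ("GeneA", "GeneC"): ["literature_cooccurrence"],
}

edge_confidence = {
    edge: noisy_or(source_reliability[s] for s in sources)
    for edge, sources in edge_support.items()
}
```

Under this rule an edge backed by several sources always scores at least as high as its single most reliable source, which matches the intuition that agreement between sources strengthens an interaction.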
Affiliation(s)
- Hannah J Tipney
- Computational Pharmacology Department, University of Colorado at Denver and Health Sciences Center, Aurora, CO, USA.
|
40
|
Predicting positive p53 cancer rescue regions using Most Informative Positive (MIP) active learning. PLoS Comput Biol 2009; 5:e1000498. [PMID: 19756158 PMCID: PMC2742196 DOI: 10.1371/journal.pcbi.1000498] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Received: 04/21/2009] [Accepted: 08/04/2009] [Indexed: 11/19/2022] Open
Abstract
Many protein engineering problems involve finding mutations that produce proteins with a particular function. Computational active learning is an attractive approach to discover desired biological activities. Traditional active learning techniques have been optimized to iteratively improve classifier accuracy, not to quickly discover biologically significant results. We report here a novel active learning technique, Most Informative Positive (MIP), which is tailored to biological problems because it seeks novel and informative positive results. MIP active learning differs from traditional active learning methods in two ways: (1) it preferentially seeks Positive (functionally active) examples; and (2) it may be effectively extended to select gene regions suitable for high throughput combinatorial mutagenesis. We applied MIP to discover mutations in the tumor suppressor protein p53 that reactivate mutated p53 found in human cancers. This is an important biomedical goal because p53 mutants have been implicated in half of all human cancers, and restoring active p53 in tumors leads to tumor regression. MIP found Positive (cancer rescue) p53 mutants in silico using 33% fewer experiments than traditional non-MIP active learning, with only a minor decrease in classifier accuracy. Applying MIP to in vivo experimentation yielded immediate Positive results. Ten different p53 mutations found in human cancers were paired in silico with all possible single amino acid rescue mutations, from which MIP was used to select a Positive Region predicted to be enriched for p53 cancer rescue mutants. In vivo assays showed that the predicted Positive Region: (1) had significantly more (p<0.01) new strong cancer rescue mutants than control regions (Negative, and non-MIP active learning); (2) had slightly more new strong cancer rescue mutants than an Expert region selected for purely biological considerations; and (3) rescued for the first time the previously unrescuable p53 cancer mutant P152L.
Engineering proteins to acquire or enhance a particular useful function is at the core of many biomedical problems. This paper presents Most Informative Positive (MIP) active learning, a novel integrated computational/biological approach designed to help guide biological discovery of novel and informative positive mutants. A classifier, together with modeled structure-based features, helps guide biological experiments and so accelerates protein engineering studies. MIP reduces the number of expensive biological experiments needed to achieve novel and informative positive results. We used the MIP method to discover novel p53 cancer rescue mutants. p53 is a tumor suppressor protein, and destructive p53 mutations have been implicated in half of all human cancers. Second-site cancer rescue mutations restore p53 activity and eventually may facilitate rational design of better cancer drugs. This paper shows that, even in the first round of in vivo experiments, MIP significantly increased the discovery rate of novel and informative positive mutants.
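The contrast this abstract draws between accuracy-driven and positive-seeking selection can be sketched in a few lines. This is not the MIP scoring rule, which the abstract does not give. It is a simplified stand-in that assumes a classifier has already produced a probability of being Positive for each untested candidate. The candidate names and scores are hypothetical.

```python
def pick_most_uncertain(scores):
    """Classic uncertainty sampling: choose the candidate whose predicted
    probability is closest to 0.5, which mainly helps classifier accuracy."""
    return min(scores, key=lambda name: abs(scores[name] - 0.5))

def pick_most_likely_positive(scores):
    """Positive-seeking selection (a stand-in for MIP's preference for
    Positive examples): choose the candidate most likely to be active."""
    return max(scores, key=scores.get)

# Hypothetical classifier outputs: probability that each untested
# mutant is Positive (functionally active).
scores = {"mutant_1": 0.52, "mutant_2": 0.91, "mutant_3": 0.08}

next_for_accuracy = pick_most_uncertain(scores)         # "mutant_1"
next_for_discovery = pick_most_likely_positive(scores)  # "mutant_2"
```

The two rules pick different experiments from the same scores, which is the trade-off the abstract reports: fewer experiments to find Positives at the cost of a minor decrease in classifier accuracy.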
|