1
|
Landolsi MY, Hlaoua L, Ben Romdhane L. Information extraction from electronic medical documents: state of the art and future research directions. Knowl Inf Syst 2023; 65:463-516. [PMID: 36405956 PMCID: PMC9640816 DOI: 10.1007/s10115-022-01779-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2022] [Revised: 05/04/2022] [Accepted: 10/17/2022] [Indexed: 11/10/2022]
Abstract
In the medical field, a doctor must have a comprehensive knowledge by reading and writing narrative documents, and he is responsible for every decision he takes for patients. Unfortunately, it is very tiring to read all necessary information about drugs, diseases and patients due to the large amount of documents that are increasing every day. Consequently, so many medical errors can happen and even kill people. Likewise, there is such an important field that can handle this problem, which is the information extraction. There are several important tasks in this field to extract the important and desired information from unstructured text written in natural language. The main principal tasks are named entity recognition and relation extraction since they can structure the text by extracting the relevant information. However, in order to treat the narrative text we should use natural language processing techniques to extract useful information and features. In our paper, we introduce and discuss the several techniques and solutions used in these tasks. Furthermore, we outline the challenges in information extraction from medical documents. In our knowledge, this is the most comprehensive survey in the literature with an experimental analysis and a suggestion for some uncovered directions.
Collapse
Affiliation(s)
- Mohamed Yassine Landolsi
- MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
| | - Lobna Hlaoua
- MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
| | - Lotfi Ben Romdhane
- MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
| |
Collapse
|
2
|
Schultes E, Roos M, Bonino da Silva Santos LO, Guizzardi G, Bouwman J, Hankemeier T, Baak A, Mons B. FAIR Digital Twins for Data-Intensive Research. Front Big Data 2022; 5:883341. [PMID: 35647536 PMCID: PMC9130601 DOI: 10.3389/fdata.2022.883341] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/24/2022] [Accepted: 04/12/2022] [Indexed: 11/13/2022] Open
Abstract
Although all the technical components supporting fully orchestrated Digital Twins (DT) currently exist, what remains missing is a conceptual clarification and analysis of a more generalized concept of a DT that is made FAIR, that is, universally machine actionable. This methodological overview is a first step toward this clarification. We present a review of previously developed semantic artifacts and how they may be used to compose a higher-order data model referred to here as a FAIR Digital Twin (FDT). We propose an architectural design to compose, store and reuse FDTs supporting data intensive research, with emphasis on privacy by design and their use in GDPR compliant open science.
Collapse
|
3
|
Wang Y, Zhang S, Yang L, Yang S, Tian Y, Ma Q. Measurement of Conditional Relatedness Between Genes Using Fully Convolutional Neural Network. Front Genet 2019; 10:1009. [PMID: 31695723 PMCID: PMC6818468 DOI: 10.3389/fgene.2019.01009] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/18/2019] [Accepted: 09/23/2019] [Indexed: 11/13/2022] Open
Abstract
Measuring conditional relatedness, the degree of relation between a pair of genes in a certain condition, is a basic but difficult task in bioinformatics, as traditional co-expression analysis methods rely on co-expression similarities, well known with high false positive rate. Complement with prior-knowledge similarities is a feasible way to tackle the problem. However, classical combination machine learning algorithms fail in detection and application of the complex mapping relations between similarities and conditional relatedness, so a powerful predictive model will have enormous benefit for measuring this kind of complex mapping relations. To this need, we propose a novel deep learning model of convolutional neural network with a fully connected first layer, named fully convolutional neural network (FCNN), to measure conditional relatedness between genes using both co-expression and prior-knowledge similarities. The results on validation and test datasets show FCNN model yields an average 3.0% and 2.7% higher accuracy values for identifying gene–gene interactions collected from the COXPRESdb, KEGG, and TRRUST databases, and a benchmark dataset of Xiao-Yong et al. research, by grid-search 10-fold cross validation, respectively. In order to estimate the FCNN model, we conduct a further verification on the GeneFriends and DIP datasets, and the FCNN model obtains an average of 1.8% and 7.6% higher accuracy, respectively. Then the FCNN model is applied to construct cancer gene networks, and also calls more practical results than other compared models and methods. A website of the FCNN model and relevant datasets can be accessed from https://bmbl.bmi.osumc.edu/FCNN.
Collapse
Affiliation(s)
- Yan Wang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China.,School of Artificial Intelligence, Jilin University, Changchun, China
| | - Shuangquan Zhang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
| | - Lili Yang
- Department of Obstetrics, The First Hospital of Jilin University, Changchun, China
| | - Sen Yang
- Key Laboratory of Symbol Computation and Knowledge Engineering of Ministry of Education, College of Computer Science and Technology, Jilin University, Changchun, China
| | - Yuan Tian
- School of Artificial Intelligence, Jilin University, Changchun, China
| | - Qin Ma
- Department of Biomedical Informatics, College of Medicine, The Ohio State University, Columbus, OH, United States
| |
Collapse
|
4
|
Kim YH, Song M. A context-based ABC model for literature-based discovery. PLoS One 2019; 14:e0215313. [PMID: 31017923 PMCID: PMC6481912 DOI: 10.1371/journal.pone.0215313] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/28/2018] [Accepted: 03/29/2019] [Indexed: 12/13/2022] Open
Abstract
Background In the literature-based discovery, considerable research has been done based on the ABC model developed by Swanson. ABC model hypothesizes that there is a meaningful relation between entity A extracted from document set 1 and entity C extracted from document set 2 through B entities that appear commonly in both document sets. The results of ABC model are relations among entity A, B, and C, which is referred as paths. A path allows for hypothesizing the relationship between entity A and entity C, or helps discover entity B as a new evidence for the relationship between entity A and entity C. The co-occurrence based approach of ABC model is a well-known approach to automatic hypothesis generation by creating various paths. However, the co-occurrence based ABC model has a limitation, in that biological context is not considered. It focuses only on matching of B entity which commonly appears in relation between two entities. Therefore, the paths extracted by the co-occurrence based ABC model tend to include a lot of irrelevant paths, meaning that expert verification is essential. Methods In order to overcome this limitation of the co-occurrence based ABC model, we propose a context-based approach to connecting one entity relation to another, modifying the ABC model using biological contexts. In this study, we defined four biological context elements: cell, drug, disease, and organism. Based on these biological context, we propose two extended ABC models: a context-based ABC model and a context-assignment-based ABC model. In order to measure the performance of the both proposed models, we examined the relevance of the B entities between the well-known relations “APOE–MAPT” as well as “FUS–TARDBP”. Each relation means interaction between neurodegenerative disease associated with proteins. The interaction between APOE and MAPT is known to play a crucial role in Alzheimer’s disease as APOE affects tau-mediated neurodegeneration. It has been shown that mutation in FUS and TARDBP are associated with amyotrophic lateral sclerosis(ALS), a motor neuron disease by leading to neuronal cell death. Using these two relations, we compared both of proposed models to co-occurrence based ABC model. Results The precision of B entities by co-occurrence based ABC model was 27.1% for “APOE–MAPT” and 22.1% for “FUS–TARDBP”, respectively. In context-based ABC model, precision of extracted B entities was 71.4% for “APOE–MAPT”, and 77.9% for “FUS–TARDBP”. Context-assignment based ABC model achieved 89% and 97.5% precision for the two relations, respectively. Both proposed models achieved a higher precision than co-occurrence-based ABC model.
Collapse
Affiliation(s)
- Yong Hwan Kim
- Division of Humanities, CheongJu University, CheongJu, Korea
| | - Min Song
- Department of Library and Information Science, Yonsei University, Seoul, Korea
- * E-mail:
| |
Collapse
|
5
|
Data Processing and Text Mining Technologies on Electronic Medical Records: A Review. JOURNAL OF HEALTHCARE ENGINEERING 2018; 2018:4302425. [PMID: 29849998 PMCID: PMC5911323 DOI: 10.1155/2018/4302425] [Citation(s) in RCA: 80] [Impact Index Per Article: 13.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/10/2017] [Revised: 01/29/2018] [Accepted: 02/18/2018] [Indexed: 11/18/2022]
Abstract
Currently, medical institutes generally use EMR to record patient's condition, including diagnostic information, procedures performed, and treatment results. EMR has been recognized as a valuable resource for large-scale analysis. However, EMR has the characteristics of diversity, incompleteness, redundancy, and privacy, which make it difficult to carry out data mining and analysis directly. Therefore, it is necessary to preprocess the source data in order to improve data quality and improve the data mining results. Different types of data require different processing technologies. Most structured data commonly needs classic preprocessing technologies, including data cleansing, data integration, data transformation, and data reduction. For semistructured or unstructured data, such as medical text, containing more health information, it requires more complex and challenging processing methods. The task of information extraction for medical texts mainly includes NER (named-entity recognition) and RE (relation extraction). This paper focuses on the process of EMR processing and emphatically analyzes the key techniques. In addition, we make an in-depth study on the applications developed based on text mining together with the open challenges and research issues for future work.
Collapse
|
6
|
Song M, Kim M, Kang K, Kim YH, Jeon S. Application of Public Knowledge Discovery Tool (PKDE4J) to Represent Biomedical Scientific Knowledge. Front Res Metr Anal 2018. [DOI: 10.3389/frma.2018.00007] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/04/2023] Open
|
7
|
Özgür A, Hur J, He Y. The Interaction Network Ontology-supported modeling and mining of complex interactions represented with multiple keywords in biomedical literature. BioData Min 2016; 9:41. [PMID: 28031747 PMCID: PMC5168857 DOI: 10.1186/s13040-016-0118-0] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/16/2016] [Accepted: 11/30/2016] [Indexed: 01/15/2023] Open
Abstract
Background The Interaction Network Ontology (INO) logically represents biological interactions, pathways, and networks. INO has been demonstrated to be valuable in providing a set of structured ontological terms and associated keywords to support literature mining of gene-gene interactions from biomedical literature. However, previous work using INO focused on single keyword matching, while many interactions are represented with two or more interaction keywords used in combination. Methods This paper reports our extension of INO to include combinatory patterns of two or more literature mining keywords co-existing in one sentence to represent specific INO interaction classes. Such keyword combinations and related INO interaction type information could be automatically obtained via SPARQL queries, formatted in Excel format, and used in an INO-supported SciMiner, an in-house literature mining program. We studied the gene interaction sentences from the commonly used benchmark Learning Logic in Language (LLL) dataset and one internally generated vaccine-related dataset to identify and analyze interaction types containing multiple keywords. Patterns obtained from the dependency parse trees of the sentences were used to identify the interaction keywords that are related to each other and collectively represent an interaction type. Results The INO ontology currently has 575 terms including 202 terms under the interaction branch. The relations between the INO interaction types and associated keywords are represented using the INO annotation relations: ‘has literature mining keywords’ and ‘has keyword dependency pattern’. The keyword dependency patterns were generated via running the Stanford Parser to obtain dependency relation types. Out of the 107 interactions in the LLL dataset represented with two-keyword interaction types, 86 were identified by using the direct dependency relations. The LLL dataset contained 34 gene regulation interaction types, each of which associated with multiple keywords. A hierarchical display of these 34 interaction types and their ancestor terms in INO resulted in the identification of specific gene-gene interaction patterns from the LLL dataset. The phenomenon of having multi-keyword interaction types was also frequently observed in the vaccine dataset. Conclusions By modeling and representing multiple textual keywords for interaction types, the extended INO enabled the identification of complex biological gene-gene interactions represented with multiple keywords. Electronic supplementary material The online version of this article (doi:10.1186/s13040-016-0118-0) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Arzucan Özgür
- Department of Computer Engineering, Bogazici University, 34342 Istanbul, Turkey
| | - Junguk Hur
- Department of Biomedical Sciences, University of North Dakota School of Medicine and Health Sciences, Grand Forks, ND 58202 USA
| | - Yongqun He
- Unit for Laboratory Animal Medicine, University of Michigan, Ann Arbor, MI 48109 USA.,Department of Microbiology and Immunology, University of Michigan, Ann Arbor, MI 48109 USA.,Center for Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI 48109 USA.,Comprehensive Cancer Center, University of Michigan, Ann Arbor, MI 48109 USA
| |
Collapse
|
8
|
Gökdeniz E, Özgür A, Canbeyli R. Automated Neuroanatomical Relation Extraction: A Linguistically Motivated Approach with a PVT Connectivity Graph Case Study. Front Neuroinform 2016; 10:39. [PMID: 27708573 PMCID: PMC5030238 DOI: 10.3389/fninf.2016.00039] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2016] [Accepted: 08/23/2016] [Indexed: 11/13/2022] Open
Abstract
Identifying the relations among different regions of the brain is vital for a better understanding of how the brain functions. While a large number of studies have investigated the neuroanatomical and neurochemical connections among brain structures, their specific findings are found in publications scattered over a large number of years and different types of publications. Text mining techniques have provided the means to extract specific types of information from a large number of publications with the aim of presenting a larger, if not necessarily an exhaustive picture. By using natural language processing techniques, the present paper aims to identify connectivity relations among brain regions in general and relations relevant to the paraventricular nucleus of the thalamus (PVT) in particular. We introduce a linguistically motivated approach based on patterns defined over the constituency and dependency parse trees of sentences. Besides the presence of a relation between a pair of brain regions, the proposed method also identifies the directionality of the relation, which enables the creation and analysis of a directional brain region connectivity graph. The approach is evaluated over the manually annotated data sets of the WhiteText Project. In addition, as a case study, the method is applied to extract and analyze the connectivity graph of PVT, which is an important brain region that is considered to influence many functions ranging from arousal, motivation, and drug-seeking behavior to attention. The results of the PVT connectivity graph show that PVT may be a new target of research in mood assessment.
Collapse
Affiliation(s)
- Erinç Gökdeniz
- Department of Computer Engineering, Boğaziçi University İstanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University İstanbul, Turkey
| | - Reşit Canbeyli
- Department of Psychology, Boğaziçi University İstanbul, Turkey
| |
Collapse
|
9
|
Roy S, Curry BC, Madahian B, Homayouni R. Prioritization, clustering and functional annotation of MicroRNAs using latent semantic indexing of MEDLINE abstracts. BMC Bioinformatics 2016; 17:350. [PMID: 27766940 PMCID: PMC5073981 DOI: 10.1186/s12859-016-1223-2] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/07/2023] Open
Abstract
Background The amount of scientific information about MicroRNAs (miRNAs) is growing exponentially, making it difficult for researchers to interpret experimental results. In this study, we present an automated text mining approach using Latent Semantic Indexing (LSI) for prioritization, clustering and functional annotation of miRNAs. Results For approximately 900 human miRNAs indexed in miRBase, text documents were created by concatenating titles and abstracts of MEDLINE citations which refer to the miRNAs. The documents were parsed and a weighted term-by-miRNA frequency matrix was created, which was subsequently factorized via singular value decomposition to extract pair-wise cosine values between the term (keyword) and miRNA vectors in reduced rank semantic space. LSI enables derivation of both explicit and implicit associations between entities based on word usage patterns. Using miR2Disease as a gold standard, we found that LSI identified keyword-to-miRNA relationships with high accuracy. In addition, we demonstrate that pair-wise associations between miRNAs can be used to group them into categories which are functionally aligned. Finally, term ranking by querying the LSI space with a group of miRNAs enabled annotation of the clusters with functionally related terms. Conclusions LSI modeling of MEDLINE abstracts provides a robust and automated method for miRNA related knowledge discovery. The latest collection of miRNA abstracts and LSI model can be accessed through the web tool miRNA Literature Network (miRLiN) at http://bioinfo.memphis.edu/mirlin. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-1223-2) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Sujoy Roy
- Bioinformatics Program, University of Memphis, Memphis, 38152, USA.,Center for Translational Informatics, University of Memphis, Memphis, 38152, USA
| | - Brandon C Curry
- Bioinformatics Program, University of Memphis, Memphis, 38152, USA
| | - Behrouz Madahian
- Department of Mathematical Sciences, University of Memphis, Memphis, 38152, USA
| | - Ramin Homayouni
- Bioinformatics Program, University of Memphis, Memphis, 38152, USA. .,Center for Translational Informatics, University of Memphis, Memphis, 38152, USA. .,Department of Biology, University of Memphis, Memphis, 38152, USA.
| |
Collapse
|
10
|
Karadeniz İ, Hur J, He Y, Özgür A. Literature Mining and Ontology based Analysis of Host-Brucella Gene-Gene Interaction Network. Front Microbiol 2015; 6:1386. [PMID: 26696993 PMCID: PMC4673313 DOI: 10.3389/fmicb.2015.01386] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/19/2015] [Accepted: 11/20/2015] [Indexed: 01/27/2023] Open
Abstract
Brucella is an intracellular bacterium that causes chronic brucellosis in humans and various mammals. The identification of host-Brucella interaction is crucial to understand host immunity against Brucella infection and Brucella pathogenesis against host immune responses. Most of the information about the inter-species interactions between host and Brucella genes is only available in the text of the scientific publications. Many text-mining systems for extracting gene and protein interactions have been proposed. However, only a few of them have been designed by considering the peculiarities of host–pathogen interactions. In this paper, we used a text mining approach for extracting host-Brucella gene–gene interactions from the abstracts of articles in PubMed. The gene–gene interactions here represent the interactions between genes and/or gene products (e.g., proteins). The SciMiner tool, originally designed for detecting mammalian gene/protein names in text, was extended to identify host and Brucella gene/protein names in the abstracts. Next, sentence-level and abstract-level co-occurrence based approaches, as well as sentence-level machine learning based methods, originally designed for extracting intra-species gene interactions, were utilized to extract the interactions among the identified host and Brucella genes. The extracted interactions were manually evaluated. A total of 46 host-Brucella gene interactions were identified and represented as an interaction network. Twenty four of these interactions were identified from sentence-level processing. Twenty two additional interactions were identified when abstract-level processing was performed. The Interaction Network Ontology (INO) was used to represent the identified interaction types at a hierarchical ontology structure. Ontological modeling of specific gene–gene interactions demonstrates that host–pathogen gene–gene interactions occur at experimental conditions which can be ontologically represented. Our results show that the introduced literature mining and ontology-based modeling approach are effective in retrieving and analyzing host–pathogen gene–gene interaction networks.
Collapse
Affiliation(s)
- İlknur Karadeniz
- Department of Computer Engineering, Boğaziçi University Istanbul, Turkey
| | - Junguk Hur
- Department of Basic Sciences, School of Medicine and Health Sciences, University of North Dakota, Grand Forks ND, USA
| | - Yongqun He
- Unit for Laboratory Animal Medicine, Department of Microbiology and Immunology, University of Michigan, Ann Arbor MI, USA ; Department of Computational Medicine and Bioinformatics, University of Michigan Medical School, Ann Arbor MI, USA ; Comprehensive Cancer Center, University of Michigan Health System, Ann Arbor MI, USA
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University Istanbul, Turkey
| |
Collapse
|
11
|
Song M, Kim WC, Lee D, Heo GE, Kang KY. PKDE4J: Entity and relation extraction for public knowledge discovery. J Biomed Inform 2015; 57:320-32. [PMID: 26277115 DOI: 10.1016/j.jbi.2015.08.008] [Citation(s) in RCA: 69] [Impact Index Per Article: 7.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2015] [Revised: 07/20/2015] [Accepted: 08/06/2015] [Indexed: 11/18/2022]
Abstract
Due to an enormous number of scientific publications that cannot be handled manually, there is a rising interest in text-mining techniques for automated information extraction, especially in the biomedical field. Such techniques provide effective means of information search, knowledge discovery, and hypothesis generation. Most previous studies have primarily focused on the design and performance improvement of either named entity recognition or relation extraction. In this paper, we present PKDE4J, a comprehensive text-mining system that integrates dictionary-based entity extraction and rule-based relation extraction in a highly flexible and extensible framework. Starting with the Stanford CoreNLP, we developed the system to cope with multiple types of entities and relations. The system also has fairly good performance in terms of accuracy as well as the ability to configure text-processing components. We demonstrate its competitive performance by evaluating it on many corpora and found that it surpasses existing systems with average F-measures of 85% for entity extraction and 81% for relation extraction.
Collapse
Affiliation(s)
- Min Song
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Republic of Korea.
| | - Won Chul Kim
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Republic of Korea.
| | - Dahee Lee
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Republic of Korea.
| | - Go Eun Heo
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Republic of Korea.
| | - Keun Young Kang
- Department of Library and Information Science, Yonsei University, 50 Yonsei-ro, Seodaemun-gu, Seoul 120-749, Republic of Korea.
| |
Collapse
|
12
|
Karadeniz İ, Özgür A. Detection and categorization of bacteria habitats using shallow linguistic analysis. BMC Bioinformatics 2015. [PMID: 26201262 PMCID: PMC4511461 DOI: 10.1186/1471-2105-16-s10-s5] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/03/2022] Open
Abstract
Background Information regarding bacteria biotopes is important for several research areas including health sciences, microbiology, and food processing and preservation. One of the challenges for scientists in these domains is the huge amount of information buried in the text of electronic resources. Developing methods to automatically extract bacteria habitat relations from the text of these electronic resources is crucial for facilitating research in these areas. Methods We introduce a linguistically motivated rule-based approach for recognizing and normalizing names of bacteria habitats in biomedical text by using an ontology. Our approach is based on the shallow syntactic analysis of the text that include sentence segmentation, part-of-speech (POS) tagging, partial parsing, and lemmatization. In addition, we propose two methods for identifying bacteria habitat localization relations. The underlying assumption for the first method is that discourse changes with a new paragraph. Therefore, it operates on a paragraph-basis. The second method performs a more fine-grained analysis of the text and operates on a sentence-basis. We also develop a novel anaphora resolution method for bacteria coreferences and incorporate it with the sentence-based relation extraction approach. Results We participated in the Bacteria Biotope (BB) Task of the BioNLP Shared Task 2013. Our system (Boun) achieved the second best performance with 68% Slot Error Rate (SER) in Sub-task 1 (Entity Detection and Categorization), and ranked third with an F-score of 27% in Sub-task 2 (Localization Event Extraction). This paper reports the system that is implemented for the shared task, including the novel methods developed and the improvements obtained after the official evaluation. The extensions include the expansion of the OntoBiotope ontology using the training set for Sub-task 1, and the novel sentence-based relation extraction method incorporated with anaphora resolution for Sub-task 2. These extensions resulted in promising results for Sub-task 1 with a SER of 68%, and state-of-the-art performance for Sub-task 2 with an F-score of 53%. Conclusions Our results show that a linguistically-oriented approach based on the shallow syntactic analysis of the text is as effective as machine learning approaches for the detection and ontology-based normalization of habitat entities. Furthermore, the newly developed sentence-based relation extraction system with the anaphora resolution module significantly outperforms the paragraph-based one, as well as the other systems that participated in the BB Shared Task 2013.
Collapse
|
13
|
Durmuş S, Çakır T, Özgür A, Guthke R. A review on computational systems biology of pathogen-host interactions. Front Microbiol 2015; 6:235. [PMID: 25914674 PMCID: PMC4391036 DOI: 10.3389/fmicb.2015.00235] [Citation(s) in RCA: 48] [Impact Index Per Article: 5.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2014] [Accepted: 03/10/2015] [Indexed: 12/27/2022] Open
Abstract
Pathogens manipulate the cellular mechanisms of host organisms via pathogen-host interactions (PHIs) in order to take advantage of the capabilities of host cells, leading to infections. The crucial role of these interspecies molecular interactions in initiating and sustaining infections necessitates a thorough understanding of the corresponding mechanisms. Unlike the traditional approach of considering the host or pathogen separately, a systems-level approach, considering the PHI system as a whole is indispensable to elucidate the mechanisms of infection. Following the technological advances in the post-genomic era, PHI data have been produced in large-scale within the last decade. Systems biology-based methods for the inference and analysis of PHI regulatory, metabolic, and protein-protein networks to shed light on infection mechanisms are gaining increasing demand thanks to the availability of omics data. The knowledge derived from the PHIs may largely contribute to the identification of new and more efficient therapeutics to prevent or cure infections. There are recent efforts for the detailed documentation of these experimentally verified PHI data through Web-based databases. Despite these advances in data archiving, there are still large amounts of PHI data in the biomedical literature yet to be discovered, and novel text mining methods are in development to unearth such hidden data. Here, we review a collection of recent studies on computational systems biology of PHIs with a special focus on the methods for the inference and analysis of PHI networks, covering also the Web-based databases and text-mining efforts to unravel the data hidden in the literature.
Collapse
Affiliation(s)
- Saliha Durmuş
- Computational Systems Biology Group, Department of Bioengineering, Gebze Technical University, KocaeliTurkey
| | - Tunahan Çakır
- Computational Systems Biology Group, Department of Bioengineering, Gebze Technical University, KocaeliTurkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boǧaziçi University, IstanbulTurkey
| | - Reinhard Guthke
- Leibniz Institute for Natural Product Research and Infection Biology – Hans-Knoell-Institute, JenaGermany
| |
Collapse
|
14
|
Kissa M, Tsatsaronis G, Schroeder M. Prediction of drug gene associations via ontological profile similarity with application to drug repositioning. Methods 2015; 74:71-82. [DOI: 10.1016/j.ymeth.2014.11.017] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/26/2014] [Revised: 11/21/2014] [Accepted: 11/25/2014] [Indexed: 01/10/2023] Open
|
15
|
Chiang JH, Ju JH. Discovering novel protein–protein interactions by measuring the protein semantic similarity from the biomedical literature. J Bioinform Comput Biol 2015; 12:1442008. [DOI: 10.1142/s0219720014420086] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
Protein–protein interactions (PPIs) are involved in the majority of biological processes. Identification of PPIs is therefore one of the key aims of biological research. Although there are many databases of PPIs, many other unidentified PPIs could be buried in the biomedical literature. Therefore, automated identification of PPIs from biomedical literature repositories could be used to discover otherwise hidden interactions. Search engines, such as Google, have been successfully applied to measure the relatedness among words. Inspired by such approaches, we propose a novel method to identify PPIs through semantic similarity measures among protein mentions. We define six semantic similarity measures as features based on the page counts retrieved from the MEDLINE database. A machine learning classifier, Random Forest, is trained using the above features. The proposed approach achieve an averaged micro-F of 71.28% and an averaged macro-F of 64.03% over five PPI corpora, an improvement over the results of using only the conventional co-occurrence feature (averaged micro-F of 68.79% and an averaged macro-F of 60.49%). A relation-word reinforcement further improves the averaged micro-F to 71.3% and averaged macro-F to 65.12%. Comparing the results of the current work with other studies on the AIMed corpus (ranging from 77.58% to 85.1% in micro-F, 62.18% to 76.27% in macro-F), we show that the proposed approach achieves micro-F of 81.88% and macro-F of 64.01% without the use of sophisticated feature extraction. Finally, we manually examine the newly discovered PPI pairs based on a literature review, and the results suggest that our approach could extract novel protein–protein interactions.
Collapse
Affiliation(s)
- Jung-Hsien Chiang
- Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, East District, Tainan City 701, Taiwan
| | - Jiun-Huang Ju
- Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, East District, Tainan City 701, Taiwan
| |
Collapse
|
16
|
Rule extraction in gene-disease relationship discovery. Gene 2013; 518:132-8. [PMID: 23235120 DOI: 10.1016/j.gene.2012.11.060] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/05/2012] [Accepted: 11/27/2012] [Indexed: 11/24/2022]
Abstract
BACKGROUND Biomedical data available to researchers and clinicians have increased dramatically over the past years because of the exponential growth of knowledge in medical biology. It is difficult for curators to go through all of the unstructured documents so as to curate the information to the database. Associating genes with diseases is important because it is a fundamental challenge in human health with applications to understanding disease properties and developing new techniques for prevention, diagnosis and therapy. METHODS Our study uses the automatic rule-learning approach to gene-disease relationship extraction. We first prepare the experimental corpus from MEDLINE and OMIM. A parser is applied to produce some grammatical information. We then learn all possible rules that discriminate relevant from irrelevant sentences. After that, we compute the scores of the learned rules in order to select rules of interest. As a result, a set of rules is generated. RESULTS We produce the learned rules automatically from the 1000 positive and 1000 negative sentences. The test set includes 400 sentences composed of 200 positives and 200 negatives. Precision, recall and F-score served as our evaluation metrics. The results reveal that the maximal precision rate is 77.8% and the maximal recall rate is 63.5%. The maximal F-score is 66.9% where the precision rate is 70.6% and the recall rate is 63.5%. CONCLUSIONS We employ the rule-learning approach to extract gene-disease relationships. Our main contributions are to build rules automatically and to support a more complete set of rules than a manually generated one. The experiments show exhilarating results and some improving efforts will be made in the future.
Collapse
|
17
|
Hossain MS, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N. Connecting the dots between PubMed abstracts. PLoS One 2012; 7:e29509. [PMID: 22235301 PMCID: PMC3250456 DOI: 10.1371/journal.pone.0029509] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/26/2011] [Accepted: 11/29/2011] [Indexed: 11/23/2022] Open
Abstract
Background There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and diseases. Each article investigates subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. Particularly, unraveling relationships between extra-cellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications. Methodology We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for “connecting the dots” across the literature. We describe a storytelling algorithm that, given a start and end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria such as the size of supporting neighborhoods for each link and the strength of individual links connecting publications, as well as global metrics of dispersion. To ensure that the story stays coherent as it meanders from one publication to another, we demonstrate the design of novel coherence and overlap filters for use as post-processing steps. Conclusions We demonstrate the application of our storytelling algorithm to three case studies: i) a many-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study to showcase the ability to recover a cancer related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow down a scientist's focus from several hundreds of thousands of relevant documents to only around a hundred stories. We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.
Collapse
Affiliation(s)
- M Shahriar Hossain
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America.
| | | | | | | | | | | |
Collapse
|
18
|
Hsiao MY, Chen CC, Chen JH. BrainKnowledge: a human brain function mapping knowledge-base system. Neuroinformatics 2011; 9:21-38. [PMID: 20857233 DOI: 10.1007/s12021-010-9083-9] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/19/2022]
Abstract
Associating fMRI image datasets with the available literature is crucial for the analysis and interpretation of fMRI data. Here, we present a human brain function mapping knowledge-base system (BrainKnowledge) that associates fMRI data analysis and literature search functions. BrainKnowledge not only contains indexed literature, but also provides the ability to compare experimental data with those derived from the literature. BrainKnowledge provides three major functions: (1) to search for brain activation models by selecting a particular brain function; (2) to query functions by brain structure; (3) to compare the fMRI data with data extracted from the literature. All these functions are based on our literature extraction and mining module developed earlier (Hsiao, Chen, Chen. Journal of Biomedical Informatics 42, 912-922, 2009), which automatically downloads and extracts information from a vast amount of fMRI literature and generates co-occurrence models and brain association patterns to illustrate the relevance of brain structures and functions. BrainKnowledge currently provides three co-occurrence models: (1) a structure-to-function co-occurrence model; (2) a function-to-structure co-occurrence model; and (3) a brain structure co-occurrence model. Each model has been generated from over 15,000 extracted Medline abstracts. In this study, we illustrate the capabilities of BrainKnowledge and provide an application example with the studies of affect. BrainKnowledge, which combines fMRI experimental results with Medline abstracts, may be of great assistance to scientists not only by freeing up resources and valuable time, but also by providing a powerful tool that collects and organizes over ten thousand abstracts into readily usable and relevant sources of information for researchers.
Collapse
Affiliation(s)
- Mei-Yu Hsiao
- Interdisciplinary MRI/MRS Laboratory, Department of Electrical Engineering, National Taiwan University, No. 1, Sec. 4, Roosevelt Rd., Taipei 106, Taiwan
| | | | | |
Collapse
|
19
|
Faro A, Giordano D, Spampinato C. Combining literature text mining with microarray data: advances for system biology modeling. Brief Bioinform 2011; 13:61-82. [PMID: 21677032 DOI: 10.1093/bib/bbr018] [Citation(s) in RCA: 45] [Impact Index Per Article: 3.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/10/2023] Open
Abstract
A huge amount of important biomedical information is hidden in the bulk of research articles in biomedical fields. At the same time, the publication of databases of biological information and of experimental datasets generated by high-throughput methods is in great expansion, and a wealth of annotated gene databases, chemical, genomic (including microarray datasets), clinical and other types of data repositories are now available on the Web. Thus a current challenge of bioinformatics is to develop targeted methods and tools that integrate scientific literature, biological databases and experimental data for reducing the time of database curation and for accessing evidence, either in the literature or in the datasets, useful for the analysis at hand. Under this scenario, this article reviews the knowledge discovery systems that fuse information from the literature, gathered by text mining, with microarray data for enriching the lists of down and upregulated genes with elements for biological understanding and for generating and validating new biological hypothesis. Finally, an easy to use and freely accessible tool, GeneWizard, that exploits text mining and microarray data fusion for supporting researchers in discovering gene-disease relationships is described.
Collapse
Affiliation(s)
- Alberto Faro
- Department of Informatics and Telecommunication Engineering-University of Catania, Catania, Italy
| | | | | |
Collapse
|
20
|
Using unsupervised patterns to extract gene regulation relationships for network construction. PLoS One 2011; 6:e19633. [PMID: 21573008 PMCID: PMC3091867 DOI: 10.1371/journal.pone.0019633] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/23/2010] [Accepted: 04/11/2011] [Indexed: 11/22/2022] Open
Abstract
Background The gene expression is usually described in the literature as a transcription factor X that regulates the target gene Y. Previously, some studies discovered gene regulations by using information from the biomedical literature and most of them require effort of human annotators to build the training dataset. Moreover, the large amount of textual knowledge recorded in the biomedical literature grows very rapidly, and the creation of manual patterns from literatures becomes more difficult. There is an increasing need to automate the process of establishing patterns. Methodology/Principal Findings In this article, we describe an unsupervised pattern generation method called AutoPat. It is a gene expression mining system that can generate unsupervised patterns automatically from a given set of seed patterns. The high scalability and low maintenance cost of the unsupervised patterns could help our system to extract gene expression from PubMed abstracts more precisely and effectively. Conclusions/Significance Experiments on several regulators show reasonable precision and recall rates which validate AutoPat's practical applicability. The conducted regulation networks could also be built precisely and effectively. The system in this study is available at http://ikmbio.csie.ncku.edu.tw/AutoPat/.
Collapse
|
21
|
Sintchenko V, Anthony S, Phan XH, Lin F, Coiera EW. A PubMed-wide associational study of infectious diseases. PLoS One 2010; 5:e9535. [PMID: 20224767 PMCID: PMC2835740 DOI: 10.1371/journal.pone.0009535] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2009] [Accepted: 02/11/2010] [Indexed: 11/18/2022] Open
Abstract
BACKGROUND Computational discovery is playing an ever-greater role in supporting the processes of knowledge synthesis. A significant proportion of the more than 18 million manuscripts indexed in the PubMed database describe infectious disease syndromes and various infectious agents. This study is the first attempt to integrate online repositories of text-based publications and microbial genome databases in order to explore the dynamics of relationships between pathogens and infectious diseases. METHODOLOGY/PRINCIPAL FINDINGS Herein we demonstrate how the knowledge space of infectious diseases can be computationally represented and quantified, and tracked over time. The knowledge space is explored by mapping of the infectious disease literature, looking at dynamics of literature deposition, zooming in from pathogen to genome level and searching for new associations. Syndromic signatures for different pathogens can be created to enable a new and clinically focussed reclassification of the microbial world. Examples of syndrome and pathogen networks illustrate how multilevel network representations of the relationships between infectious syndromes, pathogens and pathogen genomes can illuminate unexpected biological similarities in disease pathogenesis and epidemiology. CONCLUSIONS/SIGNIFICANCE This new approach based on text and data mining can support the discovery of previously hidden associations between diseases and microbial pathogens, clinically relevant reclassification of pathogenic microorganisms and accelerate the translational research enterprise.
Collapse
Affiliation(s)
- Vitali Sintchenko
- Centre for Health Informatics, University of New South Wales, Sydney, New South Wales, Australia.
| | | | | | | | | |
Collapse
|
22
|
|
23
|
Schulz S, Beisswanger E, van den Hoek L, Bodenreider O, van Mulligen EM. Alignment of the UMLS semantic network with BioTop: methodology and assessment. Bioinformatics 2009; 25:i69-76. [PMID: 19478019 PMCID: PMC2687948 DOI: 10.1093/bioinformatics/btp194] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022] Open
Abstract
MOTIVATION For many years, the Unified Medical Language System (UMLS) semantic network (SN) has been used as an upper-level semantic framework for the categorization of terms from terminological resources in biomedicine. BioTop has recently been developed as an upper-level ontology for the biomedical domain. In contrast to the SN, it is founded upon strict ontological principles, using OWL DL as a formal representation language, which has become standard in the semantic Web. In order to make logic-based reasoning available for the resources annotated or categorized with the SN, a mapping ontology was developed aligning the SN with BioTop. METHODS The theoretical foundations and the practical realization of the alignment are being described, with a focus on the design decisions taken, the problems encountered and the adaptations of BioTop that became necessary. For evaluation purposes, UMLS concept pairs obtained from MEDLINE abstracts by a named entity recognition system were tested for possible semantic relationships. Furthermore, all semantic-type combinations that occur in the UMLS Metathesaurus were checked for satisfiability. RESULTS The effort-intensive alignment process required major design changes and enhancements of BioTop and brought up several design errors that could be fixed. A comparison between a human curator and the ontology yielded only a low agreement. Ontology reasoning was also used to successfully identify 133 inconsistent semantic-type combinations. AVAILABILITY BioTop, the OWL DL representation of the UMLS SN, and the mapping ontology are available at http://www.purl.org/biotop/.
Collapse
Affiliation(s)
- Stefan Schulz
- Institute of Medical Biometry and Medical Informatics, University Medical Center Freiburg, Freiburg, Germany.
| | | | | | | | | |
Collapse
|
24
|
Tsoi LC, Boehnke M, Klein RL, Zheng WJ. Evaluation of genome-wide association study results through development of ontology fingerprints. ACTA ACUST UNITED AC 2009; 25:1314-20. [PMID: 19349285 DOI: 10.1093/bioinformatics/btp158] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
Abstract
MOTIVATION Genome-wide association (GWA) studies may identify multiple variants that are associated with a disease or trait. To narrow down candidates for further validation, quantitatively assessing how identified genes relate to a phenotype of interest is important. RESULTS We describe an approach to characterize genes or biological concepts (phenotypes, pathways, diseases, etc.) by ontology fingerprint--the set of Gene Ontology (GO) terms that are overrepresented among the PubMed abstracts discussing the gene or biological concept together with the enrichment p-value of these terms generated from a hypergeometric enrichment test. We then quantify the relevance of genes to the trait from a GWA study by calculating similarity scores between their ontology fingerprints using enrichment p-values. We validate this approach by correctly identifying corresponding genes for biological pathways with a 90% average area under the ROC curve (AUC). We applied this approach to rank genes identified through a GWA study that are associated with the lipid concentrations in plasma as well as to prioritize genes within linkage disequilibrium (LD) block. We found that the genes with highest scores were: ABCA1, lipoprotein lipase (LPL) and cholesterol ester transfer protein, plasma for high-density lipoprotein; low-density lipoprotein receptor, APOE and APOB for low-density lipoprotein; and LPL, APOA1 and APOB for triglyceride. In addition, we identified genes relevant to lipid metabolism from the literature even in cases where such knowledge was not reflected in current annotation of these genes. These results demonstrate that ontology fingerprints can be used effectively to prioritize genes from GWA studies for experimental validation.
Collapse
Affiliation(s)
- Lam C Tsoi
- Bioinformatics Graduate Program, Department of Biostatistics, Bioinformatics and Epidemiology, Medical University of South Carolina, Charleston, SC, USA
| | | | | | | |
Collapse
|
25
|
Jelier R, Schuemie MJ, Veldhoven A, Dorssers LCJ, Jenster G, Kors JA. Anni 2.0: a multipurpose text-mining tool for the life sciences. Genome Biol 2008; 9:R96. [PMID: 18549479 PMCID: PMC2481428 DOI: 10.1186/gb-2008-9-6-r96] [Citation(s) in RCA: 97] [Impact Index Per Article: 6.1] [Reference Citation Analysis] [Abstract] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/04/2008] [Revised: 04/07/2008] [Accepted: 06/12/2008] [Indexed: 01/19/2023] Open
Abstract
Anni 2.0 provides an ontology-based interface to MEDLINE. Anni 2.0 is an online tool () to aid the biomedical researcher with a broad range of information needs. Anni provides an ontology-based interface to MEDLINE and retrieves documents and associations for several classes of biomedical concepts, including genes, drugs and diseases, with established text-mining technology. In this article we illustrate Anni's usability by applying the tool to two use cases: interpretation of a set of differentially expressed genes, and literature-based knowledge discovery.
Collapse
Affiliation(s)
- Rob Jelier
- Department of Medical Informatics, Erasmus MC University Medical Center, Dr, Molewaterplein, Rotterdam, 3015 GE, The Netherlands.
| | | | | | | | | | | |
Collapse
|
26
|
Mons B, Ashburner M, Chichester C, van Mulligen E, Weeber M, den Dunnen J, van Ommen GJ, Musen M, Cockerill M, Hermjakob H, Mons A, Packer A, Pacheco R, Lewis S, Berkeley A, Melton W, Barris N, Wales J, Meijssen G, Moeller E, Roes PJ, Borner K, Bairoch A. Calling on a million minds for community annotation in WikiProteins. Genome Biol 2008; 9:R89. [PMID: 18507872 PMCID: PMC2441475 DOI: 10.1186/gb-2008-9-5-r89] [Citation(s) in RCA: 108] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/03/2007] [Revised: 03/03/2008] [Indexed: 11/16/2022] Open
Abstract
WikiProteins enables community annotation in a Wiki-based system. Extracts of major data sources have been fused into an editable environment that links out to the original sources. Data from community edits create automatic copies of the original data. Semantic technology captures concepts co-occurring in one sentence and thus potential factual statements. In addition, indirect associations between concepts have been calculated. We call on a 'million minds' to annotate a 'million concepts' and to collect facts from the literature with the reward of collaborative knowledge discovery. The system is available for beta testing at .
Collapse
Affiliation(s)
- Barend Mons
- Erasmus Medical Centre, Department of Medical Informatics, Rotterdam, the Netherlands.
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | |
Collapse
|
27
|
Maskery SM, Hu H, Hooke J, Shriver CD, Liebman MN. A Bayesian derived network of breast pathology co-occurrence. J Biomed Inform 2008; 41:242-50. [PMID: 18262472 DOI: 10.1016/j.jbi.2007.12.005] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/05/2007] [Revised: 12/17/2007] [Accepted: 12/26/2007] [Indexed: 11/16/2022]
Abstract
In this paper, we present the validation and verification of a machine-learning based Bayesian network of breast pathology co-occurrence. The present/not present occurrences of 29 common breast pathologies from 1631 pathology reports were used to build the network. All pathology reports were developed by a single pathologist. The resulting network has 25 diagnosis nodes interconnected by 40 arcs. Each arc represents a predicted co-occurrence or null co-occurrence. Model verification involved assessing the robustness of the original network structure after random exclusion of 25%, 50%, and 75% of the pathology report dataset. The structure of the network appears stable as random removal of 75% of the records in the original dataset leaves 81% of the original network intact. Model validation was primarily assessed by review of the breast pathology literature for each arc in the network. Almost all network identified co-occurrences (95%) have been published in the breast pathology literature or were verified by expert opinion. In conclusion, the Bayesian network of breast pathology co-occurrence presented here is both robust with respect to incomplete data and validated by consistency with the breast pathology literature and by expert opinion. Further, the ability to utilize a specific pathology observation to predict multiple co-current pathologies enables exploration of pathology co-occurrence patterns in an intuitive manner that may have broader application in both the breast pathologist clinical community and the breast cancer research community.
Collapse
Affiliation(s)
- Susan M Maskery
- Windber Research Institute, 620 7th Street, Windber, PA 15963, USA.
| | | | | | | | | |
Collapse
|
28
|
Burkart MF, Wren JD, Herschkowitz JI, Perou CM, Garner HR. Clustering microarray-derived gene lists through implicit literature relationships. Bioinformatics 2007; 23:1995-2003. [PMID: 17537751 DOI: 10.1093/bioinformatics/btm261] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
MOTIVATION Microarrays rapidly generate large quantities of gene expression information, but interpreting such data within a biological context is still relatively complex and laborious. New methods that can identify functionally related genes via shared literature concepts will be useful in addressing these needs. RESULTS We have developed a novel method that uses implicit literature relationships (concepts related via shared, intermediate concepts) to cluster related genes. Genes are evaluated for implicit connections within a network of biomedical objects (other genes, ontological concepts and diseases) that are connected via their co-occurrences in Medline titles and/or abstracts. On the basis of these implicit relationships, individual gene pairs are scored using a probability-based algorithm. Scores are generated for all pairwise combinations of genes, which are then clustered based on the scores. We applied this method to a test set composed of nine functional groups with known relationships. The method scored highly for all nine groups and significantly better than a benchmark co-occurrence-based method for six groups. We then applied this method to gene sets specific to two previously defined breast tumor subtypes. Analysis of the results recapitulated known biological relationships and identified novel pathway relationships unique to each tumor subtype. We demonstrate that this method provides a valuable new means of identifying and visualizing significantly related genes within gene lists via their implicit relationships in the literature.
Collapse
Affiliation(s)
- Mark F Burkart
- Department of Internal Medicine, The McDermott Center for Human Growth and Development, Division of Translational Research, The University of Texas Southwestern Medical Center, Dallas, Texas 75390, USA.
| | | | | | | | | |
Collapse
|
29
|
Rebholz-Schuhmann D, Kirsch H, Arregui M, Gaudan S, Riethoven M, Stoehr P. EBIMed--text crunching to gather facts for proteins from Medline. Bioinformatics 2007; 23:e237-44. [PMID: 17237098 DOI: 10.1093/bioinformatics/btl302] [Citation(s) in RCA: 135] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
Abstract
UNLABELLED To allow efficient and systematic retrieval of statements from Medline we have developed EBIMed, a service that combines document retrieval with co-occurrence-based analysis of Medline abstracts. Upon keyword query, EBIMed retrieves the abstracts from EMBL-EBI's installation of Medline and filters for sentences that contain biomedical terminology maintained in public bioinformatics resources. The extracted sentences and terminology are used to generate an overview table on proteins, Gene Ontology (GO) annotations, drugs and species used in the same biological context. All terms in retrieved abstracts and extracted sentences are linked to their entries in biomedical databases. We assessed the quality of the identification of terms and relations in the retrieved sentences. More than 90% of the protein names found indeed represented a protein. According to the analysis of four protein-protein pairs from the Wnt pathway we estimated that 37% of the statements containing such a pair mentioned a meaningful interaction and clarified the interaction of Dkk with LRP. We conclude that EBIMed improves access to information where proteins and drugs are involved in the same biological process, e.g. statements with GO annotations of proteins, protein-protein interactions and effects of drugs on proteins. AVAILABILITY Available at http://www.ebi.ac.uk/Rebholz-srv/ebimed
Collapse
|
30
|
Jelier R, Jenster G, Dorssers LCJ, Wouters BJ, Hendriksen PJM, Mons B, Delwel R, Kors JA. Text-derived concept profiles support assessment of DNA microarray data for acute myeloid leukemia and for androgen receptor stimulation. BMC Bioinformatics 2007; 8:14. [PMID: 17233900 PMCID: PMC1784107 DOI: 10.1186/1471-2105-8-14] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2006] [Accepted: 01/18/2007] [Indexed: 12/02/2022] Open
Abstract
Background High-throughput experiments, such as with DNA microarrays, typically result in hundreds of genes potentially relevant to the process under study, rendering the interpretation of these experiments problematic. Here, we propose and evaluate an approach to find functional associations between large numbers of genes and other biomedical concepts from free-text literature. For each gene, a profile of related concepts is constructed that summarizes the context in which the gene is mentioned in literature. We assign a weight to each concept in the profile based on a likelihood ratio measure. Gene concept profiles can then be clustered to find related genes and other concepts. Results The experimental validation was done in two steps. We first applied our method on a controlled test set. After this proved to be successful the datasets from two DNA microarray experiments were analyzed in the same way and the results were evaluated by domain experts. The first dataset was a gene-expression profile that characterizes the cancer cells of a group of acute myeloid leukemia patients. For this group of patients the biological background of the cancer cells is largely unknown. Using our methodology we found an association of these cells to monocytes, which agreed with other experimental evidence. The second data set consisted of differentially expressed genes following androgen receptor stimulation in a prostate cancer cell line. Based on the analysis we put forward a hypothesis about the biological processes induced in these studied cells: secretory lysosomes are involved in the production of prostatic fluid and their development and/or secretion are androgen-regulated processes. Conclusion Our method can be used to analyze DNA microarray datasets based on information explicitly and implicitly available in the literature. We provide a publicly available tool, dubbed Anni, for this purpose.
Collapse
Affiliation(s)
- Rob Jelier
- Department of Medical Informatics, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Guido Jenster
- Department of Urology, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Lambert CJ Dorssers
- Department of Pathology, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Bas J Wouters
- Department of Hematology, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Peter JM Hendriksen
- Department of Urology, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Barend Mons
- Department of Medical Informatics, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Ruud Delwel
- Department of Hematology, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| | - Jan A Kors
- Department of Medical Informatics, Erasmus MC – University Medical Center, Rotterdam, The Netherlands
| |
Collapse
|
31
|
Abstract
MOTIVATION The discovery of regulatory pathways, signal cascades, metabolic processes or disease models requires knowledge on individual relations like e.g. physical or regulatory interactions between genes and proteins. Most interactions mentioned in the free text of biomedical publications are not yet contained in structured databases. RESULTS We developed RelEx, an approach for relation extraction from free text. It is based on natural language preprocessing producing dependency parse trees and applying a small number of simple rules to these trees. We applied RelEx on a comprehensive set of one million MEDLINE abstracts dealing with gene and protein relations and extracted approximately 150,000 relations with an estimated performance of both 80% precision and 80% recall. AVAILABILITY The used natural language preprocessing tools are free for use for academic research. Test sets and relation term lists are available from our website (http://www.bio.ifi.lmu.de/publications/RelEx/).
Collapse
Affiliation(s)
- Katrin Fundel
- Institut für Informatik, Ludwig-Maximilians-Universität München, Amalienstrasse 17, 80333 München, Germany.
| | | | | |
Collapse
|
32
|
Olsson B, Gawronska B, Erlendsson B. Deriving pathway maps from automated text analysis using a grammar-based approach. J Bioinform Comput Biol 2006; 4:483-501. [PMID: 16819797 DOI: 10.1142/s0219720006002041] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2005] [Revised: 12/29/2005] [Accepted: 01/13/2006] [Indexed: 11/18/2022]
Abstract
We demonstrate how automated text analysis can be used to support the large-scale analysis of metabolic and regulatory pathways by deriving pathway maps from textual descriptions found in the scientific literature. The main assumption is that correct syntactic analysis combined with domain-specific heuristics provides a good basis for relation extraction. Our method uses an algorithm that searches through the syntactic trees produced by a parser based on a Referent Grammar formalism, identifies relations mentioned in the sentence, and classifies them with respect to their semantic class and epistemic status (facts, counterfactuals, hypotheses). The semantic categories used in the classification are based on the relation set used in KEGG (Kyoto Encyclopedia of Genes and Genomes), so that pathway maps using KEGG notation can be automatically generated. We present the current version of the relation extraction algorithm and an evaluation based on a corpus of abstracts obtained from PubMed. The results indicate that the method is able to combine a reasonable coverage with high accuracy. We found that 61% of all sentences were parsed, and 97% of the parse trees were judged to be correct. The extraction algorithm was tested on a sample of 300 parse trees and was found to produce correct extractions in 90.5% of the cases.
Collapse
Affiliation(s)
- Björn Olsson
- School of Humanities and Informatics, University of Skövde, Box 408, 541 28 Skövde, Sweden.
| | | | | |
Collapse
|
33
|
Chagoyen M, Carmona-Saez P, Shatkay H, Carazo JM, Pascual-Montano A. Discovering semantic features in the literature: a foundation for building functional associations. BMC Bioinformatics 2006; 7:41. [PMID: 16438716 PMCID: PMC1386711 DOI: 10.1186/1471-2105-7-41] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2005] [Accepted: 01/26/2006] [Indexed: 11/10/2022] Open
Abstract
Background Experimental techniques such as DNA microarray, serial analysis of gene expression (SAGE) and mass spectrometry proteomics, among others, are generating large amounts of data related to genes and proteins at different levels. As in any other experimental approach, it is necessary to analyze these data in the context of previously known information about the biological entities under study. The literature is a particularly valuable source of information for experiment validation and interpretation. Therefore, the development of automated text mining tools to assist in such interpretation is one of the main challenges in current bioinformatics research. Results We present a method to create literature profiles for large sets of genes or proteins based on common semantic features extracted from a corpus of relevant documents. These profiles can be used to establish pair-wise similarities among genes, utilized in gene/protein classification or can be even combined with experimental measurements. Semantic features can be used by researchers to facilitate the understanding of the commonalities indicated by experimental results. Our approach is based on non-negative matrix factorization (NMF), a machine-learning algorithm for data analysis, capable of identifying local patterns that characterize a subset of the data. The literature is thus used to establish putative relationships among subsets of genes or proteins and to provide coherent justification for this clustering into subsets. We demonstrate the utility of the method by applying it to two independent and vastly different sets of genes. Conclusion The presented method can create literature profiles from documents relevant to sets of genes. The representation of genes as additive linear combinations of semantic features allows for the exploration of functional associations as well as for clustering, suggesting a valuable methodology for the validation and interpretation of high-throughput experimental data.
Collapse
Affiliation(s)
- Monica Chagoyen
- Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain
| | - Pedro Carmona-Saez
- Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain
| | - Hagit Shatkay
- School of Computing, Queen's University, Kingston, Ontario, Canada
| | - Jose M Carazo
- Biocomputing Unit, Centro Nacional de Biotecnologia – CSIC, Madrid, Spain
| | | |
Collapse
|
34
|
Alako BTF, Veldhoven A, van Baal S, Jelier R, Verhoeven S, Rullmann T, Polman J, Jenster G. CoPub Mapper: mining MEDLINE based on search term co-publication. BMC Bioinformatics 2005; 6:51. [PMID: 15760478 PMCID: PMC1274248 DOI: 10.1186/1471-2105-6-51] [Citation(s) in RCA: 56] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/21/2004] [Accepted: 03/11/2005] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND High throughput microarray analyses result in many differentially expressed genes that are potentially responsible for the biological process of interest. In order to identify biological similarities between genes, publications from MEDLINE were identified in which pairs of gene names and combinations of gene name with specific keywords were co-mentioned. RESULTS MEDLINE search strings for 15,621 known genes and 3,731 keywords were generated and validated. PubMed IDs were retrieved from MEDLINE and relative probability of co-occurrences of all gene-gene and gene-keyword pairs determined. To assess gene clustering according to literature co-publication, 150 genes consisting of 8 sets with known connections (same pathway, same protein complex, or same cellular localization, etc.) were run through the program. Receiver operator characteristics (ROC) analyses showed that most gene sets were clustered much better than expected by random chance. To test grouping of genes from real microarray data, 221 differentially expressed genes from a microarray experiment were analyzed with CoPub Mapper, which resulted in several relevant clusters of genes with biological process and disease keywords. In addition, all genes versus keywords were hierarchical clustered to reveal a complete grouping of published genes based on co-occurrence. CONCLUSION The CoPub Mapper program allows for quick and versatile querying of co-published genes and keywords and can be successfully used to cluster predefined groups of genes and microarray data.
Collapse
Affiliation(s)
- Blaise TF Alako
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Antoine Veldhoven
- Department of Urology, Erasmus MC, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| | - Sjozef van Baal
- Department of Genetics, Erasmus MC, Rotterdam, The Netherlands
| | - Rob Jelier
- Department of Medical Informatics, Erasmus MC, Rotterdam, The Netherlands
| | - Stefan Verhoeven
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Ton Rullmann
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Jan Polman
- Department of Molecular Design & Informatics, Organon NV, P.O. Box 20, 5340 BH Oss, The Netherlands
| | - Guido Jenster
- Department of Urology, Erasmus MC, P.O. Box 1738, 3000 DR Rotterdam, The Netherlands
| |
Collapse
|