1
Xu S, Sun S, Zhang Z, Xu F, Liu J. BERT gated multi-window attention network for relation extraction. Neurocomputing 2022. [DOI: 10.1016/j.neucom.2021.12.044] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/25/2022]
2
3
Sai Prashanthi G, Deva A, Vadapalli R, Das AV. Automated Categorization of Systemic Disease and Duration From Electronic Medical Record System Data Using Finite-State Machine Modeling: Prospective Validation Study. JMIR Form Res 2020; 4:e24490. [PMID: 33331823 PMCID: PMC7775202 DOI: 10.2196/24490] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.2] [Received: 09/22/2020] [Revised: 11/12/2020] [Accepted: 11/17/2020] [Indexed: 01/14/2023] Open
Abstract
BACKGROUND One of the major challenges in the health care sector is that approximately 80% of generated data remains unstructured and unused. Since it is difficult to handle unstructured data from electronic medical record systems, it tends to be neglected for analyses in most hospitals and medical centers. Therefore, there is a need to analyze unstructured big data in health care systems so that we can optimally utilize and unearth all unexploited information from it. OBJECTIVE In this study, we aimed to extract a list of diseases and associated keywords along with the corresponding time durations from an indigenously developed electronic medical record system and describe the possibility of analytics from the acquired datasets. METHODS We propose a novel, finite-state machine to sequentially detect and cluster disease names from patients' medical history. We defined 3 states in the finite-state machine and transition matrix, which depend on the identified keyword. In addition, we also defined a state-change action matrix, which is essentially an action associated with each transition. The dataset used in this study was obtained from an indigenously developed electronic medical record system called eyeSmart that was implemented across a large, multitier ophthalmology network in India. The dataset included patients' past medical history and contained records of 10,000 distinct patients. RESULTS We extracted disease names and associated keywords by using the finite-state machine with an accuracy of 95%, sensitivity of 94.9%, and positive predictive value of 100%. For the extraction of the duration of disease, the machine's accuracy was 93%, sensitivity was 92.9%, and the positive predictive value was 100%. CONCLUSIONS We demonstrated that the finite-state machine we developed in this study can be used to accurately identify disease names, associated keywords, and time durations from a large cohort of patient records obtained using an electronic medical record system.
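The three-state machine with a transition matrix and a state-change action matrix can be pictured with a small sketch. The abstract does not list the actual states, keywords, or actions, so everything named below (the `DISEASES` and `UNITS` lexicons, the state names, the action names) is a hypothetical stand-in chosen only to illustrate the idea.

```python
import re

# Hypothetical lexicons; the study's real keyword lists are not given in the abstract.
DISEASES = {"diabetes", "hypertension", "asthma"}
UNITS = {"year", "years", "month", "months"}

# Three illustrative states (assumed names).
IDLE, IN_DISEASE, IN_DURATION = range(3)


def classify(token):
    """Map a token to the keyword class that drives the transitions."""
    if token in DISEASES:
        return "disease"
    if token.isdigit():
        return "number"
    if token in UNITS:
        return "unit"
    return "other"


# Transition matrix: (state, keyword class) -> next state.
TRANSITIONS = {
    (IDLE, "disease"): IN_DISEASE,
    (IN_DISEASE, "disease"): IN_DISEASE,
    (IN_DISEASE, "number"): IN_DURATION,
    (IN_DURATION, "unit"): IDLE,
    (IN_DURATION, "disease"): IN_DISEASE,
}

# State-change action matrix: (state, keyword class) -> action to perform.
ACTIONS = {
    (IDLE, "disease"): "start_record",
    (IN_DISEASE, "disease"): "start_record",
    (IN_DISEASE, "number"): "store_number",
    (IN_DURATION, "unit"): "store_unit",
    (IN_DURATION, "disease"): "start_record",
}


def extract(text):
    """Scan the text once and return disease records with optional durations."""
    records = []
    state = IDLE
    for token in re.findall(r"[a-z]+|\d+", text.lower()):
        key = (state, classify(token))
        action = ACTIONS.get(key)
        if action == "start_record":
            records.append({"disease": token, "duration": None})
        elif action == "store_number":
            records[-1]["duration"] = token
        elif action == "store_unit":
            records[-1]["duration"] += " " + token
        # Pairs missing from the matrix leave the state unchanged.
        state = TRANSITIONS.get(key, state)
    return records
```

Keeping transitions and actions in two separate lookup tables mirrors the abstract's description of a transition matrix plus an action matrix, and makes it easy to add keyword classes without touching the scanning loop.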
Affiliation(s)
- Ayush Deva
- International Institute of Information Technology, Hyderabad, Telangana, India
- Ranganath Vadapalli
- Department of eyeSmart EMR & AEye, LV Prasad Eye Institute, Hyderabad, Telangana, India
- Anthony Vipin Das
- Department of eyeSmart EMR & AEye, LV Prasad Eye Institute, Hyderabad, Telangana, India
4
Li Z, Yang J, Gou X, Qi X. Recurrent neural networks with segment attention and entity description for relation extraction from clinical texts. Artif Intell Med 2019; 97:9-18. [PMID: 31202398 DOI: 10.1016/j.artmed.2019.04.003] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.0] [Received: 12/11/2018] [Revised: 04/23/2019] [Accepted: 04/23/2019] [Indexed: 11/30/2022]
Abstract
Relation extraction from clinical texts has progressed considerably, but current models still perform poorly on long sentences and on sentences containing multiple entities. In this paper, we propose a novel neural network architecture based on Bidirectional Long Short-Term Memory Networks for relation classification. First, we utilize a concat-attention mechanism to capture the most important context words for relation extraction in a sentence. In addition, a segment attention mechanism is proposed to improve the model's performance on long sentences. Finally, a tensor-based entity description is used to overcome the performance degradation of the model when there are multiple entities in a sentence. The proposed model is evaluated on a part of the i2b2-2010 shared task clinical relation extraction dataset. The results indicate that our model can effectively overcome the above two problems and improves the F1-score by approximately 3% compared with the baseline model.
Affiliation(s)
- Zhi Li
- College of Electronics and Information Engineering, University of Sichuan, 10065, China; Key Laboratory of Wireless Power Transmission of Ministry of Education, University of Sichuan, 610065, China
- Jinshan Yang
- College of Electronics and Information Engineering, University of Sichuan, 10065, China
- Xu Gou
- College of Electronics and Information Engineering, University of Sichuan, 10065, China
- Xiaorong Qi
- Department of Gynecology and Obstetrics, Key Laboratory of Obstetric and Gynecologic and Pediatric Diseases and Birth Defects of Ministry of Education, West China Second Hospital, University of Sichuan, 610041, China.
5
Li Z, Yang Z, Shen C, Xu J, Zhang Y, Xu H. Integrating shortest dependency path and sentence sequence into a deep learning framework for relation extraction in clinical text. BMC Med Inform Decis Mak 2019; 19:22. [PMID: 30700301 PMCID: PMC6354333 DOI: 10.1186/s12911-019-0736-9] [Citation(s) in RCA: 26] [Impact Index Per Article: 4.3] [Indexed: 12/05/2022] Open
Abstract
Background Extracting relations between important clinical entities is critical but very challenging for natural language processing (NLP) in the medical domain. Researchers have applied deep learning-based approaches to clinical relation extraction; but most of them consider sentence sequence only, without modeling syntactic structures. The aim of this study was to utilize a deep neural network to capture the syntactic features and further improve the performances of relation extraction in clinical notes. Methods We propose a novel neural approach to model shortest dependency path (SDP) between target entities together with the sentence sequence for clinical relation extraction. Our neural network architecture consists of three modules: (1) sentence sequence representation module using bidirectional long short-term memory network (Bi-LSTM) to capture the features in the sentence sequence; (2) SDP representation module implementing the convolutional neural network (CNN) and Bi-LSTM network to capture the syntactic context for target entities using SDP information; and (3) classification module utilizing a fully-connected layer with Softmax function to classify the relation type between target entities. Results Using the 2010 i2b2/VA relation extraction dataset, we compared our approach with other baseline methods. Our experimental results show that the proposed approach achieved significant improvements over comparable existing methods, demonstrating the effectiveness of utilizing syntactic structures in deep learning-based relation extraction. The F-measure of our method reaches 74.34% which is 2.5% higher than the method without using syntactic features. Conclusions We propose a new neural network architecture by modeling SDP along with sentence sequence to extract multi-relations from clinical text. 
Our experimental results show that the proposed approach significantly improves performance on clinical notes, demonstrating the effectiveness of syntactic structures in deep learning-based relation extraction.
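To make the notion of a shortest dependency path (SDP) concrete, the sketch below finds the shortest path between two entity tokens in a dependency tree treated as an undirected graph. It covers only the path-finding step, not the CNN and Bi-LSTM modules, and the dependency edges in the usage note are hand-written for illustration rather than produced by a parser.

```python
from collections import deque


def shortest_dependency_path(edges, source, target):
    """Breadth-first search over (head, dependent) edges, ignoring direction."""
    graph = {}
    for head, dependent in edges:
        graph.setdefault(head, []).append(dependent)
        graph.setdefault(dependent, []).append(head)

    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parent links to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for neighbor in graph.get(node, []):
            if neighbor not in parents:
                parents[neighbor] = node
                queue.append(neighbor)
    return None  # the two tokens are not connected


# Hand-written (hypothetical) parse of:
# "The patient was given aspirin for severe headache"
EDGES = [
    ("given", "patient"), ("patient", "The"), ("given", "was"),
    ("given", "aspirin"), ("given", "headache"),
    ("headache", "for"), ("headache", "severe"),
]
```

With these edges, `shortest_dependency_path(EDGES, "aspirin", "headache")` yields the path through the shared head `given`, dropping modifiers such as "severe" that lie off the path.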
Affiliation(s)
- Zhiheng Li
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Zhihao Yang
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Chen Shen
- School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, China
- Jun Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Yaoyun Zhang
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Hua Xu
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA.
6
Leroy G, Gu Y, Pettygrove S, Galindo MK, Arora A, Kurzius-Spencer M. Automated Extraction of Diagnostic Criteria From Electronic Health Records for Autism Spectrum Disorders: Development, Evaluation, and Application. J Med Internet Res 2018; 20:e10497. [PMID: 30404767 PMCID: PMC6249505 DOI: 10.2196/10497] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 03/25/2018] [Revised: 06/18/2018] [Accepted: 07/10/2018] [Indexed: 12/18/2022] Open
Abstract
BACKGROUND Electronic health records (EHRs) bring many opportunities for information utilization. One such use is the surveillance conducted by the Centers for Disease Control and Prevention to track cases of autism spectrum disorder (ASD). This process currently comprises manual collection and review of EHRs of 4- and 8-year old children in 11 US states for the presence of ASD criteria. The work is time-consuming and expensive. OBJECTIVE Our objective was to automatically extract from EHRs the description of behaviors noted by the clinicians in evidence of the diagnostic criteria in the Diagnostic and Statistical Manual of Mental Disorders (DSM). Previously, we reported on the classification of entire EHRs as ASD or not. In this work, we focus on the extraction of individual expressions of the different ASD criteria in the text. We intend to facilitate large-scale surveillance efforts for ASD and support analysis of changes over time as well as enable integration with other relevant data. METHODS We developed a natural language processing (NLP) parser to extract expressions of 12 DSM criteria using 104 patterns and 92 lexicons (1787 terms). The parser is rule-based to enable precise extraction of the entities from the text. The entities themselves are encompassed in the EHRs as very diverse expressions of the diagnostic criteria written by different people at different times (clinicians, speech pathologists, among others). Due to the sparsity of the data, a rule-based approach is best suited until larger datasets can be generated for machine learning algorithms. RESULTS We evaluated our rule-based parser and compared it with a machine learning baseline (decision tree). Using a test set of 6636 sentences (50 EHRs), we found that our parser achieved 76% precision, 43% recall (ie, sensitivity), and >99% specificity for criterion extraction. The performance was better for the rule-based approach than for the machine learning baseline (60% precision and 30% recall). 
For some individual criteria, precision was as high as 97% and recall 57%. Since precision was very high, we were assured that criteria were rarely assigned incorrectly, and our numbers presented a lower bound of their presence in EHRs. We then conducted a case study and parsed 4480 new EHRs covering 10 years of surveillance records from the Arizona Developmental Disabilities Surveillance Program. The social criteria (A1 criteria) showed the biggest change over the years. The communication criteria (A2 criteria) did not distinguish the ASD from the non-ASD records. Among behaviors and interests criteria (A3 criteria), 1 (A3b) was present with much greater frequency in the ASD than in the non-ASD EHRs. CONCLUSIONS Our results demonstrate that NLP can support large-scale analysis useful for ASD surveillance and research. In the future, we intend to facilitate detailed analysis and integration of national datasets.
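A rule-based parser built from patterns and lexicons can be sketched roughly as follows. The lexicon entries, the single pattern, and its mapping to the label `A1` are all invented for illustration; the study's 104 patterns and 92 lexicons are not reproduced in the abstract.

```python
import re

# Hypothetical lexicons; the real parser uses 92 lexicons with 1787 terms.
LEXICONS = {
    "NEGATION": ["does not", "rarely", "fails to"],
    "SOCIAL_ACT": ["make eye contact", "respond to his name", "share enjoyment"],
}


def compile_pattern(template):
    """Expand <LEXICON> slots in a template into one regular expression."""
    def expand(match):
        terms = LEXICONS[match.group(1)]
        return "(?:" + "|".join(re.escape(term) for term in terms) + ")"

    return re.compile(re.sub(r"<(\w+)>", expand, template), re.IGNORECASE)


# One illustrative pattern, labeled with an assumed criterion code.
PATTERNS = {"A1": compile_pattern(r"<NEGATION>\s+<SOCIAL_ACT>")}


def extract_criteria(sentence):
    """Return (criterion label, matched expression) pairs found in a sentence."""
    return [
        (label, match.group(0))
        for label, pattern in PATTERNS.items()
        for match in pattern.finditer(sentence)
    ]
```

Because a sentence is labeled only when an explicit pattern fires, a design like this tends to favor precision over recall, which is consistent with the 76% precision and 43% recall reported above.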
Affiliation(s)
- Gondy Leroy
- University of Arizona, Tucson, AZ, United States
- Yang Gu
- University of Arizona, Tucson, AZ, United States
7
Wang G, He X, Ishuga CI. HAR-SI: A novel hybrid article recommendation approach integrating with social information in scientific social network. Knowl Based Syst 2018. [DOI: 10.1016/j.knosys.2018.02.024] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.1] [Indexed: 11/26/2022]
8
Khordad M, Mercer RE. Identifying genotype-phenotype relationships in biomedical text. J Biomed Semantics 2017; 8:57. [PMID: 29212530 PMCID: PMC5719522 DOI: 10.1186/s13326-017-0163-8] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Received: 07/12/2016] [Accepted: 10/28/2017] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND One important type of information contained in biomedical research literature is the newly discovered relationships between phenotypes and genotypes. Because of the large quantity of literature, a reliable automatic system to identify this information for future curation is essential. Such a system provides important and up-to-date data for database construction and updating, and even text summarization. In this paper we present a machine learning method to identify these genotype-phenotype relationships. No large human-annotated corpus of genotype-phenotype relationships currently exists, so a semi-automatic approach has been used to annotate a small labelled training set, and a self-training method is proposed to annotate more sentences and enlarge the training set. RESULTS The resulting machine-learned model was evaluated using a separate test set annotated by an expert. The results show that using only the small training set in a supervised learning method achieves good results (precision: 76.47, recall: 77.61, F-measure: 77.03), which are improved by applying a self-training method (precision: 77.70, recall: 77.84, F-measure: 77.77). CONCLUSIONS Relationships between genotypes and phenotypes are biomedical information pivotal to the understanding of a patient's situation. Our proposed method is the first attempt to make a specialized system to identify genotype-phenotype relationships in biomedical literature. We achieve good results using a small training set. To improve the results, other linguistic contexts need to be explored and an appropriately enlarged training set is required.
Affiliation(s)
- Maryam Khordad
- Department of Computer Science, University of Western Ontario, 1151 Richmond Street, London, N6A 5B7 Canada
- Robert E. Mercer
- Department of Computer Science, University of Western Ontario, 1151 Richmond Street, London, N6A 5B7 Canada
9
Wang G, He X, Ishuga CI. Social and content aware One-Class recommendation of papers in scientific social networks. PLoS One 2017; 12:e0181380. [PMID: 28771495 PMCID: PMC5542664 DOI: 10.1371/journal.pone.0181380] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 05/15/2017] [Accepted: 06/29/2017] [Indexed: 11/18/2022] Open
Abstract
With the rapid development of information technology, scientific social networks (SSNs) have become the fastest and most convenient way for researchers to communicate with each other. Many published papers are shared via SSNs every day, resulting in the problem of information overload. How to appropriately recommend personalized and highly valuable papers for researchers is becoming more urgent. However, when recommending papers in SSNs, only a small number of positive instances are available, leaving a vast amount of unlabelled data in which negative instances and potential unseen positive instances are mixed together; this is naturally a One-Class Collaborative Filtering (OCCF) problem. Therefore, considering the extreme data imbalance and data sparsity of this OCCF problem, a hybrid approach of Social and Content aware One-class Recommendation of Papers in SSNs, termed SCORP, is proposed in this study. Unlike previous approaches proposed to address the OCCF problem, social information, which has been shown to play a significant role in performing recommendations in many domains, is applied in both the profiling of content-based filtering and the collaborative filtering to achieve superior recommendations. To verify the effectiveness of the proposed SCORP approach, a real-life dataset from CiteULike was employed. The experimental results demonstrate that the proposed approach is superior to all of the compared approaches, thus providing a more effective method for recommending papers in SSNs.
Affiliation(s)
- Gang Wang
- School of Management, Hefei University of Technology, Hefei, Anhui, People’s Republic of China
- XiRan He
- School of Management, Hefei University of Technology, Hefei, Anhui, People’s Republic of China
10
Lou Y, Tu SW, Nyulas C, Tudorache T, Chalmers RJG, Musen MA. Use of ontology structure and Bayesian models to aid the crowdsourcing of ICD-11 sanctioning rules. J Biomed Inform 2017; 68:20-34. [PMID: 28192233 DOI: 10.1016/j.jbi.2017.02.004] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.3] [Received: 08/17/2016] [Revised: 02/02/2017] [Accepted: 02/08/2017] [Indexed: 11/18/2022]
Abstract
The International Classification of Diseases (ICD) is the de facto standard international classification for mortality reporting and for many epidemiological, clinical, and financial use cases. The next version of ICD, ICD-11, will be submitted for approval by the World Health Assembly in 2018. Unlike previous versions of ICD, where coders mostly select single codes from pre-enumerated disease and disorder codes, ICD-11 coding will allow extensive use of multiple codes to give more detailed disease descriptions. For example, "severe malignant neoplasms of left breast" may be coded using the combination of a "stem code" (e.g., code for malignant neoplasms of breast) with a variety of "extension codes" (e.g., codes for laterality and severity). The use of multiple codes (a process called post-coordination), while avoiding the pitfall of having to pre-enumerate vast number of possible disease and qualifier combinations, risks the creation of meaningless expressions that combine stem codes with inappropriate qualifiers. To prevent that from happening, "sanctioning rules" that define legal combinations are necessary. In this work, we developed a crowdsourcing method for obtaining sanctioning rules for the post-coordination of concepts in ICD-11. Our method utilized the hierarchical structures in the domain to improve the accuracy of the sanctioning rules and to lower the crowdsourcing cost. We used Bayesian networks to model crowd workers' skills, the accuracy of their responses, and our confidence in the acquired sanctioning rules. We applied reinforcement learning to develop an agent that constantly adjusted the confidence cutoffs during the crowdsourcing process to maximize the overall quality of sanctioning rules under a fixed budget. Finally, we performed formative evaluations using a skin-disease branch of the draft ICD-11 and demonstrated that the crowd-sourced sanctioning rules replicated those defined by an expert dermatologist with high precision and recall. 
This work demonstrated that a crowdsourcing approach could offer a reasonably efficient method for generating a first draft of sanctioning rules that subject matter experts could verify and edit, thus relieving them of the tedium and cost of formulating the initial set of rules.
Affiliation(s)
- Yun Lou
- Stanford University, Stanford, CA, USA
11
Development and evaluation of a biomedical search engine using a predicate-based vector space model. J Biomed Inform 2013; 46:929-39. [PMID: 23892296 DOI: 10.1016/j.jbi.2013.07.006] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Received: 02/01/2013] [Revised: 06/18/2013] [Accepted: 07/19/2013] [Indexed: 11/21/2022]
Abstract
Although biomedical information available in articles and patents is increasing exponentially, we continue to rely on the same information retrieval methods and use very few keywords to search millions of documents. We are developing a fundamentally different approach for finding much more precise and complete information with a single query using predicates instead of keywords for both query and document representation. Predicates are triples that are more complex data structures than keywords and contain more structured information. To make optimal use of them, we developed a new predicate-based vector space model and query-document similarity function with adjusted tf-idf and boost function. Using a test bed of 107,367 PubMed abstracts, we evaluated the first essential function: retrieving information. Cancer researchers provided 20 realistic queries, for which the top 15 abstracts were retrieved using a predicate-based (new) and keyword-based (baseline) approach. Each abstract was evaluated, double-blind, by cancer researchers on a 0-5 point scale to calculate precision (0 versus higher) and relevance (0-5 score). Precision was significantly higher (p<.001) for the predicate-based (80%) than for the keyword-based (71%) approach. Relevance was almost doubled with the predicate-based approach: 2.1 versus 1.6 without rank order adjustment (p<.001) and 1.34 versus 0.98 with rank order adjustment (p<.001) for the predicate-based versus keyword-based approach, respectively. Predicates can support more precise searching than keywords, laying the foundation for rich and sophisticated information search.
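The idea of using predicates rather than keywords as vector dimensions can be illustrated with a small sketch. It uses a plain smoothed tf-idf and cosine similarity; the paper's adjusted tf-idf and boost function are not described in the abstract, so they are not reproduced here, and the example triples are invented.

```python
import math
from collections import Counter


def idf_weights(docs):
    """Smoothed inverse document frequency for each (subject, relation, object) triple."""
    n = len(docs)
    df = Counter(p for doc in docs for p in set(doc))
    return {p: math.log((1 + n) / (1 + count)) + 1.0 for p, count in df.items()}


def vectorize(predicates, idf):
    """Sparse tf-idf vector keyed by whole triples, not by individual words."""
    tf = Counter(predicates)
    return {p: tf[p] * idf[p] for p in tf if p in idf}


def cosine(a, b):
    dot = sum(weight * b.get(p, 0.0) for p, weight in a.items())
    norm = math.sqrt(sum(w * w for w in a.values())) * math.sqrt(sum(w * w for w in b.values()))
    return dot / norm if norm else 0.0


def rank(query, docs):
    """Return document indices ordered from most to least similar to the query."""
    idf = idf_weights(docs)
    q = vectorize(query, idf)
    scores = [(cosine(q, vectorize(doc, idf)), i) for i, doc in enumerate(docs)]
    return [i for _score, i in sorted(scores, reverse=True)]


# Invented example documents, each already reduced to predicate triples.
DOCS = [
    [("tamoxifen", "treats", "breast cancer"), ("brca1", "associated_with", "breast cancer")],
    [("breast cancer", "treats", "tamoxifen")],  # same words, different predicate
    [("aspirin", "treats", "headache")],
]
QUERY = [("tamoxifen", "treats", "breast cancer")]
```

The second document shares every keyword with the query but scores zero, because the triple as a whole differs; this is the kind of distinction a keyword-based model cannot make.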
12
Lee J, Kim S, Lee S, Lee K, Kang J. On the efficacy of per-relation basis performance evaluation for PPI extraction and a high-precision rule-based approach. BMC Med Inform Decis Mak 2013; 13 Suppl 1:S7. [PMID: 23566263 PMCID: PMC3618211 DOI: 10.1186/1472-6947-13-s1-s7] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Indexed: 11/10/2022] Open
Abstract
Background Most previous Protein Protein Interaction (PPI) studies evaluated their algorithms' performance based on "per-instance" precision and recall, in which the instances of an interaction relation were evaluated independently. However, we argue that this standard evaluation method should be revisited. In a large corpus, the same relation can be described in various different forms and, in practice, correctly identifying not all but a small subset of them would often suffice to detect the given interaction. Methods In this regard, we propose a more pragmatic "per-relation" basis performance evaluation method instead of the conventional per-instance basis method. In the per-relation basis method, only a subset of a relation's instances needs to be correctly identified to make the relation positive. In this work, we also introduce a new high-precision rule-based PPI extraction algorithm. While virtually all current PPI extraction studies focus on improving F-score, aiming to balance the performance on both precision and recall, in many realistic scenarios involving large corpora, one can benefit more from a high-precision algorithm than a high-recall counterpart. Results We show that our algorithm not only achieves better per-relation performance than previous solutions but also serves as a good complement to the existing PPI extraction tools. Our algorithm improves the performance of the existing tools through simple pipelining. Conclusion The significance of this research can be found in that this research brought new perspective to the performance evaluation of PPI extraction studies, which we believe is more important in practice than existing evaluation criteria. Given the new evaluation perspective, we also showed the importance of a high-precision extraction tool and validated the efficacy of our rule-based system as the high-precision tool candidate.
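The difference between per-instance and per-relation evaluation can be shown with a short sketch. The instance format (sentence id plus two protein names), the `min_hits` threshold, and the example data are assumptions made for illustration; the abstract only states that a subset of a relation's instances must be correctly identified.

```python
from collections import Counter


def per_instance(predicted, gold):
    """Precision and recall where every mention of an interaction counts separately."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    return precision, recall


def relation_of(instance):
    """Collapse an instance to its unordered protein pair."""
    _sentence_id, a, b = instance
    return frozenset((a, b))


def per_relation(predicted, gold, min_hits=1):
    """A relation counts as found once at least `min_hits` of its instances are correct."""
    hits = Counter(relation_of(i) for i in predicted & gold)
    found = {rel for rel, count in hits.items() if count >= min_hits}
    predicted_relations = {relation_of(i) for i in predicted}
    gold_relations = {relation_of(i) for i in gold}
    precision = len(found) / len(predicted_relations) if predicted_relations else 0.0
    recall = len(found) / len(gold_relations) if gold_relations else 0.0
    return precision, recall


# Invented example: one interaction is mentioned three times, another once.
GOLD = {(1, "RAD51", "BRCA2"), (2, "RAD51", "BRCA2"), (3, "BRCA2", "RAD51"), (4, "TP53", "MDM2")}
PREDICTED = {(1, "RAD51", "BRCA2"), (4, "TP53", "MDM2")}
```

Here a high-precision extractor that catches only one mention of each interaction gets a per-instance recall of 0.5 but a per-relation recall of 1.0, which is the effect the authors argue matters in large corpora.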
Affiliation(s)
- Junkyu Lee
- Department of Computer Science, Korea University, Seoul, Korea
13
Veuthey AL, Bridge A, Gobeill J, Ruch P, McEntyre JR, Bougueleret L, Xenarios I. Application of text-mining for updating protein post-translational modification annotation in UniProtKB. BMC Bioinformatics 2013; 14:104. [PMID: 23517090 PMCID: PMC3660268 DOI: 10.1186/1471-2105-14-104] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 11/06/2012] [Accepted: 03/08/2013] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The annotation of protein post-translational modifications (PTMs) is an important task of UniProtKB curators and, with continuing improvements in experimental methodology, an ever greater number of articles are being published on this topic. To help curators cope with this growing body of information we have developed a system which extracts information from the scientific literature for the most frequently annotated PTMs in UniProtKB. RESULTS The procedure uses a pattern-matching and rule-based approach to extract sentences with information on the type and site of modification. A ranked list of protein candidates for the modification is also provided. For PTM extraction, precision varies from 57% to 94%, and recall from 75% to 95%, according to the type of modification. The procedure was used to track new publications on PTMs and to recover potential supporting evidence for phosphorylation sites annotated based on the results of large scale proteomics experiments. CONCLUSIONS The information retrieval and extraction method we have developed in this study forms the basis of a simple tool for the manual curation of protein post-translational modifications in UniProtKB/Swiss-Prot. Our work demonstrates that even simple text-mining tools can be effectively adapted for database curation tasks, providing that a thorough understanding of the working process and requirements are first obtained. This system can be accessed at http://eagl.unige.ch/PTM/.
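A pattern-matching step that pulls the modification type and site out of a sentence might look like the sketch below. The regular expression, the handful of modification stems, and the residue list are illustrative assumptions, not the rules used by the system described above.

```python
import re

# Illustrative pattern: a modification word, a short gap, then a residue and position.
PTM_PATTERN = re.compile(
    r"\b(?P<stem>phosphorylat|acetylat|methylat)(?:ion|ed)\b"
    r"[^.]{0,40}?"
    r"\b(?P<residue>Ser|Thr|Tyr|Lys|Arg)[- ]?(?P<position>\d+)\b",
    re.IGNORECASE,
)


def extract_ptms(sentence):
    """Return (modification type, residue, position) tuples found in a sentence."""
    return [
        (
            match.group("stem").lower() + "ion",
            match.group("residue").capitalize(),
            int(match.group("position")),
        )
        for match in PTM_PATTERN.finditer(sentence)
    ]
```

Limiting the gap between the modification word and the site to 40 characters within one sentence is an arbitrary choice that trades recall for precision; a real curation tool would need many more patterns and a protein-name ranking step.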
Affiliation(s)
- Anne-Lise Veuthey
- Swiss-Prot Group, SIB Swiss Institute of Bioinformatics, 1 Michel Servet, 1211 Geneva 4, Switzerland.
14
BioEve Search: A Novel Framework to Facilitate Interactive Literature Search. Adv Bioinformatics 2012; 2012:509126. [PMID: 22693501 PMCID: PMC3368157 DOI: 10.1155/2012/509126] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 11/15/2011] [Revised: 03/07/2012] [Accepted: 03/28/2012] [Indexed: 11/17/2022] Open
Abstract
Background. Recent advances in computational and biological methods in the last two decades have remarkably changed the scale of biomedical research, and with them began unprecedented growth in both the production of biomedical data and the amount of published literature discussing it. An automated extraction system coupled with a cognitive search and navigation service over these document collections would not only save time and effort, but also pave the way to discover hitherto unknown information implicitly conveyed in the texts. Results. We developed a novel framework (named “BioEve”) that seamlessly integrates Faceted Search (Information Retrieval) with an Information Extraction module to provide an interactive search experience for researchers in the life sciences. It enables guided step-by-step search query refinement by suggesting concepts and entities (like genes, drugs, and diseases) to quickly filter and modify search direction, thereby facilitating an enriched paradigm where users can discover related concepts and keywords to search while information seeking. Conclusions. The BioEve Search framework makes it easier to enable scalable interactive search over large collections of textual articles and to discover knowledge hidden in thousands of biomedical literature articles with ease.
15
Xu R, Wang Q. A knowledge-driven conditional approach to extract pharmacogenomics specific drug-gene relationships from free text. J Biomed Inform 2012; 45:827-34. [PMID: 22561026 DOI: 10.1016/j.jbi.2012.04.011] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.5] [Received: 03/18/2011] [Revised: 12/29/2011] [Accepted: 04/19/2012] [Indexed: 10/28/2022]
Abstract
An important task in pharmacogenomics (PGx) studies is to identify genetic variants that may impact drug response. The success of many systematic and integrative computational approaches for PGx studies depends on the availability of accurate, comprehensive and machine understandable drug-gene relationship knowledge bases. Scientific literature is one of the most comprehensive knowledge sources for PGx-specific drug-gene relationships. However, the major barrier in accessing this information is that the knowledge is buried in a large amount of free text with limited machine understandability. Therefore there is a need to develop automatic approaches to extract structured PGx-specific drug-gene relationships from unstructured free text literature. In this study, we have developed a conditional relationship extraction approach to extract PGx-specific drug-gene pairs from 20 million MEDLINE abstracts using known drug-gene pairs as prior knowledge. We have demonstrated that the conditional drug-gene relationship extraction approach significantly improves the precision and F1 measure compared to the unconditioned approach (precision: 0.345 vs. 0.11; recall: 0.481 vs. 1.00; F1: 0.402 vs. 0.201). In this study, a method based on co-occurrence is used as the underlying relationship extraction method for its simplicity. It can be replaced by or combined with more advanced methods such as machine learning or natural language processing approaches to further improve the performance of the drug-gene relationship extraction from free text. Our method is not limited to extracting a drug-gene relationship; it can be generalized to extract other types of relationships when related background knowledge bases exist.
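As a rough illustration of the co-occurrence baseline and of how the reported F1 values follow from precision and recall, consider the sketch below. It implements only plain sentence-level co-occurrence, not the conditional step that uses known drug-gene pairs, since the abstract does not specify that step; the drug and gene lists and the sentences are invented.

```python
from itertools import product


def cooccurring_pairs(sentences, drugs, genes):
    """Collect every (drug, gene) pair mentioned together in one sentence."""
    pairs = set()
    for sentence in sentences:
        tokens = set(sentence.lower().replace(",", " ").replace(".", " ").split())
        found_drugs = {d for d in drugs if d.lower() in tokens}
        found_genes = {g for g in genes if g.lower() in tokens}
        pairs.update(product(found_drugs, found_genes))
    return pairs


def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented example input.
SENTENCES = [
    "Warfarin dose is influenced by CYP2C9 and VKORC1.",
    "Aspirin was well tolerated.",
]
DRUGS = {"warfarin", "aspirin"}
GENES = {"CYP2C9", "VKORC1", "TPMT"}

# Conditional approach, using the figures reported above: rounds to 0.402.
conditional_f1 = round(f1_score(0.345, 0.481), 3)
```

Applying the same formula to the unconditioned figures (precision 0.11, recall 1.00) gives about 0.198 rather than the reported 0.201; the gap is presumably due to the precision being rounded in the abstract.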
Affiliation(s)
- Rong Xu
- Medical Informatics Division, Case Western Reserve University, OH, USA.
16
Hossain MS, Gresock J, Edmonds Y, Helm R, Potts M, Ramakrishnan N. Connecting the dots between PubMed abstracts. PLoS One 2012; 7:e29509. [PMID: 22235301 PMCID: PMC3250456 DOI: 10.1371/journal.pone.0029509] [Citation(s) in RCA: 28] [Impact Index Per Article: 2.2] [Received: 07/26/2011] [Accepted: 11/29/2011] [Indexed: 11/23/2022] Open
Abstract
Background There are now a multitude of articles published in a diversity of journals providing information about genes, proteins, pathways, and diseases. Each article investigates subsets of a biological process, but to gain insight into the functioning of a system as a whole, we must integrate information from multiple publications. Particularly, unraveling relationships between extra-cellular inputs and downstream molecular response mechanisms requires integrating conclusions from diverse publications. Methodology We present an automated approach to biological knowledge discovery from PubMed abstracts, suitable for “connecting the dots” across the literature. We describe a storytelling algorithm that, given a start and end publication, typically with little or no overlap in content, identifies a chain of intermediate publications from one to the other, such that neighboring publications have significant content similarity. The quality of discovered stories is measured using local criteria such as the size of supporting neighborhoods for each link and the strength of individual links connecting publications, as well as global metrics of dispersion. To ensure that the story stays coherent as it meanders from one publication to another, we demonstrate the design of novel coherence and overlap filters for use as post-processing steps. Conclusions We demonstrate the application of our storytelling algorithm to three case studies: i) a many-one study exploring relationships between multiple cellular inputs and a molecule responsible for cell-fate decisions, ii) a many-many study exploring the relationships between multiple cytokines and multiple downstream transcription factors, and iii) a one-to-one study to showcase the ability to recover a cancer related association, viz. the Warburg effect, from past literature. The storytelling pipeline helps narrow down a scientist's focus from several hundreds of thousands of relevant documents to only around a hundred stories. 
We argue that our approach can serve as a valuable discovery aid for hypothesis generation and connection exploration in large unstructured biological knowledge bases.
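The chaining idea described in the abstract above (a path of intermediate publications in which neighbors share significant content) can be sketched as a graph search. This is a minimal illustration, not the authors' storytelling algorithm: it assumes each document is reduced to a set of terms, uses Jaccard similarity as the link strength, and finds a shortest chain with breadth-first search. The names `jaccard`, `find_story`, and `min_sim`, and the toy documents, are hypothetical.

```python
from collections import deque


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity between two term sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def find_story(docs: dict, start: str, end: str, min_sim: float = 0.3):
    """Return a shortest chain of document ids from start to end in which
    every pair of neighbors has similarity >= min_sim, or None if no chain exists."""
    parents = {start: None}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == end:
            chain = []
            while current is not None:
                chain.append(current)
                current = parents[current]
            return chain[::-1]
        for other in docs:
            if other not in parents and jaccard(docs[current], docs[other]) >= min_sim:
                parents[other] = current
                queue.append(other)
    return None


# Toy term sets (hypothetical); A and D share no terms at all.
docs = {
    "A": {"cytokine", "receptor", "signal"},
    "B": {"receptor", "signal", "kinase"},
    "C": {"kinase", "signal", "transcription"},
    "D": {"transcription", "factor", "kinase"},
}
story = find_story(docs, "A", "D")
print(story)
```

The start and end documents have no overlap, yet each link in the recovered chain meets the similarity threshold. The coherence and overlap filters mentioned in the abstract would be applied as post-processing on chains like this one and are not modeled here.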
Affiliation(s)
- M Shahriar Hossain
- Department of Computer Science, Virginia Tech, Blacksburg, Virginia, United States of America.
|
17
|
Kilicoglu H, Bergler S. Effective bio-event extraction using trigger words and syntactic dependencies. Comput Intell 2011. [DOI: 10.1111/j.1467-8640.2011.00401.x] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.0] [Indexed: 11/29/2022]
|
18
|
Wang HC, Chen YHS, Kao HY, Tsai SJ. Inference of transcriptional regulatory network by bootstrapping patterns. Bioinformatics 2011; 27:1422-8. [DOI: 10.1093/bioinformatics/btr155] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Indexed: 11/13/2022]
|
19
|
Khordad M, Mercer RE, Rogan P. Improving Phenotype Name Recognition. Advances in Artificial Intelligence 2011. [DOI: 10.1007/978-3-642-21043-3_30] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.6] [Indexed: 01/26/2023]
|
20
|
Mukhopadhyay S, Palakal M, Maddu K. Multi-way association extraction and visualization from biological text documents using hyper-graphs: applications to genetic association studies for diseases. Artif Intell Med 2010; 49:145-54. [PMID: 20382004 DOI: 10.1016/j.artmed.2010.03.002] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Received: 03/15/2010] [Revised: 03/16/2010] [Accepted: 03/18/2010] [Indexed: 10/19/2022]
Abstract
OBJECTIVES Biological research literature, as in many other domains of human endeavor, represents a rich, ever-growing source of knowledge. An important form of such biological knowledge consists of associations among biological entities such as genes, proteins, diseases, drugs, and chemicals. There has been a considerable amount of recent research on the extraction of various kinds of binary associations (e.g., gene-gene, gene-protein, protein-protein) using different text mining approaches. However, an important aspect of such associations (e.g., "gene A activates protein B") is identifying the context in which they occur (e.g., "gene A activates protein B in the context of disease C in organ D under the influence of chemical E"). Such contexts can be represented appropriately by a multi-way relationship involving more than two objects (e.g., objects A, B, C, D, E) rather than the usual binary relationship (objects A and B). METHODS Such multi-way relations naturally lead to a hyper-graph representation of the knowledge rather than a binary graph. Hyper-graph based multi-way knowledge extraction from biological text literature is a computationally difficult problem (due to its combinatorial nature) that has not received much attention from the bioinformatics research community. In this paper, we describe and compare two different approaches to such multi-way hyper-graph extraction: one based on an exhaustive enumeration of all multi-way hyper-edges and the other based on an extension of the well-known A Priori algorithm for structured data to the case of unstructured textual data. We also present a representative graph based approach towards visualizing these genetic association hyper-graphs.
RESULTS Two case studies are conducted for two biomedical problems (related to lung cancer and colorectal cancer, respectively), illustrating that the latter approach (using the text-based A Priori method) identifies the same hyper-edges as the former approach (the exhaustive method), but at a much lower computational cost. The extracted hyper-relations are presented in the paper as cognition-rich representative graphs, representing the corresponding hyper-graphs. CONCLUSIONS The text-based A Priori algorithm is a practical, useful method to extract hyper-graphs representing multi-way associations among biological objects. These hyper-graphs and their visualization using representative graphs can provide important contextual information for understanding gene-gene associations relevant to specific diseases.
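The comparison between exhaustive enumeration and an A Priori style search can be illustrated on toy data. This is a minimal sketch under stated assumptions, not the paper's implementation: each document is reduced to the set of entities it mentions, a hyper-edge is any set of two or more entities co-mentioned in at least `min_support` documents, and the function names and example entities are hypothetical.

```python
from itertools import combinations


def exhaustive(docs, min_support):
    """Count every entity subset (size >= 2) of every document, then threshold."""
    counts = {}
    for entities in docs:
        for k in range(2, len(entities) + 1):
            for combo in combinations(sorted(entities), k):
                counts[combo] = counts.get(combo, 0) + 1
    return {combo for combo, n in counts.items() if n >= min_support}


def apriori(docs, min_support):
    """Level-wise search: a k-set becomes a candidate only if all of its
    (k-1)-subsets were frequent at the previous level."""
    def support(itemset):
        return sum(1 for d in docs if set(itemset) <= d)

    items = sorted({e for d in docs for e in d})
    frequent_prev = {(i,) for i in items if support((i,)) >= min_support}
    result = set()
    k = 2
    while frequent_prev:
        universe = sorted({e for s in frequent_prev for e in s})
        candidates = [
            c for c in combinations(universe, k)
            if all(sub in frequent_prev for sub in combinations(c, k - 1))
        ]
        frequent_prev = {c for c in candidates if support(c) >= min_support}
        result |= frequent_prev
        k += 1
    return result


# Toy corpus (hypothetical entity mentions per document).
docs = [
    {"geneA", "proteinB", "lung cancer"},
    {"geneA", "proteinB", "lung cancer", "chemE"},
    {"geneA", "chemE"},
    {"proteinB", "lung cancer"},
]
edges_exhaustive = exhaustive(docs, min_support=2)
edges_apriori = apriori(docs, min_support=2)
print(sorted(edges_apriori))
```

Both functions return the same hyper-edges on this corpus. The level-wise version avoids counting supersets of infrequent sets, which is where the computational saving described in the abstract would come from on larger corpora.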
Affiliation(s)
- Snehasis Mukhopadhyay
- Department of Computer and Information Science, Indiana University Purdue University Indianapolis, 723 West Michigan Street SL 280J, Indianapolis, IN 46202, USA.
|
21
|
Measuring prediction capacity of individual verbs for the identification of protein interactions. J Biomed Inform 2009; 43:200-7. [PMID: 19818874 DOI: 10.1016/j.jbi.2009.09.007] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.9] [Received: 10/07/2008] [Revised: 07/26/2009] [Accepted: 09/24/2009] [Indexed: 11/20/2022]
Abstract
MOTIVATION The identification of events such as protein-protein interactions (PPIs) from the scientific literature is a complex task. One of the reasons is that there is no formal syntax to denote such relations in the scientific literature. Nonetheless, it is important to understand such relational event representations to improve information extraction solutions (e.g., for gene regulatory events). In this study, we analyze publicly available protein interaction corpora (AIMed, BioInfer, BioCreAtIve II) to determine the scope of verbs used to denote protein interactions and to measure their predictive capacity for the identification of PPI events. Our analysis is based on syntactical language patterns. This restriction has the advantage that the verb mention is used as the independent variable in the experiments, enabling comparability of results in the usage of the verbs. The initial selection of verbs has been generated from a systematic analysis of the scientific literature and existing corpora for PPIs. We distinguish modifying interactions (MIs) such as posttranslational modifications (PTMs) from non-modifying interactions (NMIs) and assume that MIs have a higher predictive capacity due to stronger scientific evidence proving the interaction. We found that MIs are less frequent in the corpus but can be extracted at the same precision levels as PPIs. A significant portion of correct PPI reports in the BioCreAtIve II corpus use the verb "associate", which semantically does not prove a relation. The performance of every monitored verb is listed and allows the selection of specific verbs to improve the performance of PPI extraction solutions. Programmatic access to the text processing modules is available online (www.ebi.ac.uk/webservices/whatizit/info.jsf) and the full analysis of Medline abstracts will be made available through the Web pages of the Rebholz group.
|
22
|
Nagel K, Jimeno-Yepes A, Rebholz-Schuhmann D. Annotation of protein residues based on a literature analysis: cross-validation against UniProtKb. BMC Bioinformatics 2009; 10 Suppl 8:S4. [PMID: 19758468 PMCID: PMC2745586 DOI: 10.1186/1471-2105-10-s8-s4] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.3] [Indexed: 11/29/2022]
Abstract
Background A protein annotation database, such as the Universal Protein Resource knowledge base (UniProtKb), is a valuable resource for the validation and interpretation of predicted 3D structure patterns in proteins. Existing studies have focussed on point mutation extraction methods from biomedical literature which can be used to support the time consuming work of manual database curation. However, these methods were limited to point mutation extraction and do not extract features for the annotation of proteins at the residue level. Results This work introduces a system that identifies protein residues in MEDLINE abstracts and annotates them with features extracted from the context written in the surrounding text. MEDLINE abstract texts have been processed to identify protein mentions in combination with taxonomic species and protein residues (F1-measure 0.52). The identified protein-species-residue triplets have been validated and benchmarked against reference data resources (UniProtKb, average F1-measure of 0.54). Then, contextual features were extracted through shallow and deep parsing and the features have been classified into predefined categories (F1-measure ranges from 0.15 to 0.67). Furthermore, the feature sets have been aligned with annotation types in UniProtKb to assess the relevance of the annotations for ongoing curation projects. Altogether, the annotations have been assessed automatically and manually against reference data resources. Conclusion This work proposes a solution for the automatic extraction of functional annotation for protein residues from biomedical articles. The presented approach is an extension to other existing systems in that a wider range of residue entities are considered and that features of residues are extracted as annotations.
Affiliation(s)
- Kevin Nagel
- European Bioinformatics Institute, Hinxton, Cambridge, UK.
|
23
|
Yang Z, Lin H, Li Y. BioPPISVMExtractor: a protein-protein interaction extractor for biomedical literature using SVM and rich feature sets. J Biomed Inform 2009; 43:88-96. [PMID: 19706337 DOI: 10.1016/j.jbi.2009.08.013] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.1] [Received: 12/17/2008] [Revised: 08/05/2009] [Accepted: 08/18/2009] [Indexed: 11/18/2022]
Abstract
Protein-protein interactions play a key role in various aspects of the structural and functional organization of the cell. Knowledge about them unveils the molecular mechanisms of biological processes. However, the amount of biomedical literature regarding protein interactions is increasing rapidly, and it is difficult for interaction database curators to detect and curate protein interaction information manually. This paper presents an SVM-based system, named BioPPISVMExtractor, to identify protein-protein interactions in biomedical literature. This system uses rich feature sets for SVM classification, including word features, a keyword feature, a protein-name distance feature, and a Link path feature. In addition, the Link Grammar extraction result feature is introduced to improve the precision rate. Experimental evaluations with other state-of-the-art PPI extraction systems tested on the DIP corpus indicate that BioPPISVMExtractor can substantially improve recall at the cost of a moderate decline in precision.
Affiliation(s)
- Zhihao Yang
- Department of Computer Science and Engineering, Dalian University of Technology, Dalian 116023, China.
|
24
|
Cohen KB, Palmer M, Hunter L. Nominalization and alternations in biomedical language. PLoS One 2008; 3:e3158. [PMID: 18779866 PMCID: PMC2527518 DOI: 10.1371/journal.pone.0003158] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Received: 04/18/2008] [Accepted: 06/04/2008] [Indexed: 12/04/2022]
Abstract
Background This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of stimulate in FSH stimulates follicular development and follicular development is stimulated by FSH. The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts. Methodology/Principal Findings We examined 1,872 tokens of the ten most common domain-specific verbs or their zero-related nouns in the PennBioIE corpus and labelled them for the presence or absence of three alternations. We then annotated the arguments of 746 tokens of the nominalizations related to these verbs and counted alternations related to the presence or absence of arguments and to the syntactic position of non-absent arguments. We found that alternations are quite common both for verbs and for nominalizations. We also found a previously undescribed alternation involving an adjectival present participle. Conclusions/Significance We found that even in this semantically restricted domain, alternations are quite common, and alternations involving nominalizations are exceptionally diverse. Nonetheless, the sublanguage model applies to biomedical language. We also report on a previously undescribed alternation involving an adjectival present participle.
Affiliation(s)
- K Bretonnel Cohen
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, Colorado, United States of America.
|
25
|
|
26
|
Witte R, Baker CJO. Towards a systematic evaluation of protein mutation extraction systems. J Bioinform Comput Biol 2008; 5:1339-59. [PMID: 18172932 DOI: 10.1142/s0219720007003193] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Received: 07/25/2007] [Revised: 09/20/2007] [Accepted: 09/30/2007] [Indexed: 11/18/2022]
Abstract
The development of text analysis systems targeting the extraction of information about mutations from research publications is an emergent topic in biomedical research. Current systems differ in both scope and approach, thus preventing a meaningful comparison of their performance and therefore possible synergies. To overcome this evaluation bottleneck, we developed a comprehensive framework for the systematic analysis of mutation extraction systems, precisely defining tasks and corresponding evaluation metrics, that will allow a comparison of existing and future applications.
Affiliation(s)
- René Witte
- Universität Karlsruhe (TH), Institut für Programmstrukturen und Datenorganisation (IPD), Am Fasanengarten 5, 76128, Karlsruhe, Germany.
|
27
|
Hunter L, Lu Z, Firby J, Baumgartner WA, Johnson HL, Ogren PV, Cohen KB. OpenDMAP: an open source, ontology-driven concept analysis engine, with applications to capturing knowledge regarding protein transport, protein interactions and cell-type-specific gene expression. BMC Bioinformatics 2008; 9:78. [PMID: 18237434 PMCID: PMC2275248 DOI: 10.1186/1471-2105-9-78] [Citation(s) in RCA: 93] [Impact Index Per Article: 5.5] [Received: 08/29/2007] [Accepted: 01/31/2008] [Indexed: 12/03/2022]
Abstract
Background Information extraction (IE) efforts are widely acknowledged to be important in harnessing the rapid advance of biomedical knowledge, particularly in areas where important factual information is published in a diverse literature. Here we report on the design, implementation and several evaluations of OpenDMAP, an ontology-driven, integrated concept analysis system. It significantly advances the state of the art in information extraction by leveraging knowledge in ontological resources, integrating diverse text processing applications, and using an expanded pattern language that allows the mixing of syntactic and semantic elements and variable ordering. Results OpenDMAP information extraction systems were produced for extracting protein transport assertions (transport), protein-protein interaction assertions (interaction) and assertions that a gene is expressed in a cell type (expression). Evaluations were performed on each system, resulting in F-scores ranging from .26 – .72 (precision .39 – .85, recall .16 – .85). Additionally, each of these systems was run over all abstracts in MEDLINE, producing a total of 72,460 transport instances, 265,795 interaction instances and 176,153 expression instances. Conclusion OpenDMAP advances the performance standards for extracting protein-protein interaction predications from the full texts of biomedical research articles. Furthermore, this level of performance appears to generalize to other information extraction tasks, including extracting information about predicates of more than two arguments. The output of the information extraction system is always constructed from elements of an ontology, ensuring that the knowledge representation is grounded with respect to a carefully constructed model of reality. The results of these efforts can be used to increase the efficiency of manual curation efforts and to provide additional features in systems that integrate multiple sources for information extraction. 
The open source OpenDMAP code library is freely available at
Affiliation(s)
- Lawrence Hunter
- Center for Computational Pharmacology, University of Colorado School of Medicine, Aurora, CO 80045, USA.
|
28
|
Li J, Zhang Z, Li X, Chen H. Kernel-based learning for biomedical relation extraction. J Am Soc Inf Sci Technol 2008. [DOI: 10.1002/asi.20791] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.5] [Indexed: 11/09/2022]
|
29
|
Zhou D, He Y. Extracting interactions between proteins from the literature. J Biomed Inform 2007; 41:393-407. [PMID: 18207462 DOI: 10.1016/j.jbi.2007.11.008] [Citation(s) in RCA: 59] [Impact Index Per Article: 3.3] [Received: 03/13/2007] [Revised: 11/21/2007] [Accepted: 11/28/2007] [Indexed: 11/29/2022]
Abstract
During the last decade, biomedicine has witnessed tremendous development. Large amounts of experimental and computational biomedical data have been generated along with new discoveries, accompanied by an exponential increase in the number of biomedical publications describing these discoveries. In the meantime, there has been great interest within scientific communities in text mining tools that find knowledge, such as protein-protein interactions, that is most relevant and useful for specific analysis tasks. This paper provides an outline of the various information extraction methods in the biomedical domain, especially for the discovery of protein-protein interactions. It surveys methodologies involved in analyzing and processing plain text, categorizes current work in biomedical information extraction, and provides examples of these methods. Challenges in the field are also presented and possible solutions are discussed.
Affiliation(s)
- Deyu Zhou
- Informatics Research Centre, The University of Reading, Reading, RG6 6BX, UK.
|
30
|
Quiñones KD, Su H, Marshall B, Eggers S, Chen H. User-centered evaluation of Arizona BioPathway: an information extraction, integration, and visualization system. IEEE Trans Inf Technol Biomed 2007; 11:527-36. [PMID: 17912969 DOI: 10.1109/titb.2006.889706] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/08/2022]
Abstract
Explosive growth in biomedical research has made automated information extraction, knowledge integration, and visualization increasingly important and critically needed. The Arizona BioPathway (ABP) system extracts and displays biological regulatory pathway information from the abstracts of journal articles. This study uses relations extracted from more than 200 PubMed abstracts presented in a tabular and graphical user interface with built-in search and aggregation functionality. This paper presents a task-centered assessment of the usefulness and usability of the ABP system focusing on its relation aggregation and visualization functionalities. Results suggest that our graph-based visualization is more efficient in supporting pathway analysis tasks and is perceived as more useful and easier to use as compared to a text-based literature-viewing method. Relation aggregation significantly contributes to knowledge-acquisition efficiency. Together, the graphic and tabular views in the ABP Visualizer provide a flexible and effective interface for pathway relation browsing and analysis. Our study contributes to pathway-related research and biological information extraction by assessing the value of a multiview, relation-based interface that supports user-controlled exploration of pathway information across multiple granularities.
Affiliation(s)
- Karin D Quiñones
- Department of Management Information Systems, University of Arizona, Tucson, AZ 85721, USA.
|
31
|
Sanchez-Graillet O, Poesio M. Negation of protein-protein interactions: analysis and extraction. Bioinformatics 2007; 23:i424-32. [PMID: 17646327 DOI: 10.1093/bioinformatics/btm184] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.4] [Indexed: 11/13/2022]
Abstract
MOTIVATION Negative information about protein-protein interactions (from uncertainty about the occurrence of an interaction to knowledge that it did not occur) is often of great use to biologists and could lead to important discoveries. Yet, to our knowledge, no approaches focusing on extracting such information have been proposed in the text mining literature. RESULTS In this work, we present an analysis of the types of negative information that are reported, and a heuristic-based system using a full dependency parser to extract such information. We performed a preliminary evaluation study that shows encouraging results for our system. Finally, we have obtained an initial corpus of negative protein-protein interactions as a basis for the construction of larger ones. AVAILABILITY The corpus is available by request from the authors.
|
32
|
Rinaldi F, Schneider G, Kaljurand K, Hess M, Romacker M. An environment for relation mining over richly annotated corpora: the case of GENIA. BMC Bioinformatics 2006; 7 Suppl 3:S3. [PMID: 17134476 PMCID: PMC1764447 DOI: 10.1186/1471-2105-7-s3-s3] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.3] [Indexed: 11/11/2022]
Abstract
Background The biomedical domain is witnessing rapid growth in the amount of published scientific results, which makes it increasingly difficult to filter the core information. There is a real need for support tools that 'digest' the published results and extract the most important information. Results We describe and evaluate an environment supporting the extraction of domain-specific relations, such as protein-protein interactions, from a richly-annotated corpus. We use full, deep-linguistic parsing and manually created, versatile patterns, expressing a large set of syntactic alternations, plus semantic ontology information. Conclusion The experiments show that the described approach is capable of delivering high-precision results, while maintaining sufficient levels of recall. The high level of abstraction of the rules used by the system, which are considerably more powerful and versatile than finite-state approaches, allows speedy interactive development and validation.
Affiliation(s)
- Fabio Rinaldi
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
- Gerold Schneider
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
- Kaarel Kaljurand
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
- Michael Hess
- Institute of Computational Linguistics, IFI, University of Zurich, Switzerland
|
33
|
Masseroli M, Kilicoglu H, Lang FM, Rindflesch TC. Argument-predicate distance as a filter for enhancing precision in extracting predications on the genetic etiology of disease. BMC Bioinformatics 2006; 7:291. [PMID: 16762065 PMCID: PMC1564420 DOI: 10.1186/1471-2105-7-291] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.1] [Received: 11/11/2005] [Accepted: 06/08/2006] [Indexed: 11/10/2022]
Abstract
BACKGROUND Genomic functional information is valuable for biomedical research. However, such information frequently needs to be extracted from the scientific literature and structured in order to be exploited by automatic systems. Natural language processing is increasingly used for this purpose although it inherently involves errors. A postprocessing strategy that selects relations most likely to be correct is proposed and evaluated on the output of SemGen, a system that extracts semantic predications on the etiology of genetic diseases. Based on the number of intervening phrases between an argument and its predicate, we defined a heuristic strategy to filter the extracted semantic relations according to their likelihood of being correct. We also applied this strategy to relations identified with co-occurrence processing. Finally, we exploited postprocessed SemGen predications to investigate the genetic basis of Parkinson's disease. RESULTS The filtering procedure for increased precision is based on the intuition that arguments which occur close to their predicate are easier to identify than those at a distance. For example, if gene-gene relations are filtered for arguments at a distance of 1 phrase from the predicate, precision increases from 41.95% (baseline) to 70.75%. Since this proximity filtering is based on syntactic structure, applying it to the results of co-occurrence processing is useful, but not as effective as when applied to the output of natural language processing. In an effort to exploit SemGen predications on the etiology of disease after increasing precision with postprocessing, a gene list was derived from extracted information enhanced with postprocessing filtering and was automatically annotated with GFINDer, a Web application that dynamically retrieves functional and phenotypic information from structured biomolecular resources. 
Two of the genes in this list are likely relevant to Parkinson's disease but are not associated with this disease in several important databases on genetic disorders. CONCLUSION Information based on the proximity postprocessing method we suggest is of sufficient quality to be profitably used for subsequent applications aimed at uncovering new biomedical knowledge. Although proximity filtering is only marginally effective for enhancing the precision of relations extracted with co-occurrence processing, it is likely to benefit methods based, even partially, on syntactic structure, regardless of the relation.
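The proximity filter described in the abstract above can be sketched as a simple post-processing step. This is a minimal illustration, not SemGen's code: it assumes each extracted predication records how many phrases separate each argument from the predicate (with 1 meaning adjacent), and the class, field names, placeholder gene symbols, and gold set are all hypothetical toy data rather than results from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Predication:
    subject: str
    predicate: str
    obj: str
    subject_distance: int  # phrases from subject to predicate (1 = adjacent)
    object_distance: int   # phrases from object to predicate (1 = adjacent)


def proximity_filter(preds, max_distance=1):
    """Keep predications whose arguments both lie within max_distance phrases."""
    return [
        p for p in preds
        if p.subject_distance <= max_distance and p.object_distance <= max_distance
    ]


def precision(preds, gold):
    """Fraction of predications whose (subject, predicate, object) is in gold."""
    if not preds:
        return 0.0
    correct = sum(1 for p in preds if (p.subject, p.predicate, p.obj) in gold)
    return correct / len(preds)


# Toy extraction output and gold standard (fabricated for illustration only).
extracted = [
    Predication("GENE_A", "INTERACTS_WITH", "GENE_B", 1, 1),
    Predication("GENE_C", "STIMULATES", "GENE_D", 1, 1),
    Predication("GENE_E", "INHIBITS", "GENE_F", 3, 1),
    Predication("GENE_G", "INTERACTS_WITH", "GENE_H", 1, 4),
    Predication("GENE_I", "STIMULATES", "GENE_J", 2, 1),
]
gold = {
    ("GENE_A", "INTERACTS_WITH", "GENE_B"),
    ("GENE_C", "STIMULATES", "GENE_D"),
    ("GENE_I", "STIMULATES", "GENE_J"),
}

baseline = precision(extracted, gold)
filtered = precision(proximity_filter(extracted), gold)
print(f"baseline precision: {baseline:.2f}, filtered precision: {filtered:.2f}")
```

On this toy input the filter raises precision but discards one correct predication, which mirrors the precision-for-recall trade-off such a filter implies. How distance is counted in the actual system is not specified in the abstract, so the convention used here is an assumption.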
Affiliation(s)
- Marco Masseroli
- Bioengineering Department, Politecnico di Milano, Milan, Italy
- Halil Kilicoglu
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, USA
- François-Michel Lang
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, USA
- Thomas C Rindflesch
- Lister Hill National Center for Biomedical Communications, National Library of Medicine, Bethesda, USA
|
34
|
Marshall B, Su H, McDonald D, Eggers S, Chen H. Aggregating automatically extracted regulatory pathway relations. IEEE Trans Inf Technol Biomed 2006; 10:100-8. [PMID: 16445255 DOI: 10.1109/titb.2005.856857] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.4] [Indexed: 11/10/2022]
Abstract
Automatic tools to extract information from biomedical texts are needed to help researchers leverage the vast and increasing body of biomedical literature. While several biomedical relation extraction systems have been created and tested, little work has been done to meaningfully organize the extracted relations. Organizational processes should consolidate multiple references to the same objects over various levels of granularity, connect those references to other resources, and capture contextual information. We propose a feature decomposition approach to relation aggregation to support a five-level aggregation framework. Our BioAggregate tagger uses this approach to identify key features in extracted relation name strings. We show encouraging feature assignment accuracy and report substantial consolidation in a network of extracted relations.
|
35
|
Leroy G, Chen H. Genescene: An ontology-enhanced integration of linguistic and co-occurrence based relations in biomedical texts. J Am Soc Inf Sci Technol 2005. [DOI: 10.1002/asi.20135] [Citation(s) in RCA: 27] [Impact Index Per Article: 1.4] [Indexed: 11/08/2022]
|