1
|
Karadeniz İ, Özgür A. Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinformatics 2019; 20:156. [PMID: 30917789 PMCID: PMC6437991 DOI: 10.1186/s12859-019-2678-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Accepted: 02/13/2019] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Although there is an enormous number of textual resources in the biomedical domain, currently, manually curated resources cover only a small part of the existing knowledge. The vast majority of these information is in unstructured form which contain nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names from text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities. This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces, and a syntactic parser to give higher weight to the most informative word in the named entity mentions. RESULTS We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data. CONCLUSIONS The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.
Collapse
Affiliation(s)
- İlknur Karadeniz
- Department of Computer Engineering, Boğaziçi University, İstanbul, 34342, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, İstanbul, 34342, Turkey.
| |
Collapse
|
2
|
Teodoro D, Mottin L, Gobeill J, Gaudinat A, Vachon T, Ruch P. Improving average ranking precision in user searches for biomedical research datasets. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018; 2017:4600047. [PMID: 29220475 PMCID: PMC5714153 DOI: 10.1093/database/bax083] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/21/2017] [Accepted: 10/12/2017] [Indexed: 11/15/2022]
Abstract
Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participant’s best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system’s performance increasing our baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results. Database URL:https://biocaddie.org/benchmark-data
Collapse
Affiliation(s)
- Douglas Teodoro
- Text Mining Group, SIB Swiss Institute of Bioinformatics, 1227 Geneva, Switzerland.,Department of Information Science, HEG Geneva HES-SO, 1227 Geneva, Switzerland
| | - Luc Mottin
- Text Mining Group, SIB Swiss Institute of Bioinformatics, 1227 Geneva, Switzerland.,Department of Information Science, HEG Geneva HES-SO, 1227 Geneva, Switzerland
| | - Julien Gobeill
- Text Mining Group, SIB Swiss Institute of Bioinformatics, 1227 Geneva, Switzerland.,Department of Information Science, HEG Geneva HES-SO, 1227 Geneva, Switzerland
| | - Arnaud Gaudinat
- Department of Information Science, HEG Geneva HES-SO, 1227 Geneva, Switzerland
| | - Thérèse Vachon
- Novartis Institutes for BioMedical Research-Text Mining Services (NIBR Informatics/TMS), Novartis Pharma AG Postfach, 4002 Basel, Switzerland
| | - Patrick Ruch
- Text Mining Group, SIB Swiss Institute of Bioinformatics, 1227 Geneva, Switzerland.,Department of Information Science, HEG Geneva HES-SO, 1227 Geneva, Switzerland
| |
Collapse
|
3
|
Sogancioglu G, Öztürk H, Özgür A. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 2018; 33:i49-i58. [PMID: 28881973 PMCID: PMC5870675 DOI: 10.1093/bioinformatics/btx238] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open
Abstract
Motivation The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text. Methods We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods. Results The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric. Availability and implementation A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.
Collapse
Affiliation(s)
- Gizem Sogancioglu
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey.,R&D and Special Projects Department, Yapı Kredi Technology, Istanbul, Turkey
| | - Hakime Öztürk
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Bogazici University, Istanbul, Turkey
| |
Collapse
|
4
|
Kim S, Islamaj Doğan R, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Batista-Navarro R, Carter J, Ananiadou S, Matos S, Santos A, Campos D, Oliveira JL, Singh O, Jonnagaddala J, Dai HJ, Su ECY, Chang YC, Su YC, Chu CH, Chen CC, Hsu WL, Peng Y, Arighi C, Wu CH, Vijay-Shanker K, Aydın F, Hüsünbeyi ZM, Özgür A, Shin SY, Kwon D, Dolinski K, Tyers M, Wilbur WJ, Comeau DC. BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database (Oxford) 2016; 2016:baw121. [PMID: 27589962 PMCID: PMC5009341 DOI: 10.1093/database/baw121] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2016] [Revised: 07/29/2016] [Accepted: 08/02/2016] [Indexed: 11/14/2022]
Abstract
BioC is a simple XML format for text, annotations and relations, and was developed to achieve interoperability for biomedical text processing. Following the success of BioC in BioCreative IV, the BioCreative V BioC track addressed a collaborative task to build an assistant system for BioGRID curation. In this paper, we describe the framework of the collaborative BioC task and discuss our findings based on the user survey. This track consisted of eight subtasks including gene/protein/organism named entity recognition, protein-protein/genetic interaction passage identification and annotation visualization. Using BioC as their data-sharing and communication medium, nine teams, world-wide, participated and contributed either new methods or improvements of existing tools to address different subtasks of the BioC track. Results from different teams were shared in BioC and made available to other teams as they addressed different subtasks of the track. In the end, all submitted runs were merged using a machine learning classifier to produce an optimized output. The biocurator assistant system was evaluated by four BioGRID curators in terms of practical usability. The curators' feedback was overall positive and highlighted the user-friendly design and the convenient gene/protein curation tool based on text mining.Database URL: http://www.biocreative.org/tasks/biocreative-v/track-1-bioc/.
Collapse
Affiliation(s)
- Sun Kim
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Rezarta Islamaj Doğan
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Andrew Chatr-Aryamontri
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
| | - Christie S Chang
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Rose Oughtred
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Jennifer Rust
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Riza Batista-Navarro
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
| | - Jacob Carter
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
| | - Sophia Ananiadou
- National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
| | - Sérgio Matos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - André Santos
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - David Campos
- BMD Software, Lda, Rua Calouste Gulbenkian 1, 3810-074 Aveiro, Portugal
| | - José Luís Oliveira
- DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
| | - Onkar Singh
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
| | - Jitendra Jonnagaddala
- School of Public Health and Community Medicine, University of New South Wales, Kensington NSW 2033, Australia Prince of Wales Clinical School, University of New South Wales, Kensington NSW 2033, Australia
| | - Hong-Jie Dai
- Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
| | - Emily Chia-Yu Su
- Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
| | - Yung-Chun Chang
- Institute of Information Science, Academia Sinica, Taipei, Taiwan Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Yu-Chen Su
- Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
| | - Chun-Han Chu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Chien Chin Chen
- Department of Information Management, National Taiwan University, Taipei, Taiwan
| | - Wen-Lian Hsu
- Institute of Information Science, Academia Sinica, Taipei, Taiwan
| | - Yifan Peng
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Cecilia Arighi
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
| | - Cathy H Wu
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
| | - K Vijay-Shanker
- Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
| | - Ferhat Aydın
- Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
| | - Zehra Melce Hüsünbeyi
- Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
| | - Arzucan Özgür
- Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
| | - Soo-Yong Shin
- Department of Biomedical Informatics, Asan Medical Center, 138-736 Seoul, South Korea
| | - Dongseop Kwon
- Department of Computer Engineering, Myongji University, 449-728 Yongin, South Korea
| | - Kara Dolinski
- Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
| | - Mike Tyers
- Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
| | - W John Wilbur
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| | - Donald C Comeau
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
| |
Collapse
|