Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Aydın F, Hüsünbeyi ZM, Özgür A. Automatic query generation using word embeddings for retrieving passages describing experimental methods. Database (Oxford) 2017;2017:baw166. [PMID: 28077568 PMCID: PMC5225401 DOI: 10.1093/database/baw166] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/26/2016] [Revised: 12/01/2016] [Accepted: 12/01/2016] [Indexed: 01/01/2023]

For:	Aydın F, Hüsünbeyi ZM, Özgür A. Automatic query generation using word embeddings for retrieving passages describing experimental methods. Database (Oxford) 2017;2017:baw166. [PMID: 28077568 PMCID: PMC5225401 DOI: 10.1093/database/baw166] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 02/26/2016] [Revised: 12/01/2016] [Accepted: 12/01/2016] [Indexed: 01/01/2023]

Number

Cited by Other Article(s)

Karadeniz İ, Özgür A. Linking entities through an ontology using word embeddings and syntactic re-ranking. BMC Bioinformatics 2019;20:156. [PMID: 30917789 PMCID: PMC6437991 DOI: 10.1186/s12859-019-2678-8] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2018] [Accepted: 02/13/2019] [Indexed: 11/10/2022] Open

Abstract

BACKGROUND

Although there is an enormous number of textual resources in the biomedical domain, currently, manually curated resources cover only a small part of the existing knowledge. The vast majority of these information is in unstructured form which contain nonstandard naming conventions. The task of named entity recognition, which is the identification of entity names from text, is not adequate without a standardization step. Linking each identified entity mention in text to an ontology/dictionary concept is an essential task to make sense of the identified entities. This paper presents an unsupervised approach for the linking of named entities to concepts in an ontology/dictionary. We propose an approach for the normalization of biomedical entities through an ontology/dictionary by using word embeddings to represent semantic spaces, and a syntactic parser to give higher weight to the most informative word in the named entity mentions.

RESULTS

We applied the proposed method to two different normalization tasks: the normalization of bacteria biotope entities through the Onto-Biotope ontology and the normalization of adverse drug reaction entities through the Medical Dictionary for Regulatory Activities (MedDRA). The proposed method achieved a precision score of 65.9%, which is 2.9 percentage points above the state-of-the-art result on the BioNLP Shared Task 2016 Bacteria Biotope test data and a macro-averaged precision score of 68.7% on the Text Analysis Conference 2017 Adverse Drug Reaction test data.

CONCLUSIONS

The core contribution of this paper is a syntax-based way of combining the individual word vectors to form vectors for the named entity mentions and ontology concepts, which can then be used to measure the similarity between them. The proposed approach is unsupervised and does not require labeled data, making it easily applicable to different domains.

Collapse

Teodoro D, Mottin L, Gobeill J, Gaudinat A, Vachon T, Ruch P. Improving average ranking precision in user searches for biomedical research datasets. DATABASE-THE JOURNAL OF BIOLOGICAL DATABASES AND CURATION 2018;2017:4600047. [PMID: 29220475 PMCID: PMC5714153 DOI: 10.1093/database/bax083] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/21/2017] [Accepted: 10/12/2017] [Indexed: 11/15/2022]

Abstract

Availability of research datasets is keystone for health and life science study reproducibility and scientific progress. Due to the heterogeneity and complexity of these data, a main challenge to be overcome by research data management systems is to provide users with the best answers for their search queries. In the context of the 2016 bioCADDIE Dataset Retrieval Challenge, we investigate a novel ranking pipeline to improve the search of datasets used in biomedical experiments. Our system comprises a query expansion model based on word embeddings, a similarity measure algorithm that takes into consideration the relevance of the query terms, and a dataset categorization method that boosts the rank of datasets matching query constraints. The system was evaluated using a corpus with 800k datasets and 21 annotated user queries, and provided competitive results when compared to the other challenge participants. In the official run, it achieved the highest infAP, being +22.3% higher than the median infAP of the participant’s best submissions. Overall, it is ranked at top 2 if an aggregated metric using the best official measures per participant is considered. The query expansion method showed positive impact on the system’s performance increasing our baseline up to +5.0% and +3.4% for the infAP and infNDCG metrics, respectively. The similarity measure algorithm showed robust performance in different training conditions, with small performance variations compared to the Divergence from Randomness framework. Finally, the result categorization did not have significant impact on the system’s performance. We believe that our solution could be used to enhance biomedical dataset management systems. The use of data driven expansion methods, such as those based on word embeddings, could be an alternative to the complexity of biomedical terminologies. Nevertheless, due to the limited size of the assessment set, further experiments need to be performed to draw conclusive results.

Database URL:https://biocaddie.org/benchmark-data

Collapse

Sogancioglu G, Öztürk H, Özgür A. BIOSSES: a semantic sentence similarity estimation system for the biomedical domain. Bioinformatics 2018;33:i49-i58. [PMID: 28881973 PMCID: PMC5870675 DOI: 10.1093/bioinformatics/btx238] [Citation(s) in RCA: 40] [Impact Index Per Article: 5.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022] Open

Abstract

Motivation

The amount of information available in textual format is rapidly increasing in the biomedical domain. Therefore, natural language processing (NLP) applications are becoming increasingly important to facilitate the retrieval and analysis of these data. Computing the semantic similarity between sentences is an important component in many NLP tasks including text retrieval and summarization. A number of approaches have been proposed for semantic sentence similarity estimation for generic English. However, our experiments showed that such approaches do not effectively cover biomedical knowledge and produce poor results for biomedical text.

Methods

We propose several approaches for sentence-level semantic similarity computation in the biomedical domain, including string similarity measures and measures based on the distributed vector representations of sentences learned in an unsupervised manner from a large biomedical corpus. In addition, ontology-based approaches are presented that utilize general and domain-specific ontologies. Finally, a supervised regression based model is developed that effectively combines the different similarity computation metrics. A benchmark data set consisting of 100 sentence pairs from the biomedical literature is manually annotated by five human experts and used for evaluating the proposed methods.

Results

The experiments showed that the supervised semantic sentence similarity computation approach obtained the best performance (0.836 correlation with gold standard human annotations) and improved over the state-of-the-art domain-independent systems up to 42.6% in terms of the Pearson correlation metric.

Availability and implementation

A web-based system for biomedical semantic sentence similarity computation, the source code, and the annotated benchmark data set are available at: http://tabilab.cmpe.boun.edu.tr/BIOSSES/.

Collapse

Kim S, Islamaj Doğan R, Chatr-Aryamontri A, Chang CS, Oughtred R, Rust J, Batista-Navarro R, Carter J, Ananiadou S, Matos S, Santos A, Campos D, Oliveira JL, Singh O, Jonnagaddala J, Dai HJ, Su ECY, Chang YC, Su YC, Chu CH, Chen CC, Hsu WL, Peng Y, Arighi C, Wu CH, Vijay-Shanker K, Aydın F, Hüsünbeyi ZM, Özgür A, Shin SY, Kwon D, Dolinski K, Tyers M, Wilbur WJ, Comeau DC. BioCreative V BioC track overview: collaborative biocurator assistant task for BioGRID. Database (Oxford) 2016;2016:baw121. [PMID: 27589962 PMCID: PMC5009341 DOI: 10.1093/database/baw121] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/06/2016] [Revised: 07/29/2016] [Accepted: 08/02/2016] [Indexed: 11/14/2022]

Affiliation(s)

Sun Kim National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Rezarta Islamaj Doğan National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Andrew Chatr-Aryamontri Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada
Christie S Chang Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Rose Oughtred Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Jennifer Rust Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Riza Batista-Navarro National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Jacob Carter National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Sophia Ananiadou National Centre for Text Mining, School of Computer Science, University of Manchester, Manchester, UK
Sérgio Matos DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
André Santos DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
David Campos BMD Software, Lda, Rua Calouste Gulbenkian 1, 3810-074 Aveiro, Portugal
José Luís Oliveira DETI/IEETA, University of Aveiro, Campus Universitário de Santiago, 3810-193 Aveiro, Portugal
Onkar Singh Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Jitendra Jonnagaddala School of Public Health and Community Medicine, University of New South Wales, Kensington NSW 2033, Australia Prince of Wales Clinical School, University of New South Wales, Kensington NSW 2033, Australia
Hong-Jie Dai Department of Computer Science and Information Engineering, National Taitung University, Taitung, Taiwan
Emily Chia-Yu Su Graduate Institute of Biomedical Informatics, College of Medical Science and Technology, Taipei Medical University, Taipei, Taiwan
Yung-Chun Chang Institute of Information Science, Academia Sinica, Taipei, Taiwan Department of Information Management, National Taiwan University, Taipei, Taiwan
Yu-Chen Su Department of Computer Science, National Tsing Hua University, Hsinchu, Taiwan
Chun-Han Chu Institute of Information Science, Academia Sinica, Taipei, Taiwan
Chien Chin Chen Department of Information Management, National Taiwan University, Taipei, Taiwan
Wen-Lian Hsu Institute of Information Science, Academia Sinica, Taipei, Taiwan
Yifan Peng Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
Cecilia Arighi Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
Cathy H Wu Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA Center for Bioinformatics & Computational Biology, University of Delaware, Newark, DE 19716, USA
K Vijay-Shanker Computer & Information Sciences, University of Delaware, Newark, DE 19716, USA
Ferhat Aydın Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Zehra Melce Hüsünbeyi Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Arzucan Özgür Department of Computer Engineering, Boğaziçi University, Bebek, 34342 Istanbul, Turkey
Soo-Yong Shin Department of Biomedical Informatics, Asan Medical Center, 138-736 Seoul, South Korea
Dongseop Kwon Department of Computer Engineering, Myongji University, 449-728 Yongin, South Korea
Kara Dolinski Lewis-Sigler Institute for Integrative Genomics, Princeton University, Princeton, NJ 08544, USA
Mike Tyers Institute for Research in Immunology and Cancer, Université de Montréal, Montréal, QC H3C 3J7, Canada The Lunenfeld-Tanenbaum Research Institute, Mount Sinai Hospital, Toronto, Ontario, Canada
W John Wilbur National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA
Donald C Comeau National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, USA

Collapse