1
|
Khader A, Ensan F. Learning to rank query expansion terms for COVID-19 scholarly search. J Biomed Inform 2023; 142:104386. [PMID: 37178780 PMCID: PMC10174726 DOI: 10.1016/j.jbi.2023.104386] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Received: 12/19/2022] [Revised: 04/19/2023] [Accepted: 05/05/2023] [Indexed: 05/15/2023]
Abstract
OBJECTIVE With the onset of the Coronavirus Disease 2019 (COVID-19) pandemic, the number of publicly available biomedical information sources has surged, making it increasingly challenging to retrieve text relevant to a topic of interest. In this paper, we propose a Contextual Query Expansion framework based on clinical Domain knowledge (CQED) for formalizing an effective search over PubMed that retrieves COVID-19 scholarly articles relevant to a given information need.
MATERIALS AND METHODS For training and evaluation, we use the widely adopted TREC-COVID benchmark. Given a query, the proposed framework uses a contextual, domain-specific neural language model to generate a set of candidate query expansion terms that enrich the original query. The framework also includes a multi-head attention mechanism that is trained alongside a learning-to-rank model for re-ranking the generated candidate expansion terms. The original query and the top-ranked expansion terms are posed to the PubMed search engine to retrieve scholarly articles relevant to the information need. The framework has four variations, depending on the learning path adopted for training and re-ranking the candidate expansion terms.
RESULTS The model drastically improves search performance compared to the original query: the improvement is 190.85% in terms of RECALL@1000 and 343.55% in terms of NDCG@1000. Additionally, the model outperforms all existing state-of-the-art baselines. In terms of P@10, the model optimized for Precision outperforms all baselines (0.7987). In terms of NDCG@10 (0.7986), MAP (0.3450) and bpref (0.4900), the CQED model optimized for an average of all retrieval measures outperforms all baselines.
CONCLUSION The proposed model successfully expands queries posed to PubMed and improves search performance compared to all existing baselines. A success/failure analysis shows that the model improved the search performance of each of the evaluated queries. Moreover, an ablation study showed that overall performance decreases if the generated candidate terms are not ranked. For future work, we would like to explore the application of the presented query expansion framework in conducting technology-assisted Systematic Literature Reviews (SLR).
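The percentage gains and NDCG figures reported in this abstract rest on standard retrieval arithmetic. The Python below is a minimal sketch of that arithmetic, not the authors' evaluation code (which the abstract does not describe): it computes a relative improvement and one common NDCG@k variant with linear gain. The function names and the toy relevance grades are illustrative assumptions, and the ideal ordering is taken from the same result list, which is a simplification of a full benchmark evaluation.

```python
import math

def relative_improvement(baseline, improved):
    """Percentage gain of `improved` over `baseline`."""
    return (improved - baseline) / baseline * 100.0

def dcg_at_k(gains, k):
    """Discounted cumulative gain over the first k graded relevance values."""
    return sum(g / math.log2(rank + 2) for rank, g in enumerate(gains[:k]))

def ndcg_at_k(gains, k):
    """DCG@k normalised by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(gains, reverse=True), k)
    return dcg_at_k(gains, k) / ideal if ideal > 0 else 0.0

# Hypothetical relevance grades for one ranked result list (2 = relevant, 0 = not).
toy_ranking = [2, 0, 1, 2, 0]
print(round(ndcg_at_k(toy_ranking, 5), 4))
print(relative_improvement(0.2, 0.5))
```

Other DCG formulations (for example, an exponential gain) exist, so absolute values may differ from those produced by a given evaluation tool.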
Affiliation(s)
- Ayesha Khader
  - Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
- Faezeh Ensan
  - Department of Electrical, Computer, and Biomedical Engineering, Toronto Metropolitan University, Toronto, Canada.
|
2
|
Sharifpour R, Wu M, Zhang X. Large-scale analysis of query logs to profile users for dataset search. JOURNAL OF DOCUMENTATION 2022. [DOI: 10.1108/jd-12-2021-0245] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/17/2022]
Abstract
Purpose With an explosion of datasets available on the Web, dataset search has gained attention as an emerging research domain. Understanding users' dataset search behaviour is imperative for providing effective data discovery services. In this paper, the authors present a study of users' dataset search behaviour through the analysis of search logs from a research data discovery portal.
Design/methodology/approach Using query- and session-based features, the authors apply cluster analysis to discover distinct user profiles with different search behaviours. One behavioural construct of particular interest is users' expertise, which the authors estimate by computing the semantic similarity between users' search queries and the titles of metadata records in the displayed search results.
Findings The findings reveal six distinct classes of user behaviour for dataset search: Expert Research, Expert Search, Expert Explore, Novice Research, Novice Search and Novice Explore.
Research limitations/implications The user profiles are derived from analysis of the search log of the research data catalogue in this study. Further research is needed to generalise the user profiles to other dataset search settings. Future research can take a confirmatory approach to verify these user groups and establish a deeper understanding of their information needs.
Practical implications The findings have implications for designing search systems that tailor search results to the diverse information needs of different user groups.
Originality/value The authors propose, for the first time, a taxonomy of users for dataset search based on their domain expertise and search behaviour.
|
3
|
Chenaina T, Neji S, Shoeb A. Query Sense Discovery Approach to Realize the User's Search Intent. INTERNATIONAL JOURNAL OF INFORMATION RETRIEVAL RESEARCH 2022. [DOI: 10.4018/ijirr.289609] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Abstract
The main goal of information retrieval is to return the documents most relevant to a user’s query. A search engine must therefore understand not only the meaning of each keyword in the query but also the sense each keyword takes in the context of the query. Discovering the query meaning is a comprehensive and evolutionary process; the precise meaning of the query is established as associations between concepts develop. The meaning determination process is modeled by a dynamic system operating in the semantic space of WordNet. To capture the meaning of a user query, the original query is reformulated into candidate queries by combining the concepts and their synonyms. A semantic score characterizing the overall meaning of each candidate query is calculated, and the candidate with the highest score is used to perform the search. The results confirm that the proposed "Query Sense Discovery" approach provides a significant improvement in several performance measures.
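The candidate-generation step described in this abstract (combining each query concept with its synonyms) can be sketched as a Cartesian product. This is an illustrative sketch only: the synonym dictionary is a hand-written stand-in for WordNet lookups, the function name is hypothetical, and the semantic scoring step is omitted because the abstract does not define it.

```python
from itertools import product

def candidate_queries(query_terms, synonyms):
    """Build every reformulation that swaps each term for itself or one of its synonyms.

    `synonyms` maps a term to a list of alternatives (a toy stand-in for WordNet).
    """
    options = [
        [term] + [s for s in synonyms.get(term, []) if s != term]
        for term in query_terms
    ]
    return [" ".join(combo) for combo in product(*options)]

# Hypothetical synonym table for a two-term query.
toy_synonyms = {"car": ["automobile"], "rental": ["hire"]}
for query in candidate_queries(["car", "rental"], toy_synonyms):
    print(query)
```

The number of candidates grows multiplicatively with the number of synonyms per term, which is presumably why a scoring step is needed to pick a single reformulation.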
Affiliation(s)
- Tarek Chenaina
  - College of Computer Science and Engineering, Taibah University, Yanbu, Saudi Arabia
- Sameh Neji
  - Faculty of Economics and Management, Sfax University, Tunisia
- Abdullah Shoeb
  - Faculty of Computers and Information, Fayoum University, Egypt
|
4
|
Yeganova L, Kim S, Chen Q, Balasanov G, Wilbur WJ, Lu Z. Better synonyms for enriching biomedical search. J Am Med Inform Assoc 2020; 27:1894-1902. [PMID: 33083825 PMCID: PMC7727334 DOI: 10.1093/jamia/ocaa151] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/23/2020] [Revised: 05/20/2020] [Accepted: 08/20/2020] [Indexed: 01/12/2023] Open
Abstract
OBJECTIVE In a biomedical literature search, the link between a query and a document is often not established, because they use different terms to refer to the same concept. Distributional word embeddings are frequently used for detecting related words by computing the cosine similarity between them. However, previous research has not established either the best embedding methods for detecting synonyms among related word pairs or how effective such methods may be. MATERIALS AND METHODS In this study, we first create the BioSearchSyn set, a manually annotated set of synonyms, to assess and compare 3 widely used word-embedding methods (word2vec, fastText, and GloVe) in their ability to detect synonyms among related pairs of words. We demonstrate the shortcomings of the cosine similarity score between word embeddings for this task: the same scores have very different meanings for the different methods. To address the problem, we propose utilizing pool adjacent violators (PAV), an isotonic regression algorithm, to transform a cosine similarity into a probability of 2 words being synonyms. RESULTS Experimental results using the BioSearchSyn set as a gold standard reveal which embedding methods have the best performance in identifying synonym pairs. The BioSearchSyn set also allows converting cosine similarity scores into probabilities, which provides a uniform interpretation of the synonymy score over different methods. CONCLUSIONS We introduced the BioSearchSyn corpus of 1000 term pairs, which allowed us to identify the best embedding method for detecting synonymy for biomedical search. Using the proposed method, we created PubTermVariants2.0: a large, automatically extracted set of synonym pairs that have augmented PubMed searches since the spring of 2019.
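The calibration idea in this abstract, mapping a cosine similarity to a probability of synonymy with pool adjacent violators (PAV), can be sketched as follows. This is a minimal textbook PAV under stated assumptions, not the authors' implementation: the function names and the toy labelled pairs are hypothetical, and tied scores are not given any special treatment.

```python
def pav(values):
    """Non-decreasing least-squares fit to `values` (already ordered by score)."""
    blocks = []  # each block holds [sum of values, number of values]
    for v in values:
        blocks.append([float(v), 1])
        # Pool adjacent blocks while the earlier block's mean exceeds the later one's.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
    fitted = []
    for total, count in blocks:
        fitted.extend([total / count] * count)
    return fitted

def calibrate(scored_pairs):
    """Turn (cosine_similarity, is_synonym) pairs into (cosine_similarity, probability)."""
    ordered = sorted(scored_pairs)
    probabilities = pav([label for _, label in ordered])
    return [(score, p) for (score, _), p in zip(ordered, probabilities)]

# Hypothetical annotated word pairs: cosine score and a 0/1 synonym label.
toy_pairs = [(0.9, 1), (0.2, 0), (0.5, 1), (0.6, 0), (0.8, 1)]
for score, probability in calibrate(toy_pairs):
    print(score, probability)
```

Because the fitted values are monotone in the score, a separate fit per embedding method gives each method's cosine scores a comparable probabilistic reading.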
Affiliation(s)
- Lana Yeganova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Qingyu Chen
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Grigory Balasanov
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- W John Wilbur
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland, USA
|
5
|
Massonnaud CR, Kerdelhué G, Grosjean J, Lelong R, Griffon N, Darmoni SJ. Identification of the Best Semantic Expansion to Query PubMed Through Automatic Performance Assessment of Four Search Strategies on All Medical Subject Heading Descriptors: Comparative Study. JMIR Med Inform 2020; 8:e12799. [PMID: 32496201 PMCID: PMC7303830 DOI: 10.2196/12799] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 11/12/2018] [Revised: 01/20/2020] [Accepted: 03/23/2020] [Indexed: 12/04/2022] Open
Abstract
Background With the continuous expansion of available biomedical data, efficient and effective information retrieval has become of utmost importance. Semantic expansion of queries using synonyms may improve information retrieval. Objective The aim of this study was to automatically construct and evaluate expanded PubMed queries of the form “preferred term”[MH] OR “preferred term”[TIAB] OR “synonym 1”[TIAB] OR “synonym 2”[TIAB] OR …, for each of the 28,313 Medical Subject Heading (MeSH) descriptors, by using different semantic expansion strategies. We sought to propose an innovative method that could automatically evaluate these strategies, based on the three main metrics used in information science (precision, recall, and F-measure). Methods Three semantic expansion strategies were assessed. They differed by the synonyms used to build the queries as follows: MeSH synonyms, Unified Medical Language System (UMLS) mappings, and custom mappings (Catalogue et Index des Sites Médicaux de langue Française [CISMeF]). The precision, recall, and F-measure metrics were automatically computed for the three strategies and for the standard automatic term mapping (ATM) of PubMed. The method to automatically compute the metrics involved computing the number of all relevant citations (A), using National Library of Medicine indexing as the gold standard (“preferred term”[MH]), the number of citations retrieved by the added terms (“synonym 1”[TIAB] OR “synonym 2”[TIAB] OR …) (B), and the number of relevant citations retrieved by the added terms (combining the previous two queries with an “AND” operator) (C). It was possible to programmatically compute the metrics for each strategy using each of the 28,313 MeSH descriptors as a “preferred term,” corresponding to 239,724 different queries built and sent to the PubMed application program interface. The four search strategies were ranked and compared for each metric.
Results ATM had the worst performance for all three metrics among the four strategies. The MeSH strategy had the best mean precision (51%, SD 23%). The UMLS strategy had the best recall and F-measure (41%, SD 31% and 36%, SD 24%, respectively). CISMeF had the second best recall and F-measure (40%, SD 31% and 35%, SD 24%, respectively). However, considering a cutoff of 5%, CISMeF had better precision than UMLS for 1180 descriptors, better recall for 793 descriptors, and better F-measure for 678 descriptors. Conclusions This study highlights the importance of using semantic expansion strategies to improve information retrieval. However, the performance of a given strategy relative to another varied greatly depending on the MeSH descriptor. These results confirm there is no ideal search strategy for all descriptors. Different semantic expansions should be used depending on the descriptor and the user’s objectives. Thus, we developed an interface that allows users to input a descriptor and then proposes the best semantic expansion to maximize the three main metrics (precision, recall, and F-measure).
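The query form and the three counts (A, B, C) described in the Methods map onto the usual set-based definitions: precision as C/B, recall as C/A, and F-measure as their harmonic mean. The sketch below assumes those standard definitions, since the abstract does not spell the formulas out; the function names and the toy counts are hypothetical, and no call to the PubMed API is made.

```python
def build_expanded_query(preferred_term, synonyms):
    """Assemble a query of the form "term"[MH] OR "term"[TIAB] OR "synonym"[TIAB] ..."""
    clauses = [f'"{preferred_term}"[MH]', f'"{preferred_term}"[TIAB]']
    clauses += [f'"{synonym}"[TIAB]' for synonym in synonyms]
    return " OR ".join(clauses)

def retrieval_metrics(a, b, c):
    """Precision, recall and F-measure from citation counts.

    a: all relevant citations (gold standard)
    b: citations retrieved by the added terms
    c: relevant citations retrieved by the added terms
    """
    precision = c / b if b else 0.0
    recall = c / a if a else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure

# Hypothetical descriptor, synonym and counts, for illustration only.
print(build_expanded_query("Asthma", ["Bronchial Asthma"]))
print(retrieval_metrics(a=100, b=50, c=25))
```

Guarding the divisions matters at this scale: with tens of thousands of descriptors, some added-term queries can be expected to retrieve nothing.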
Affiliation(s)
- Clément R Massonnaud
  - Department of Biomedical Informatics, Rouen University Hospital, Rouen, France
  - Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France
- Gaétan Kerdelhué
  - Department of Biomedical Informatics, Rouen University Hospital, Rouen, France
  - Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France
- Julien Grosjean
  - Department of Biomedical Informatics, Rouen University Hospital, Rouen, France
  - Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France
- Romain Lelong
  - Department of Biomedical Informatics, Rouen University Hospital, Rouen, France
  - Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France
- Nicolas Griffon
  - Department of Biomedical Informatics, Rouen University Hospital, Rouen, France
  - Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France
- Stefan J Darmoni
  - Department of Biomedical Informatics, Rouen University Hospital, Rouen, France
  - Laboratoire d'Informatique Médicale et d'Ingénierie des Connaissances en e-Santé, U1142, INSERM, Sorbonne Université, Paris, France
|
6
|
Leveraging synonymy and polysemy to improve semantic similarity assessments based on intrinsic information content. Artif Intell Rev 2020. [DOI: 10.1007/s10462-019-09725-4] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Indexed: 11/30/2022]
|
7
|
Zhu L, Zheng H. Biomedical event extraction with a novel combination strategy based on hybrid deep neural networks. BMC Bioinformatics 2020; 21:47. [PMID: 32028883 PMCID: PMC7006190 DOI: 10.1186/s12859-020-3376-2] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 07/13/2019] [Accepted: 01/20/2020] [Indexed: 11/10/2022] Open
Abstract
Background Biomedical event extraction is a fundamental and in-demand technology that has attracted substantial interest from many researchers. Previous works have relied heavily on manually designed features and external NLP packages, for which the feature engineering is large and complex. Additionally, most existing works use a pipeline process that breaks a task down into simple sub-tasks but ignores the interaction between them. To overcome these limitations, we propose a novel event combination strategy based on hybrid deep neural networks to solve the task in a joint end-to-end manner. Results We adapted our method to several annotated corpora of biomedical event extraction tasks. Our method achieved state-of-the-art performance, with a noticeable overall F1 score improvement over existing methods on all of these corpora. Conclusions The experimental results demonstrated that our method is effective for biomedical event extraction. The combination strategy can reconstruct complex events from the output of deep neural networks, while the deep neural networks effectively capture the feature representation from the raw text. The biomedical event extraction implementation is available online at http://www.predictor.xin/event_extraction.
Affiliation(s)
- Lvxing Zhu
  - School of Computer Science and Technology, University of Science and Technology of China, Huangshan Road, Hefei, 230026, People's Republic of China
- Haoran Zheng
  - School of Computer Science and Technology, University of Science and Technology of China, Huangshan Road, Hefei, 230026, People's Republic of China
  - Anhui Key Laboratory of Software Engineering in Computing and Communication, University of Science and Technology of China, Huangshan Road, Hefei, 230026, People's Republic of China
  - Anhui Province Key Lab. of Big Data Analysis and Application, University of Science and Technology of China, Huangshan Road, Hefei, 230026, People's Republic of China
|
8
|
Lashkari F, Bagheri E, Ghorbani AA. Neural embedding-based indices for semantic search. Inf Process Manag 2019. [DOI: 10.1016/j.ipm.2018.10.015] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.6] [Indexed: 10/27/2022]
|
9
|
Mura C, Draizen EJ, Bourne PE. Structural biology meets data science: does anything change? Curr Opin Struct Biol 2018; 52:95-102. [PMID: 30267935 DOI: 10.1016/j.sbi.2018.09.003] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Received: 07/24/2018] [Revised: 08/31/2018] [Accepted: 09/07/2018] [Indexed: 01/22/2023]
Abstract
Data science has emerged from the proliferation of digital data, coupled with advances in algorithms, software and hardware (e.g., GPU computing). Innovations in structural biology have been driven by similar factors, spurring us to ask: can these two fields impact one another in deep and hitherto unforeseen ways? We posit that the answer is yes. New biological knowledge lies in the relationships between sequence, structure, function and disease, all of which play out on the stage of evolution, and data science enables us to elucidate these relationships at scale. Here, we consider the above question from the five key pillars of data science: acquisition, engineering, analytics, visualization and policy, with an emphasis on machine learning as the premier analytics approach.
Affiliation(s)
- Cameron Mura
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Eli J Draizen
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
- Philip E Bourne
  - Department of Biomedical Engineering, University of Virginia, Charlottesville, VA 22908, USA
  - Data Science Institute, University of Virginia, Charlottesville, VA 22904, USA
|
10
|
Abstract
PubMed is a free search engine for biomedical literature accessed by millions of users from around the world each day. With the rapid growth of biomedical literature—about two articles are added every minute on average—finding and retrieving the most relevant papers for a given query is increasingly challenging. We present Best Match, a new relevance search algorithm for PubMed that leverages the intelligence of our users and cutting-edge machine-learning technology as an alternative to the traditional date sort order. The Best Match algorithm is trained with past user searches with dozens of relevance-ranking signals (factors), the most important being the past usage of an article, publication date, relevance score, and type of article. This new algorithm demonstrates state-of-the-art retrieval performance in benchmarking experiments as well as an improved user experience in real-world testing (over 20% increase in user click-through rate). Since its deployment in June 2017, we have observed a significant increase (60%) in PubMed searches with relevance sort order: it now assists millions of PubMed searches each week. In this work, we hope to increase the awareness and transparency of this new relevance sort option for PubMed users, enabling them to retrieve information more effectively.
|
11
|
Kim S, Yeganova L, Comeau DC, Wilbur WJ, Lu Z. PubMed Phrases, an open set of coherent phrases for searching biomedical literature. Sci Data 2018; 5:180104. [PMID: 29893755 PMCID: PMC5996850 DOI: 10.1038/sdata.2018.104] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.0] [Received: 09/22/2017] [Accepted: 04/06/2018] [Indexed: 11/09/2022] Open
Abstract
In biomedicine, key concepts are often expressed by multiple words (e.g., ‘zinc finger protein’). Previous work has shown that treating a sequence of words as a meaningful unit, where applicable, is not only important for human understanding but also beneficial for automatic information seeking. Here we present a collection of PubMed® Phrases that are beneficial for information retrieval and human comprehension. We define these phrases as coherent chunks that are logically connected. To collect the phrase set, we apply the hypergeometric test to detect segments of consecutive terms that are likely to appear together in PubMed. These text segments are then filtered using the BM25 ranking function to ensure that they are beneficial from an information retrieval perspective. Thus, we obtain a set of 705,915 PubMed Phrases. We evaluate the quality of the set by investigating PubMed user click data and manually annotating a sample of 500 randomly selected noun phrases. We also analyze and discuss the usage of these PubMed Phrases in literature search.
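The hypergeometric test mentioned in this abstract can be illustrated with a small co-occurrence example. This is a generic sketch, not the authors' procedure: the abstract does not state how counts are defined, so the document-level framing (how many documents contain each term and both terms), the function name and the toy counts are all assumptions. The BM25 filtering step is not shown.

```python
from math import comb

def hypergeom_tail(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws).

    Read here as: of `population` documents, `successes` contain term 1 and
    `draws` contain term 2; k is the number containing both.
    """
    upper = min(successes, draws)
    favourable = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return favourable / comb(population, draws)

# Hypothetical counts: 10,000 documents, 120 with term 1, 150 with term 2, 40 with both.
p_value = hypergeom_tail(k=40, population=10_000, successes=120, draws=150)
print(p_value < 0.001)
```

A very small tail probability indicates the two terms co-occur far more often than independence would predict, which is the signal a phrase detector of this kind would look for.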
Affiliation(s)
- Sun Kim
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Lana Yeganova
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Donald C Comeau
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- W John Wilbur
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
- Zhiyong Lu
  - National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, Maryland 20894, USA
|