1
Yeganova L, Kim W, Tian S, Comeau DC, Wilbur WJ, Lu Z. LitSense 2.0: AI-powered biomedical information retrieval with sentence and passage level knowledge discovery. Nucleic Acids Res 2025:gkaf417. [PMID: 40377097 DOI: 10.1093/nar/gkaf417] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/26/2025] [Revised: 04/24/2025] [Accepted: 05/02/2025] [Indexed: 05/18/2025] [Open Access]
Abstract
LitSense 2.0 (https://www.ncbi.nlm.nih.gov/research/litsense2/) is an advanced biomedical search system enhanced with dense vector semantic retrieval, designed for accessing literature on sentence and paragraph levels. It provides unified access to 38 million PubMed abstracts and 6.6 million full-length articles in the PubMed Central (PMC) Open Access subset, encompassing 1.4 billion sentences and ∼300 million paragraphs, and is updated weekly. Compared to PubMed and PMC, the primary platforms for biomedical information search, LitSense offers cross-platform functionality by searching seamlessly across both PubMed and PMC and returning relevant results at a more granular level. Building on the success of the original LitSense launched in 2018, LitSense 2.0 introduces two major enhancements. The first is the addition of paragraph-level search: users can now choose to search either against sentences or against paragraphs. The second is improved retrieval accuracy via a state-of-the-art biomedical text encoder, ensuring more reliable identification of relevant results across the entire biomedical literature.
Affiliation(s)
- Lana Yeganova
- Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Won Kim
- Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Shubo Tian
- Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Donald C Comeau
- Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- W John Wilbur
- Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
- Zhiyong Lu
- Division of Intramural Research (DIR), National Library of Medicine (NLM), National Institutes of Health (NIH), MD 20894 Bethesda, United States
2
Zhang L, Lu W, Chen H, Huang Y, Cheng Q. A comparative evaluation of biomedical similar article recommendation. J Biomed Inform 2022; 131:104106. [PMID: 35661818 DOI: 10.1016/j.jbi.2022.104106] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 12/10/2021] [Revised: 05/27/2022] [Accepted: 05/28/2022] [Indexed: 11/28/2022]
Abstract
BACKGROUND: Biomedical sciences, with their focus on human health and disease, have attracted unprecedented attention in the 21st century. The proliferation of biomedical sciences has also led to a large number of scientific articles being produced, which makes it difficult for biomedical researchers to find relevant articles and hinders the dissemination of valuable discoveries. To bridge this gap, the research community has initiated the article recommendation task, with the aim of recommending articles to biomedical researchers automatically based on their research interests. Over the past two decades, many recommendation methods have been developed. However, an algorithm-level comparison and rigorous evaluation of the most important methods on a shared dataset is still lacking.
METHOD: In this study, we first investigate 15 methods for automated article recommendation in the biomedical domain. We then conduct an empirical evaluation of the 15 methods, including six term-based methods, two word embedding methods, three sentence embedding methods, two document embedding methods, and two BERT-based methods. These methods are evaluated in two scenarios: article-oriented recommenders and user-oriented recommenders, with two publicly available datasets: TREC 2005 Genomics and RELISH, respectively.
RESULTS: Our experimental results show that the text representation models BERT and BioSenVec outperform many existing recommendation methods (e.g., BM25, PMRA, XPRC) and web-based recommendation systems (e.g., MScanner, MedlineRanker, BioReader) on both datasets regarding most of the evaluation metrics, and fine-tuning can improve the performance of the BERT-based methods.
CONCLUSIONS: Our comparison study is useful for researchers and practitioners in selecting the best modeling strategies for building article recommendation systems in the biomedical domain. The code and datasets are publicly available.
Affiliation(s)
- Li Zhang
- School of Information Management, Wuhan University, Wuhan 430074, Hubei Province, China.
- Wei Lu
- School of Information Management, Wuhan University, Wuhan 430074, Hubei Province, China.
- Haihua Chen
- Department of Information Science, University of North Texas, Denton 76203, TX, USA.
- Yong Huang
- School of Information Management, Wuhan University, Wuhan 430074, Hubei Province, China.
- Qikai Cheng
- School of Information Management, Wuhan University, Wuhan 430074, Hubei Province, China.
3
Azeem F, Zameer R, Rehman Rashid MA, Rasul I, Ul-Allah S, Siddique MH, Fiaz S, Raza A, Younas A, Rasool A, Ali MA, Anwar S, Siddiqui MH. Genome-wide analysis of potassium transport genes in Gossypium raimondii suggest a role of GrHAK/KUP/KT8, GrAKT2.1 and GrAKT1.1 in response to abiotic stress. Plant Physiology and Biochemistry: PPB 2022; 170:110-122. [PMID: 34864561 DOI: 10.1016/j.plaphy.2021.11.038] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Received: 10/15/2021] [Revised: 11/22/2021] [Accepted: 11/23/2021] [Indexed: 06/13/2023]
Abstract
Potassium (K+) is an important macronutrient for plants, comprising almost 10% of a plant's dry mass. It plays a crucial role in plant growth as well as in other important processes related to metabolism and stress tolerance. Plants have a complex and well-organized potassium distribution system (channels and transporters). Cotton is the most important economic crop and the primary source of natural fiber. Soil deficiency in K+ can negatively affect the yield and fiber quality of cotton. However, the potassium transport system in cotton is poorly studied. The current study identified 43 Potassium Transport System (PTS) genes in the Gossypium raimondii genome. Based on conserved domains, transmembrane domains, and motif structures, these genes were classified as K+ transporters (2 HKTs, 7 KEAs, and 16 KUP/HAK/KTs) and K+ channels (11 Shakers and 7 TPKs/KCO). The phylogenetic comparison of GrPTS genes with those from Arabidopsis thaliana, Glycine max, Oryza sativa, Medicago truncatula and Cicer arietinum revealed variations in PTS gene conservation. Evolutionary analysis predicted that most GrPTS genes were segmentally duplicated. Gene structure analysis showed that the intron/exon organization of these genes was conserved within each family. Chromosomal localization demonstrated a random distribution of PTS genes across all thirteen chromosomes except chromosome six. Many stress-responsive cis-regulatory elements were predicted in the promoter regions of GrPTS genes. The RNA-seq data analysis followed by qRT-PCR validation demonstrated that PTS genes potentially work in groups against environmental factors. Moreover, a transporter gene (GrHAK/KUP/KT8) and two channel genes (GrAKT2.1 and GrAKT1.1) are important candidate genes for plant stress response. These results provide useful information for further functional characterization of PTS genes with the aim of breeding stress-resistant cultivars.
Affiliation(s)
- Farrukh Azeem
- Department of Bioinformatics and Biotechnology, Govt. College University, Faisalabad, Pakistan
- Roshan Zameer
- Department of Bioinformatics and Biotechnology, Govt. College University, Faisalabad, Pakistan
- Ijaz Rasul
- Department of Bioinformatics and Biotechnology, Govt. College University, Faisalabad, Pakistan
- Sami Ul-Allah
- College of Agriculture, Bahauddin Zakariya University, Bahadur Sub-Campus, Layyah, Pakistan
- Sajid Fiaz
- Department of Plant Breeding and Genetics, The University of Haripur, 22620, Haripur, Pakistan.
- Ali Raza
- Fujian Provincial Key Laboratory of Crop Molecular and Cell Biology, Oil Crops Research Institute, Center of Legume Crop Genetics and Systems Biology/College of Agriculture, Fujian Agriculture and Forestry University (FAFU), Fuzhou, Fujian, 350002, China
- Afifa Younas
- Department of Botany, Lahore College for Women University, Lahore, Pakistan
- Asima Rasool
- Department of Bioinformatics and Biotechnology, Govt. College University, Faisalabad, Pakistan
- Muhammad Amjad Ali
- Department of Plant Pathology, University of Agriculture, Faisalabad, Pakistan
- Sultana Anwar
- Department of Agronomy, University of Florida, Gainesville, USA
- Manzer H Siddiqui
- Department of Botany and Microbiology, College of Science, King Saud University, Riyadh, Saudi Arabia
4
Islamaj R, Wei CH, Cissel D, Miliaras N, Printseva O, Rodionov O, Sekiya K, Ward J, Lu Z. NLM-Gene, a richly annotated gold standard dataset for gene entities that addresses ambiguity and multi-species gene recognition. J Biomed Inform 2021; 118:103779. [PMID: 33839304 PMCID: PMC11037554 DOI: 10.1016/j.jbi.2021.103779] [Citation(s) in RCA: 15] [Impact Index Per Article: 3.8] [Received: 01/04/2021] [Revised: 03/14/2021] [Accepted: 04/05/2021] [Indexed: 10/21/2022]
Abstract
The automatic recognition of gene names and their corresponding database identifiers in biomedical text is an important first step for many downstream text-mining applications. While current methods for tagging gene entities have been developed for biomedical literature, their performance on species other than human is substantially lower due to the lack of annotation data. We therefore present the NLM-Gene corpus, a high-quality manually annotated corpus for genes developed at the US National Library of Medicine (NLM), covering ambiguous gene names, with an average of 29 gene mentions (10 unique identifiers) per document, and a broader representation of different species (including Homo sapiens, Mus musculus, Rattus norvegicus, Drosophila melanogaster, Arabidopsis thaliana, Danio rerio, etc.) when compared to previous gene annotation corpora. NLM-Gene consists of 550 PubMed abstracts from 156 biomedical journals, doubly annotated by six experienced NLM indexers, randomly paired for each document to control for bias. The annotators worked in three annotation rounds until they reached complete agreement. This gold-standard corpus can serve as a benchmark to develop & test new gene text mining algorithms. Using this new resource, we have developed a new gene finding algorithm based on deep learning which improved both on precision and recall from existing tools. The NLM-Gene annotated corpus is freely available at ftp://ftp.ncbi.nlm.nih.gov/pub/lu/NLMGene. We have also applied this tool to the entire PubMed/PMC with their results freely accessible through our web-based tool PubTator (www.ncbi.nlm.nih.gov/research/pubtator).
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Nicholas Miliaras
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Olga Printseva
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Oleg Rodionov
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, USA.
5
Islamaj R, Leaman R, Kim S, Kwon D, Wei CH, Comeau DC, Peng Y, Cissel D, Coss C, Fisher C, Guzman R, Kochar PG, Koppel S, Trinh D, Sekiya K, Ward J, Whitman D, Schmidt S, Lu Z. NLM-Chem, a new resource for chemical entity recognition in PubMed full text literature. Sci Data 2021; 8:91. [PMID: 33767203 PMCID: PMC7994842 DOI: 10.1038/s41597-021-00875-1] [Citation(s) in RCA: 32] [Impact Index Per Article: 8.0] [Received: 11/26/2020] [Accepted: 01/19/2021] [Indexed: 11/13/2022] [Open Access]
Abstract
Automatically identifying chemical and drug names in scientific publications advances information access for this important class of entities in a variety of biomedical disciplines by enabling improved retrieval and linkage to related concepts. While current methods for tagging chemical entities were developed for the article title and abstract, their performance in the full article text is substantially lower. However, the full text frequently contains more detailed chemical information, such as the properties of chemical compounds, their biological effects and interactions with diseases, genes and other chemicals. We therefore present the NLM-Chem corpus, a full-text resource to support the development and evaluation of automated chemical entity taggers. The NLM-Chem corpus consists of 150 full-text articles, doubly annotated by ten expert NLM indexers, with ~5000 unique chemical name annotations, mapped to ~2000 MeSH identifiers. We also describe a substantially improved chemical entity tagger, with automated annotations for all of PubMed and PMC freely accessible through the PubTator web-based interface and API. The NLM-Chem corpus is freely available.
Affiliation(s)
- Rezarta Islamaj
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Robert Leaman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Sun Kim
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Dongseop Kwon
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Chih-Hsuan Wei
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Donald C Comeau
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Yifan Peng
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- David Cissel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Cathleen Coss
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Carol Fisher
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Rob Guzman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Preeti Gokal Kochar
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Stella Koppel
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Dorothy Trinh
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Keiko Sekiya
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Janice Ward
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Deborah Whitman
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Susan Schmidt
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA
- Zhiyong Lu
- National Library of Medicine, National Institutes of Health, Bethesda, MD, 20894, USA.