1. Hong G, Hindle V, Veasley NM, Holscher HD, Kilicoglu H. DiMB-RE: mining the scientific literature for diet-microbiome associations. J Am Med Inform Assoc 2025;32:998-1006. [PMID: 40152137] [PMCID: PMC12089768] [DOI: 10.1093/jamia/ocaf054]
Abstract
OBJECTIVES To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies. MATERIALS AND METHODS We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (eg, Nutrient, Microorganism) and 13 relation types (eg, increases, improves) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked 2 generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings. RESULTS DiMB-RE consists of 14 450 entities and 4206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models. DISCUSSION To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. Natural language processing models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors. CONCLUSION DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
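As an illustration of the kind of model fine-tuning described above, the following is a minimal sketch of training a transformer token-classification model for entity recognition on BIO-tagged sentences. The BioBERT checkpoint, the toy sentence, and the label set are illustrative assumptions; this is not the authors' DiMB-RE pipeline or its 15-type entity schema.

```python
# Minimal sketch: fine-tuning a pretrained transformer for named entity recognition
# on BIO-tagged sentences. The checkpoint, labels, and single training sentence are
# illustrative only (not the DiMB-RE corpus or the authors' exact configuration).
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

labels = ["O", "B-Nutrient", "I-Nutrient", "B-Microorganism", "I-Microorganism"]
label2id = {l: i for i, l in enumerate(labels)}

train = Dataset.from_dict({
    "tokens": [["Inulin", "intake", "increased", "Bifidobacterium", "abundance", "."]],
    "ner_tags": [[1, 0, 0, 3, 0, 0]],
})

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")

def encode(batch):
    # Align word-level BIO tags with subword tokens; special tokens get label -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        [-100 if w is None else tags[w] for w in enc.word_ids(batch_index=i)]
        for i, tags in enumerate(batch["ner_tags"])
    ]
    return enc

train = train.map(encode, batched=True, remove_columns=["tokens", "ner_tags"])

model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.2",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner_out", num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```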
Affiliations
- Gibong Hong: School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States
- Veronica Hindle: Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Nadine M Veasley: Division of Nutritional Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Hannah D Holscher: Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Division of Nutritional Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Personalized Nutrition Initiative, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Halil Kilicoglu: School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States; Division of Nutritional Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Personalized Nutrition Initiative, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
2. He T, Kreimeyer K, Najjar M, Spiker J, Fatteh M, Anagnostou V, Botsis T. Artificial Intelligence-assisted Biomedical Literature Knowledge Synthesis to Support Decision-making in Precision Oncology. AMIA Annu Symp Proc 2025;2024:513-522. [PMID: 40417512] [PMCID: PMC12099343]
Abstract
The delivery of effective targeted therapies requires comprehensive analyses of the molecular profiling of tumors and matching with clinical phenotypes in the context of existing knowledge described in biomedical literature, registries, and knowledge bases. We evaluated the performance of natural language processing (NLP) approaches in supporting knowledge retrieval and synthesis from the biomedical literature. We tested PubTator 3.0, Bidirectional Encoder Representations from Transformers (BERT), and Large Language Models (LLMs) and evaluated their ability to support named entity recognition (NER) and relation extraction (RE) from biomedical texts. PubTator 3.0 and the BioBERT model performed best in the NER task (best F1-scores of 0.93 and 0.89, respectively), while BioBERT outperformed all other solutions in the RE task (best F1-score 0.79) and, in a specific use case to which it was applied, recognized nearly all entity mentions and most of the relations. Our findings support the use of AI-assisted approaches in facilitating precision oncology decision-making.
Affiliations
- Ting He: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
- Kory Kreimeyer: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
- Mimi Najjar: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; The Johns Hopkins Molecular Tumor Board, Johns Hopkins School of Medicine, Baltimore, MD
- Jonathan Spiker: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
- Maria Fatteh: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; The Johns Hopkins Molecular Tumor Board, Johns Hopkins School of Medicine, Baltimore, MD
- Valsamo Anagnostou: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; The Johns Hopkins Molecular Tumor Board, Johns Hopkins School of Medicine, Baltimore, MD
- Taxiarchis Botsis: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
3. Li W, Wang H, Li W, Zhao J, Sun Y. Generation-Based Few-Shot BioNER via Local Knowledge Index and Dual Prompts. Interdiscip Sci 2025. [PMID: 40347393] [DOI: 10.1007/s12539-025-00709-3]
Abstract
Few-shot Biomedical Named Entity Recognition (BioNER) presents significant challenges due to limited training data and the presence of nested and discontinuous entities. To tackle these issues, a novel approach GKP-BioNER, Generation-based Few-Shot BioNER via Local Knowledge Index and Dual Prompts, is proposed. It redefines BioNER as a generation task by integrating hard and soft prompts. Specifically, GKP-BioNER constructs a localized knowledge index using a Wikipedia dump, facilitating the retrieval of semantically relevant texts to the original sentence. These texts are then reordered to prioritize the most semantically relevant content to the input data, serving as hard prompts. This helps the model to address challenges demanding domain-specific insights. Simultaneously, GKP-BioNER preserves the integrity of the pre-trained models while introducing learnable parameters as soft prompts to guide the self-attention layer, allowing the model to adapt to the context. Moreover, a soft prompt mechanism is designed to support knowledge transfer across domains. Extensive experiments on five datasets demonstrate that GKP-BioNER significantly outperforms eight state-of-the-art methods. It shows robust performance in low-resource and complex scenarios across various domains, highlighting its strength in knowledge transfer and broad applicability.
Affiliations
- Weixin Li: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Hong Wang: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Wei Li: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Jun Zhao: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Yanshen Sun: Department of Computer Science, Virginia Tech, Blacksburg, 24061, USA
4. Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer MB, Ai X, Lai PT, Wang Z, Keloth VK, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng WJ, Adelman RA, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun 2025;16:3280. [PMID: 40188094] [PMCID: PMC11972378] [DOI: 10.1038/s41467-025-56989-2]
Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
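To make the zero-shot setting concrete, here is a minimal sketch of prompting a general-purpose LLM for a BioNLP-style extraction task with the OpenAI Python client, for contrast with task-specific fine-tuning. The model name, prompt wording, and output format are assumptions for illustration, not the benchmark's actual protocol.

```python
# Minimal sketch: zero-shot entity extraction with a general-purpose LLM.
# The model name, prompt, and output format are illustrative assumptions,
# not the paper's exact evaluation protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sentence = "Inulin supplementation increased the relative abundance of Bifidobacterium."

prompt = (
    "Extract all chemical and organism entities from the sentence below. "
    'Return a JSON list of {"text": ..., "type": ...} objects.\n\n'
    f"Sentence: {sentence}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```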
Affiliations
- Qingyu Chen: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA; National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Yan Hu: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
- Xueqing Peng: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Qianqian Xie: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Qiao Jin: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Aidan Gilson: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Maxwell B Singer: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Xuguang Ai: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Po-Ting Lai: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Zhizheng Wang: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Vipina K Keloth: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Kalpana Raja: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Jimin Huang: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Huan He: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Fongci Lin: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Jingcheng Du: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
- Rui Zhang: Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota, Minneapolis, MN, USA; Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
- W Jim Zheng: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
- Ron A Adelman: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Zhiyong Lu: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Hua Xu: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
5. Zhao D, Mu W, Jia X, Liu S, Chu Y, Meng J, Lin H. Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction. BioData Min 2025;18:28. [PMID: 40181396] [PMCID: PMC11969866] [DOI: 10.1186/s13040-025-00443-y]
Abstract
Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, replicating the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can potentially distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales. To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and enhance feature representation based on PubMedBERT. We evaluated the experiments on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE), and the results exceeded the current state-of-the-art models in most few-shot scenarios, including mainstream large language models like ChatGPT. The results confirm the effectiveness of the proposed method in data augmentation and model generalization.
Affiliations
- Di Zhao: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China; Postdoctoral Workstation of Dalian Yongia Electronic Technology Co., Ltd, Dalian, 116024, Liaoning, China
- Wenxuan Mu: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Xiangxing Jia: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Shuang Liu: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Yonghe Chu: Nantong University, Nantong, 226019, Jiangsu, China
- Jiana Meng: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Hongfei Lin: School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
6. Zhang L, Zhong Y, Zheng Q, Liu J, Wang Q, Wang J, Chang X. TDGI: Translation-Guided Double-Graph Inference for Document-Level Relation Extraction. IEEE Trans Pattern Anal Mach Intell 2025;47:2647-2659. [PMID: 40030986] [DOI: 10.1109/tpami.2025.3528246]
Abstract
Document-level relation extraction (DocRE) aims at predicting relations of all entity pairs in one document, which plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct heterogeneous graphs with entities, mentions, and sentences as nodes, and co-occurrence and co-reference relations as edges. Their performance is difficult to improve further because the semantics and direction of the relation are not jointly considered in the graph inference process. To this end, we propose a novel translation-guided double-graph inference network named TDGI for DocRE. On one hand, TDGI includes two relation semantics-aware and direction-aware reasoning graphs, i.e., a mention graph and an entity graph, to mine relations among long-distance entities more explicitly. Each graph consists of three elements: vectorized nodes, edges, and direction weights. On the other hand, we devise a translation-based graph updating strategy that guides the embeddings of mention/entity nodes, relation edges, and direction weights to follow a specific translation algebraic structure, thereby enhancing the reasoning ability of TDGI. In the training procedure of TDGI, we minimize the relation multi-classification loss and triple contrastive loss together to guarantee the model's stability and robustness. Comprehensive experiments on three widely used datasets show that TDGI achieves outstanding performance compared with state-of-the-art baselines.
7. Ong JCL, Chen MH, Ng N, Elangovan K, Tan NYT, Jin L, Xie Q, Ting DSW, Rodriguez-Monguio R, Bates DW, Liu N. A scoping review on generative AI and large language models in mitigating medication related harm. NPJ Digit Med 2025;8:182. [PMID: 40155703] [PMCID: PMC11953325] [DOI: 10.1038/s41746-025-01565-7]
Abstract
Medication-related harm has a significant impact on global healthcare costs and patient outcomes. Generative artificial intelligence (GenAI) and large language models (LLM) have emerged as a promising tool in mitigating risks of medication-related harm. This review evaluates the scope and effectiveness of GenAI and LLM in reducing medication-related harm. We screened 4 databases for literature published from 1st January 2012 to 15th October 2024. A total of 3988 articles were identified, and 30 met the criteria for inclusion into the final review. Generative AI and LLMs were applied in three key applications: drug-drug interaction identification and prediction, clinical decision support, and pharmacovigilance. While the performance and utility of these models varied, they generally showed promise in early identification, classification of adverse drug events, and supporting decision-making for medication management. However, no studies tested these models prospectively, suggesting a need for further investigation into integration and real-world application.
Affiliations
- Jasmine Chiat Ling Ong: Division of Pharmacy, Singapore General Hospital, Singapore, Singapore; Department of Pharmacy, University of California, San Francisco, CA, USA; Duke-NUS Medical School, Singapore, Singapore
- Ning Ng: Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore
- Kabilan Elangovan: Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Liyuan Jin: Duke-NUS Medical School, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Qihuang Xie: School of Pharmacy, National University of Singapore, Singapore, Singapore
- Daniel Shu Wei Ting: Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore; Byers Eye Institute, Stanford University, California, CA, USA
- Rosa Rodriguez-Monguio: Department of Clinical Pharmacy, School of Pharmacy, University of California, San Francisco, CA, USA; Medication Outcomes Center, University of California, San Francisco, CA, USA
- David W Bates: Harvard T.H. Chan School of Public Health, Boston, MA, USA; Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Nan Liu: Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; NUS AI Institute, National University of Singapore, Singapore, Singapore
8. Wang L, Hao H, Yan X, Zhou TH, Ryu KH. From biomedical knowledge graph construction to semantic querying: a comprehensive approach. Sci Rep 2025;15:8523. [PMID: 40074859] [PMCID: PMC11904217] [DOI: 10.1038/s41598-025-93334-5]
Abstract
In the biomedical field, the construction and application of knowledge graphs are becoming increasingly important because they can effectively integrate and manage large amounts of complex medical information. This study provides a whole-process approach for the biomedical field, from constructing knowledge graphs to semantic query based on knowledge graphs. In the knowledge graph construction stage, we propose the BioPLBC model, which incorporates BioBERT context-embedded features, part of speech and lexical morphological features to achieve entity annotation of medical texts. Based on the constructed biomedical knowledge graph, we also propose the Adaptive Locating and Expanding Query (ALEQ) algorithm, which improves the query speed by locating and dynamically expanding the query subregion. The experimental results indicate that the BioPLBC model consistently achieves higher accuracy than the baseline model across all datasets, while the ALEQ algorithm achieves different degrees of improvement in query accuracy and speed.
Affiliations
- Ling Wang: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Haoyu Hao: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Xue Yan: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Tie Hua Zhou: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Keun Ho Ryu: Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, 700000, Vietnam; Research Institute, Bigsun System Co., Ltd., Seoul, 06266, Republic of Korea; Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju, 28644, Republic of Korea
9. Ramos MC, Collison CJ, White AD. A review of large language models and autonomous agents in chemistry. Chem Sci 2025;16:2514-2572. [PMID: 39829984] [PMCID: PMC11739813] [DOI: 10.1039/d4sc03921a]
Abstract
Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning. As agents are an emerging topic, we extend the scope of our review beyond chemistry and discuss agent work across other scientific domains. This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry. Key challenges include data quality and integration, model interpretability, and the need for standard benchmarks, while future directions point towards more sophisticated multi-modal agents and enhanced collaboration between agents and experimental methods. Due to the quick pace of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science.
Affiliations
- Mayk Caldas Ramos: FutureHouse Inc., San Francisco, CA, USA; Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
- Christopher J Collison: School of Chemistry and Materials Science, Rochester Institute of Technology, Rochester, NY, USA
- Andrew D White: FutureHouse Inc., San Francisco, CA, USA; Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
10. Ding L, Colavizza G, Zhang Z. Partial Annotation Learning for Biomedical Entity Recognition. IEEE J Biomed Health Inform 2025;29:1409-1418. [PMID: 39312441] [DOI: 10.1109/jbhi.2024.3466294]
Abstract
Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. To conquer this issue, we undertake a systematic exploration of the efficacy of partial annotation learning methods for BioNER, which encompasses a comprehensive evaluation conducted across a spectrum of distinct simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We standardize a compilation of 16 BioNER corpora, encompassing a range of five distinct entity types, to establish a gold standard. And we compare against the state-of-the-art partial annotation model EER-PubMedBERT, the widely acknowledged partial annotation model BiLSTM-Partial-CRF model, and the state-of-the-art full annotation learning BioNER model PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms alternatives and, specifically, the PubMedBERT tagger by 38% in F1-score under high missing entity rates. Moreover, the recall of entity mentions in our model demonstrates a competitive alignment with the upper threshold observed on the fully annotated dataset.
11. Borchert F, Llorca I, Roller R, Arnrich B, Schapranow MP. xMEN: a modular toolkit for cross-lingual medical entity normalization. JAMIA Open 2025;8:ooae147. [PMID: 39735785] [PMCID: PMC11671143] [DOI: 10.1093/jamiaopen/ooae147]
Abstract
Objective To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English. Materials and Methods We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language. Results xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task. Discussion We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future. Conclusion xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.
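The candidate-generation step described above can be pictured with a small sketch: rank terminology aliases by character n-gram TF-IDF similarity to a mention, then hand the top candidates to a trainable re-ranker. This is a generic illustration of the idea only, not xMEN's actual implementation; the concept IDs and aliases are made-up examples.

```python
# Generic sketch of dictionary candidate generation for entity normalization:
# score terminology aliases against a mention with character n-gram TF-IDF.
# Not xMEN's implementation; concept IDs and aliases are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

aliases = {
    "C0011849": "diabetes mellitus",
    "C0020538": "hypertensive disease",
    "C0004096": "asthma",
}

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
alias_matrix = vectorizer.fit_transform(list(aliases.values()))

def generate_candidates(mention, k=2):
    """Return the k concept IDs whose aliases are most similar to the mention."""
    scores = cosine_similarity(vectorizer.transform([mention]), alias_matrix)[0]
    ranked = sorted(zip(aliases.keys(), scores), key=lambda pair: -pair[1])
    return ranked[:k]

# A trainable cross-encoder would then re-rank these candidates.
print(generate_candidates("type 2 diabetes"))
```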
Affiliations
- Florian Borchert: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Ignacio Llorca: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Roland Roller: Speech and Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Berlin 10559, Germany
- Bert Arnrich: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Matthieu-P Schapranow: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
12. Bhushan RC, Donthi RK, Chilukuri Y, Srinivasarao U, Swetha P. Biomedical named entity recognition using improved green anaconda-assisted Bi-GRU-based hierarchical ResNet model. BMC Bioinformatics 2025;26:34. [PMID: 39885428] [PMCID: PMC11780922] [DOI: 10.1186/s12859-024-06008-w]
Abstract
BACKGROUND Biomedical text mining is a technique that extracts essential information from scientific articles using named entity recognition (NER). Traditional NER methods rely on dictionaries, rules, or curated corpora, which may not always be accessible. To overcome these challenges, deep learning (DL) methods have emerged. However, DL-based NER methods may struggle to identify long-distance relationships within text and require significant annotated datasets. RESULTS This research proposes a novel model to address these challenges in natural language processing: the Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM). The IGa-BiHR BNERM has shown promising results in accurately identifying named entities. The MACCROBAT dataset was obtained from Kaggle and underwent several pre-processing steps, such as stop-word filtering, WordNet processing, removal of non-alphanumeric characters, stemming, segmentation, and tokenization, which standardized the text and improved its quality. The pre-processed text was fed into a feature extraction model, the Robustly Optimized BERT Whole Word Masking model, which provides word embeddings with semantic information. The BNER process then utilized the IGa-BiHR BNERM. CONCLUSION To improve the training phase of the IGa-BiHR BNERM, the Improved Green Anaconda Optimization technique was used to select optimal weight parameter coefficients for training the model parameters. When tested on the MACCROBAT dataset, the model outperformed previous models with an accuracy of 99.11%. This model effectively and accurately identifies biomedical names within text, significantly advancing this field.
Affiliations
- Rakesh Kumar Donthi: Department of CSE, GITAM (Deemed to be) University Hyderabad, Rudraram, India
- Yojitha Chilukuri: St. Jude Children's Cancer Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA
- Polisetty Swetha: Department of Information Technology, Vardhaman College of Engineering, Shamshabad, Hyderabad, India
13. Azam M, Chen Y, Arowolo MO, Liu H, Popescu M, Xu D. A comprehensive evaluation of large language models in mining gene relations and pathway knowledge. Quant Biol 2024;12:360-374. [PMID: 39364206] [PMCID: PMC11446478] [DOI: 10.1002/qub2.57]
Abstract
Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a biological knowledge graph. This study assesses 21 LLMs, including both application programming interface (API)-based models and open-source models, in their capacity to retrieve biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API-based models GPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; among them, Falcon-180b and llama2-7b had the highest F1 scores of 0.2787 and 0.1923 for gene regulatory relations, respectively. The KEGG pathway recognition had a Jaccard similarity index of 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our study suggests that LLMs are informative in gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh-aza).
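A minimal sketch of the two evaluation measures used above may help: micro-F1 over predicted gene regulatory relations and Jaccard similarity between predicted and reference pathway gene sets. The gene names and labels are invented examples, not data from the study.

```python
# Minimal sketch of the two evaluation measures mentioned above: micro-F1 over
# predicted gene regulatory relations and Jaccard similarity between predicted
# and reference KEGG pathway gene sets. Labels and gene sets are invented examples.
from sklearn.metrics import f1_score

# Regulatory relation labels for a handful of gene pairs (gold vs. model output).
gold_relations = ["activation", "inhibition", "phosphorylation", "activation"]
pred_relations = ["activation", "activation", "phosphorylation", "activation"]
print("micro-F1:", f1_score(gold_relations, pred_relations, average="micro"))

# Jaccard similarity between reference and predicted pathway members.
reference_pathway = {"EGFR", "KRAS", "BRAF", "MAP2K1", "MAPK1"}
predicted_pathway = {"EGFR", "KRAS", "BRAF", "TP53"}
jaccard = len(reference_pathway & predicted_pathway) / len(reference_pathway | predicted_pathway)
print("Jaccard index:", round(jaccard, 4))
```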
Affiliations
- Muhammad Azam: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Yibo Chen: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
- Micheal Olaolu Arowolo: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Haowang Liu: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Mihail Popescu: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA; Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, Missouri, USA
- Dong Xu: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
14. Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024;23:1339-1347. [PMID: 38585647] [PMCID: PMC10995799] [DOI: 10.1016/j.csbj.2024.03.021]
Abstract
Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning-based framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, the MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, significantly improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed the term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@ 1 score of 94.5% across five datasets. The MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on the MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9129 entities and 88,482 triplets. Lastly, the MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine out of the top ten predicted drugs have been previously reported as effective for gastric cancer treatment. Finally, an online platform was created for exploration and visualization of MKG-GC at https://www.yanglab-mi.org.cn/MKG-GC/.
Affiliations
- Yang Yang: Computing Science and Artificial Intelligence College, Suzhou City University, Suzhou 215004, China; School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Yuwei Lu: School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Zixuan Zheng: School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Hao Wu: Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Yuxin Lin: Center for Systems Biology, Soochow University, Suzhou 215123, China; Department of Urology, the First Affiliated Hospital of Soochow University, Suzhou 215000, China
- Fuliang Qian: Center for Systems Biology, Soochow University, Suzhou 215123, China; Medical Center of Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
- Wenying Yan: Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China; Center for Systems Biology, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
15. Yang Y, Zheng Z, Xu Y, Wei H, Yan W. BioGSF: a graph-driven semantic feature integration framework for biomedical relation extraction. Brief Bioinform 2024;26:bbaf025. [PMID: 39853110] [PMCID: PMC11759886] [DOI: 10.1093/bib/bbaf025]
Abstract
The automatic and accurate extraction of diverse biomedical relations from literature constitutes a core element of medical knowledge graphs, which are indispensable for healthcare artificial intelligence. Currently, fine-tuning through stacking various neural networks on pre-trained language models (PLMs) represents a common framework for end-to-end resolution of the biomedical relation extraction (RE) problem. Nevertheless, sequence-based PLMs, to a certain extent, fail to fully exploit the connections between semantics and the topological features formed by these connections. In this study, we presented a graph-driven framework named BioGSF for RE from the literature by integrating shortest dependency paths (SDP) with an entity-pair graph through the employment of a graph neural network model. Initially, we leveraged dependency relationships to obtain the SDP between entities and incorporated this information into the entity-pair graph. Subsequently, a graph attention network was utilized to acquire the topological information of the entity-pair graph. Ultimately, the obtained topological information was combined with the semantic features of the contextual information for relation classification. Our method was evaluated on two distinct datasets, namely S4 and BioRED. The outcomes reveal that BioGSF not only attains superior performance over previous models, with micro-F1 scores of 96.68% (S4) and 96.03% (BioRED), but also demands the shortest running time. BioGSF emerges as an efficient framework for biomedical RE.
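One ingredient of the framework described above, the shortest dependency path between two entity mentions, can be sketched with a general-purpose dependency parser. The spaCy English model and the example sentence are illustrative assumptions, not the authors' biomedical parser or implementation.

```python
# Minimal sketch of extracting the shortest dependency path (SDP) between two
# entity mentions. Uses spaCy's general-purpose English model purely for
# illustration; not the BioGSF implementation.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("Aspirin significantly reduces the risk of myocardial infarction.")

# Build an undirected graph over token indices using dependency arcs.
graph = nx.Graph()
for token in doc:
    for child in token.children:
        graph.add_edge(token.i, child.i)

def shortest_dependency_path(doc, start_idx, end_idx):
    """Return the tokens on the shortest dependency path between two token indices."""
    path = nx.shortest_path(graph, source=start_idx, target=end_idx)
    return [doc[i].text for i in path]

# Token 0 = "Aspirin", token 7 = "infarction" (indices depend on tokenization).
print(shortest_dependency_path(doc, 0, 7))
```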
Affiliations
- Yang Yang: Computing Science and Artificial Intelligence College, Suzhou City University, No. 1188 Wuzhong Avenue, Wuzhong District Suzhou, Suzhou 215004, China; Suzhou Key Lab of Multi-modal Data Fusion and Intelligent Healthcare, No. 1188 Wuzhong Avenue, Wuzhong District Suzhou, Suzhou 215004, China; School of Computer Science & Technology, Soochow University, No. 1 Shizi Street, Suzhou 215000, China
- Zixuan Zheng: School of Computer Science & Technology, Soochow University, No. 1 Shizi Street, Suzhou 215000, China
- Yuyang Xu: School of Computer Science & Technology, Soochow University, No. 1 Shizi Street, Suzhou 215000, China
- Huifang Wei: School of Basic Medical Sciences, Suzhou Medical College of Soochow University, No. 199 Renai Road, SIP, Suzhou 215123, China
- Wenying Yan: Suzhou Key Lab of Multi-modal Data Fusion and Intelligent Healthcare, No. 1188 Wuzhong Avenue, Wuzhong District Suzhou, Suzhou 215004, China; School of Basic Medical Sciences, Suzhou Medical College of Soochow University, No. 199 Renai Road, SIP, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, No. 199 Renai Road, SIP, Suzhou 215123, China
16. Ding X, Duan S, Zhang Z. Semantic-guided attention and adaptive gating for document-level relation extraction. Sci Rep 2024;14:26628. [PMID: 39496763] [PMCID: PMC11535381] [DOI: 10.1038/s41598-024-78051-9]
Abstract
In natural language processing, document-level relation extraction is a complex task that aims to predict the relationships among entities by capturing contextual interactions from an unstructured document. Existing graph- and transformer-based models capture long-range relational facts across sentences. However, they still cannot fully exploit the semantic information from multiple interactive sentences, resulting in the exclusion of influential sentences for related entities. To address this problem, a novel Semantic-guided Attention and Adaptively Gated (SAAG) model is developed for document-level relation extraction. First, a semantic-guided attention module is designed to guide sentence representation by assigning different attention scores to different words. The multihead attention mechanism is then used to capture the attention of different subspaces further to generate a document context representation. Finally, the SAAG model exploits the semantic information by leveraging a gating mechanism that can dynamically distinguish between local and global contexts. The experimental results demonstrate that the SAAG model outperforms previous models on two public datasets.
Affiliations
- Xiaoyao Ding: Department of Intelligent Culture and Tourism, The Open University of Henan, Zhengzhou, 450046, China
- Shaopeng Duan: Department of Information Engineering, The Open University of Henan, Zhengzhou, 450046, China
- Zheng Zhang: Resource Construction and Management Center, The Open University of Henan, Zhengzhou, 450046, China
17. Mundotiya RK, Priya J, Kuwarbi D, Singh T. Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model. IEEE/ACM Trans Comput Biol Bioinform 2024;21:1934-1941. [PMID: 39012749] [DOI: 10.1109/tcbb.2024.3429234]
Abstract
One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.
18. Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024;159:104731. [PMID: 39368529] [DOI: 10.1016/j.jbi.2024.104731]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
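The two-stage recipe in the abstract can be sketched as follows: build a mixture of a general-domain NER dataset and the target BioNER dataset for the first training stage, then continue fine-tuning on the biomedical data alone. The tiny in-memory datasets and the 50/50 mixing ratio are illustrative assumptions, not the exact GERBERA configuration.

```python
# Sketch of mixing a general-domain NER dataset with a target BioNER dataset for
# stage-1 multi-task training, followed by stage-2 fine-tuning on the BioNER data
# alone. The toy examples and 50/50 ratio are assumptions, not GERBERA's settings.
from datasets import Dataset, interleave_datasets

general = Dataset.from_dict({
    "tokens": [["Barack", "Obama", "visited", "Berlin", "."]],
    "ner_tags": [["B-PER", "I-PER", "O", "B-LOC", "O"]],
})
biomed = Dataset.from_dict({
    "tokens": [["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]],
    "ner_tags": [["O", "O", "B-Gene", "O", "B-Disease", "I-Disease", "O"]],
})

# Stage 1: interleave the two sources (tag sets would be mapped onto a shared scheme
# before feeding a token-classification model, as in the fine-tuning sketch for entry 1).
stage1_mix = interleave_datasets([general, biomed], probabilities=[0.5, 0.5], seed=42)

# Stage 2: continue fine-tuning on the biomedical dataset only.
stage2 = biomed

print(len(stage1_mix), stage1_mix[0]["tokens"])
```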
Affiliations
- Yu Yin: Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Hyunjae Kim: Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Xiao Xiao: Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Chih Hsuan Wei: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Jaewoo Kang: Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Hua Xu: Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
- Meng Fang: Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Qingyu Chen: Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
19. Liu S, Wang A, Xiu X, Zhong M, Wu S. Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study. JMIR Med Inform 2024;12:e59782. [PMID: 39419501] [PMCID: PMC11528166] [DOI: 10.2196/59782]
Abstract
BACKGROUND Named entity recognition (NER) models are essential for extracting structured information from unstructured medical texts by identifying entities such as diseases, treatments, and conditions, enhancing clinical decision-making and research. Innovations in machine learning, particularly those involving Bidirectional Encoder Representations From Transformers (BERT)-based deep learning and large language models, have significantly advanced NER capabilities. However, their performance varies across medical datasets due to the complexity and diversity of medical terminology. Previous studies have often focused on overall performance, neglecting specific challenges in medical contexts and the impact of macrofactors like lexical composition on prediction accuracy. These gaps hinder the development of optimized NER models for medical applications. OBJECTIVE This study aims to meticulously evaluate the performance of various NER models in the context of medical text analysis, focusing on how complex medical terminology affects entity recognition accuracy. Additionally, we explored the influence of macrofactors on model performance, seeking to provide insights for refining NER models and enhancing their reliability for medical applications. METHODS This study comprehensively evaluated 7 NER models-hidden Markov models, conditional random fields, BERT for Biomedical Text Mining, Big Transformer Models for Efficient Long-Sequence Attention, Decoding-enhanced BERT with Disentangled Attention, Robustly Optimized BERT Pretraining Approach, and Gemma-across 3 medical datasets: Revised Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), BioCreative V CDR, and Anatomical Entity Mention (AnatEM). The evaluation focused on prediction accuracy, resource use (eg, central processing unit and graphics processing unit use), and the impact of fine-tuning hyperparameters. The macrofactors affecting model performance were also screened using the multilevel factor elimination algorithm. RESULTS The fine-tuned BERT for Biomedical Text Mining, with balanced resource use, generally achieved the highest prediction accuracy across the Revised JNLPBA and AnatEM datasets, with microaverage (AVG_MICRO) scores of 0.932 and 0.8494, respectively, highlighting its superior proficiency in identifying medical entities. Gemma, fine-tuned using the low-rank adaptation technique, achieved the highest accuracy on the BioCreative V CDR dataset with an AVG_MICRO score of 0.9962 but exhibited variability across the other datasets (AVG_MICRO scores of 0.9088 on the Revised JNLPBA and 0.8029 on AnatEM), indicating a need for further optimization. In addition, our analysis revealed that 2 macrofactors, entity phrase length and the number of entity words in each entity phrase, significantly influenced model performance. CONCLUSIONS This study highlights the essential role of NER models in medical informatics, emphasizing the imperative for model optimization via precise data targeting and fine-tuning. The insights from this study will notably improve clinical decision-making and facilitate the creation of more sophisticated and effective medical NER models.
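Since the comparison above leans on micro-averaged scores (AVG_MICRO), a small sketch of how micro-averaging pools counts across entity types may help; the per-type counts below are invented for illustration.

```python
# Minimal sketch of micro-averaged scoring: pool true/false positives and false
# negatives over all entity types before computing precision, recall, and F1.
# The per-type counts are invented examples, not results from the study.
def micro_prf(counts):
    """counts: dict mapping entity type -> (true_pos, false_pos, false_neg)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

counts = {
    "Disease":  (820, 60, 90),
    "Chemical": (640, 55, 70),
    "Gene":     (410, 45, 65),
}
print("micro P/R/F1:", [round(x, 4) for x in micro_prf(counts)])
```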
Affiliation(s)
- Shengyu Liu, Anran Wang, Xiaolei Xiu, Ming Zhong, Sizhu Wu: Department of Medical Data Sharing, Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China

20
Phan CP, Phan B, Chiang JH. Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini. Database (Oxford) 2024; 2024:baae104. [PMID: 39383312 PMCID: PMC11463225 DOI: 10.1093/database/baae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 08/21/2024] [Accepted: 09/04/2024] [Indexed: 10/11/2024]
Abstract
Despite numerous research efforts by teams participating in the BioCreative VIII Track 01 employing various techniques to achieve the high accuracy of biomedical relation tasks, the overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
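A hedged sketch of the data-augmentation step described above, using the OpenAI chat-completions client; the prompt wording, model name, and relation label are illustrative assumptions rather than the authors' released prompts or the BioCreative VIII schema.

```python
# Sketch of LLM-based data augmentation for relation-extraction training data.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment(sentence: str, head: str, tail: str, relation: str, n: int = 3) -> list[str]:
    """Ask the model to paraphrase a labeled example, keeping the entities and relation."""
    prompt = (
        f"Paraphrase the sentence below {n} times. Keep the entities "
        f"'{head}' and '{tail}' verbatim and preserve the '{relation}' relation. "
        f"Return a JSON list of strings.\n\nSentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    # a sketch only: assumes the model returns a bare JSON list
    return json.loads(resp.choices[0].message.content)

# Each paraphrase inherits the original (head, tail, relation) label and is added
# to the training set before fine-tuning the relation classifier.
augmented = augment("Aspirin inhibits COX-1 activity.", "Aspirin", "COX-1", "inhibitor")
```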
Affiliation(s)
- Cong-Phuoc Phan, Ben Phan, Jung-Hsien Chiang: Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan

21
Dafrallah S, Akhloufi MA. Hospital Re-Admission Prediction Using Named Entity Recognition and Explainable Machine Learning. Diagnostics (Basel) 2024; 14:2151. [PMID: 39410555 PMCID: PMC11475863 DOI: 10.3390/diagnostics14192151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2024] [Revised: 09/15/2024] [Accepted: 09/25/2024] [Indexed: 10/20/2024] Open
Abstract
Early hospital readmission refers to unplanned emergency admission of patients within 30 days of discharge. Predicting early readmission risk before discharge can help to reduce the cost of readmissions for hospitals and decrease the death rate for Intensive Care Unit patients. In this paper, we propose a novel approach for prediction of unplanned hospital readmissions using discharge notes from the MIMIC-III database. This approach is based on first extracting relevant information from clinical reports using a pretrained Named Entity Recognition model called BioMedical-NER, which is built on Bidirectional Encoder Representations from Transformers architecture, with the extracted features then used to train machine learning models to predict unplanned readmissions. Our proposed approach achieves better results on clinical reports compared to the state-of-the-art methods, with an average precision of 88.4% achieved by the Gradient Boosting algorithm. In addition, explainable Artificial Intelligence techniques are applied to provide deeper comprehension of the predictive results.
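A rough sketch of the two-stage recipe (pretrained NER features feeding a gradient boosting classifier); the checkpoint name, feature design, and toy labels are assumptions for illustration only.

```python
# Sketch: extract entity mentions from discharge notes with a pretrained NER pipeline,
# turn entity-type counts into features, and fit a gradient boosting classifier.
from collections import Counter
from transformers import pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer

ner = pipeline("token-classification", model="d4data/biomedical-ner-all",
               aggregation_strategy="simple")   # a public biomedical NER checkpoint (assumption)

def note_to_features(note: str) -> dict:
    """Count detected entity types (e.g., Disease, Medication) in one note."""
    return Counter(ent["entity_group"] for ent in ner(note))

notes = [
    "Admitted with CHF exacerbation; discharged on furosemide and lisinopril.",
    "Elective knee arthroscopy; uneventful recovery, discharged home same day.",
]
labels = [1, 0]  # 1 = readmitted within 30 days (toy labels)

vec = DictVectorizer()
X = vec.fit_transform([note_to_features(n) for n in notes])
clf = GradientBoostingClassifier().fit(X, labels)
```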
Affiliation(s)
- Moulay A. Akhloufi: Perception, Robotics and Intelligent Machines (PRIME), Department of Computer Science, Université de Moncton, Moncton, NB E1A 3E9, Canada

22
Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024; 11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open
Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
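For context, a minimal sketch of the common entity-marker recipe for relation classification of the kind a dataset like EnzChemRED is designed to train; the label set and checkpoint are placeholders, not the dataset's actual schema.

```python
# Sketch of the "entity marker" recipe: wrap the two entities in special tokens and
# train a sequence classifier over the marked sentence. Training loop omitted.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

LABELS = ["NO_RELATION", "SUBSTRATE_OF", "PRODUCT_OF"]   # illustrative labels only
model_name = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens({"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(LABELS))
model.resize_token_embeddings(len(tokenizer))

def mark(sentence, e1_span, e2_span):
    """Insert entity markers around (start, end) character spans; e1 must precede e2."""
    (s1, t1), (s2, t2) = e1_span, e2_span
    return (sentence[:s1] + "[E1]" + sentence[s1:t1] + "[/E1]" +
            sentence[t1:s2] + "[E2]" + sentence[s2:t2] + "[/E2]" + sentence[t2:])

text = mark("Hexokinase phosphorylates glucose to glucose-6-phosphate.", (26, 33), (37, 56))
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(-1).item()   # untrained here
print(LABELS[pred])
```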
Grants
- U24 HG007822 NHGRI NIH HHS
- U41 HG007822 NHGRI NIH HHS
- NIH Intramural Research Program, National Library of Medicine
- Expert curation and evaluation of EnzChemRED at Swiss-Prot were supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and the National Human Genome Research Institute (NHGRI), Office of Director [OD/DPCPSI/ODSS], National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health [U24HG007822], and by the European Union's Horizon Europe Framework Programme (grant number 101080997), supported in Switzerland through the State Secretariat for Education, Research and Innovation (SERI).
- Fundamental Research Funds for the Central Universities [DUT23RC(3)014 to L.L.]
Affiliation(s)
- Po-Ting Lai, Chih-Hsuan Wei, Robert Leaman, Zhiyong Lu: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Alan Bridge: Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
- Ling Luo: School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China

23
Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc 2024; 31:1801-1811. [PMID: 39169867 PMCID: PMC11339542 DOI: 10.1093/jamia/ocae202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2024] [Indexed: 08/23/2024] Open
Affiliation(s)
- Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Yifan Peng: Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
- Trevor Cohen: Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
- Marzyeh Ghassemi: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Chunhua Weng: Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Shubo Tian: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States

24
Li M, Zhou H, Yang H, Zhang R. RT: a Retrieving and Chain-of-Thought framework for few-shot medical named entity recognition. J Am Med Inform Assoc 2024; 31:1929-1938. [PMID: 38708849 PMCID: PMC11339512 DOI: 10.1093/jamia/ocae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 04/10/2024] [Accepted: 04/15/2024] [Indexed: 05/07/2024] Open
Abstract
OBJECTIVES This article aims to enhance the performance of large language models (LLMs) on the few-shot biomedical named entity recognition (NER) task by developing a simple and effective method called the Retrieving and Chain-of-Thought (RT) framework, and to evaluate the improvement after applying the RT framework. MATERIALS AND METHODS Given the remarkable advancements in retrieval-based language models and Chain-of-Thought across various natural language processing tasks, we propose a pioneering RT framework designed to amalgamate both approaches. The RT approach encompasses dedicated modules for information retrieval and Chain-of-Thought processes. In the retrieval module, RT discerns pertinent examples from demonstrations during instructional tuning for each input sentence. Subsequently, the Chain-of-Thought module employs a systematic reasoning process to identify entities. We conducted a comprehensive comparative analysis of our RT framework against 16 other models for few-shot NER tasks on the BC5CDR and NCBI corpora. Additionally, we explored the impacts of negative samples, output formats, and missing data on performance. RESULTS Our proposed RT framework outperforms other LMs for few-shot NER tasks, with micro-F1 scores of 93.50 and 91.76 on the BC5CDR and NCBI corpora, respectively. We found that using both positive and negative samples and Chain-of-Thought (vs Tree-of-Thought) performed better. Additionally, utilization of a partially annotated dataset has a marginal effect on model performance. DISCUSSION This is the first investigation to combine a retrieval-based LLM and Chain-of-Thought methodology to enhance performance in biomedical few-shot NER. The retrieval-based LLM aids in retrieving the most relevant examples for the input sentence, offering crucial knowledge to predict the entity in the sentence. We also conducted a meticulous examination of our methodology, incorporating an ablation study. CONCLUSION The RT framework with LLM has demonstrated state-of-the-art performance on few-shot NER tasks.
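A simplified sketch of the retrieve-then-reason idea (not the authors' exact RT implementation): embed the input sentence, pull the nearest annotated demonstrations, and build a chain-of-thought prompt. The encoder, demonstrations, and prompt wording are assumptions.

```python
# Retrieve similar demonstrations, then prompt an LLM to reason before listing entities.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

demos = [
    ("Cisplatin caused severe nephrotoxicity.", "Chemical: Cisplatin; Disease: nephrotoxicity"),
    ("Mutations in BRCA1 increase breast cancer risk.", "Gene: BRCA1; Disease: breast cancer"),
]
demo_emb = encoder.encode([d[0] for d in demos], convert_to_tensor=True)

def build_prompt(sentence: str, k: int = 1) -> str:
    query = encoder.encode(sentence, convert_to_tensor=True)
    top = util.cos_sim(query, demo_emb)[0].topk(k).indices.tolist()
    shots = "\n".join(f"Sentence: {demos[i][0]}\nEntities: {demos[i][1]}" for i in top)
    return (f"{shots}\n\nSentence: {sentence}\n"
            "Let's think step by step about which spans are biomedical entities, "
            "then output them as 'Type: span' pairs.")

print(build_prompt("Tamoxifen is used to treat estrogen receptor-positive tumors."))
```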
Affiliation(s)
- Mingchen Li: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States
- Huixue Zhou: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States; Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
- Han Yang: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States; Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
- Rui Zhang: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States

25
Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024; 31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open
Abstract
OBJECTIVE Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of the fine-tuned LLMs on diverse biomedical natural language processing (NLP) tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical NLP tasks. MATERIALS AND METHODS We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese datasets) across over 10 task types. Subsequently, these corpora were converted to the instruction data used to fine-tune the general LLM. During the supervised fine-tuning phase, a 2-stage strategy is proposed to optimize the model performance across various tasks. RESULTS Experimental results on 13 test sets, which include named entity recognition, relation extraction, text classification, and question answering tasks, demonstrate that Taiyi achieves superior performance compared to general LLMs. The case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multitasking. CONCLUSION Leveraging rich high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi shows the bilingual multitasking capability through supervised fine-tuning. However, those tasks such as information extraction that are not generation tasks in nature remain challenging for LLM-based generative approaches, and they still underperform the conventional discriminative approaches using smaller language models.
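A small sketch of converting a BIO-tagged NER example into an instruction/response record for supervised fine-tuning; the instruction template is an assumption, not Taiyi's released format.

```python
# Convert word-level BIO tags into an instruction-tuning record.
import json

def to_instruction(tokens: list[str], tags: list[str]) -> dict:
    entities, current, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [], None
    if current:
        entities.append((" ".join(current), cur_type))
    return {
        "instruction": "Extract all biomedical entities from the sentence and give their types.",
        "input": " ".join(tokens),
        "output": "; ".join(f"{text} ({etype})" for text, etype in entities) or "None",
    }

record = to_instruction(["Metformin", "lowers", "blood", "glucose", "."],
                        ["B-Chemical", "O", "O", "O", "O"])
print(json.dumps(record, indent=2))
```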
Affiliation(s)
- Ling Luo, Jinzhong Ning, Yingwen Zhao, Zhijun Wang, Zeyuan Ding, Peng Chen, Weiru Fu, Qinyu Han, Guangtao Xu, Yunzhi Qiu, Dinghao Pan, Jiru Li, Hao Li, Wenduo Feng, Senbo Tu, Yuqi Liu, Zhihao Yang, Jian Wang, Yuanyuan Sun, Hongfei Lin: School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China

26
Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. BIOINFORMATICS ADVANCES 2024; 4:vbae116. [PMID: 39411448 PMCID: PMC11474106 DOI: 10.1093/bioadv/vbae116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/10/2024] [Accepted: 08/04/2024] [Indexed: 10/19/2024]
Abstract
Motivation Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability and implementation All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
Affiliation(s)
- Katerina Nastou: Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Mikaela Koutrouli: Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Sampo Pyysalo: TurkuNLP Group, Department of Computing, University of Turku, Turku, Finland
- Lars Juhl Jensen: Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark

27
Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024; 2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]
Abstract
The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
Affiliation(s)
- Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Zhiyong Lu: National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
- Ling Luo: School of Computer Science and Technology, Dalian University of Technology, No.2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China

28
Aldahdooh J, Tanoli Z, Tang J. Mining drug-target interactions from biomedical literature using chemical and gene descriptions-based ensemble transformer model. BIOINFORMATICS ADVANCES 2024; 4:vbae106. [PMID: 39092007 PMCID: PMC11293871 DOI: 10.1093/bioadv/vbae106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/12/2024] [Revised: 06/30/2024] [Accepted: 07/17/2024] [Indexed: 08/04/2024]
Abstract
Motivation Drug-target interactions (DTIs) play a pivotal role in drug discovery, as it aims to identify potential drug targets and elucidate their mechanism of action. In recent years, the application of natural language processing (NLP), particularly when combined with pre-trained language models, has gained considerable momentum in the biomedical domain, with the potential to mine vast amounts of texts to facilitate the efficient extraction of DTIs from the literature. Results In this article, we approach the task of DTIs as an entity-relationship extraction problem, utilizing different pre-trained transformer language models, such as BERT, to extract DTIs. Our results indicate that an ensemble approach, by combining gene descriptions from the Entrez Gene database with chemical descriptions from the Comparative Toxicogenomics Database (CTD), is critical for achieving optimal performance. The proposed model achieves an F1 score of 80.6 on the hidden DrugProt test set, which is the top-ranked performance among all the submitted models in the official evaluation. Furthermore, we conduct a comparative analysis to evaluate the effectiveness of various gene textual descriptions sourced from Entrez Gene and UniProt databases to gain insights into their impact on the performance. Our findings highlight the potential of NLP-based text mining using gene and chemical descriptions to improve drug-target extraction tasks. Availability and implementation Datasets utilized in this study are accessible at https://dtis.drugtargetcommons.org/.
Affiliation(s)
- Jehad Aldahdooh: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki 00290, Finland; Doctoral Programme in Computer Science, University of Helsinki, Helsinki 00290, Finland
- Ziaurrehman Tanoli: Institute for Molecular Medicine Finland, University of Helsinki, Helsinki 00290, Finland; BioICAWtech, Helsinki 00290, Finland
- Jing Tang: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki 00290, Finland

29
Taub-Tabib H, Shamay Y, Shlain M, Pinhasov M, Polak M, Tiktinsky A, Rahamimov S, Bareket D, Eyal B, Kassis M, Goldberg Y, Kaminski Rosenberg T, Vulfsons S, Ben Sasson M. Identifying symptom etiologies using syntactic patterns and large language models. Sci Rep 2024; 14:16190. [PMID: 39003296 PMCID: PMC11246441 DOI: 10.1038/s41598-024-65645-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Accepted: 06/21/2024] [Indexed: 07/15/2024] Open
Abstract
Differential diagnosis is a crucial aspect of medical practice, as it guides clinicians to accurate diagnoses and effective treatment plans. Traditional resources, such as medical books and services like UpToDate, are constrained by manual curation, potentially missing out on novel or less common findings. This paper introduces and analyzes two novel methods to mine etiologies from the scientific literature. The first method employs a traditional Natural Language Processing (NLP) approach based on syntactic patterns. Through a novel application of human-guided pattern bootstrapping, patterns are derived quickly, and symptom etiologies are extracted with significant coverage. The second method utilizes generative models, specifically GPT-4, coupled with a fact verification pipeline, marking a pioneering application of generative techniques in etiology extraction. Analysis of this second method shows that while it is highly precise, it offers lower coverage than the syntactic approach. Importantly, combining both methodologies yields synergistic outcomes, enhancing the depth and reliability of etiology mining.
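A deliberately simplified, regex-level stand-in for the syntactic-pattern idea; the real system operates over dependency parses with human-guided bootstrapping, so the seed patterns below only illustrate the extraction target.

```python
# Seed patterns linking a symptom to a candidate etiology (toy lexical patterns only).
import re

PATTERNS = [
    re.compile(r"(?P<symptom>[A-Za-z ]+?) (?:is|are) (?:a )?(?:common|frequent)? ?"
               r"(?:symptom|manifestation)s? of (?P<etiology>[A-Za-z ]+)", re.I),
    re.compile(r"(?P<etiology>[A-Za-z ]+?) (?:can cause|causes|leads to) "
               r"(?P<symptom>[A-Za-z ]+)", re.I),
]

def extract(sentence: str):
    for pat in PATTERNS:
        m = pat.search(sentence)
        if m:
            yield m.group("symptom").strip(), m.group("etiology").strip()

for pair in extract("Peripheral neuropathy is a common manifestation of diabetes mellitus"):
    print(pair)   # ('Peripheral neuropathy', 'diabetes mellitus')
```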
Affiliation(s)
- Yosi Shamay: Faculty of Biomedical Engineering, Technion, Haifa, Israel
- Ben Eyal: Allen Institute for AI, Seattle, USA
- Yoav Goldberg: Allen Institute for AI, Seattle, USA; Computer Science Department, Bar Ilan University, Ramat Gan, Israel
- Simon Vulfsons: Institute for Pain Medicine, Rambam Health Campus, Haifa, Israel
- Maayan Ben Sasson: Institute for Pain Medicine, Rambam Health Campus, Haifa, Israel; Alan Edwards Pain Management Unit, McGill University Health Centre, Montreal, QC, Canada

30
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024; 52:W540-W546. [PMID: 38572754 PMCID: PMC11223843 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
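A minimal client sketch for pulling precomputed annotations through the PubTator 3.0 API; the endpoint path and the response layout are assumptions based on the public documentation and should be verified against the current NCBI docs before use.

```python
# Fetch precomputed entity annotations for one PubMed article (BioC-JSON export).
import requests

BASE = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api"

def get_annotations(pmid: str) -> dict:
    resp = requests.get(f"{BASE}/publications/export/biocjson",
                        params={"pmids": pmid}, timeout=30)
    resp.raise_for_status()
    return resp.json()

doc = get_annotations("36528622")   # arbitrary example PMID
print(list(doc)[:5])                # inspect the BioC-JSON structure before parsing further
```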
Affiliation(s)
- Chih-Hsuan Wei, Alexis Allot, Po-Ting Lai, Robert Leaman, Shubo Tian, Ling Luo, Qiao Jin, Zhizheng Wang, Qingyu Chen, Zhiyong Lu: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA

31
Yuan J, Zhang F, Qiu Y, Lin H, Zhang Y. Document-level biomedical relation extraction via hierarchical tree graph and relation segmentation module. Bioinformatics 2024; 40:btae418. [PMID: 38917409 PMCID: PMC11629692 DOI: 10.1093/bioinformatics/btae418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2024] [Revised: 05/27/2024] [Accepted: 06/24/2024] [Indexed: 06/27/2024] Open
Abstract
MOTIVATION Biomedical relation extraction at the document level (Bio-DocRE) involves extracting relation instances from biomedical texts that span multiple sentences, often containing various entity concepts such as genes, diseases, chemicals, variants, etc. Currently, this task is usually implemented based on graphs or transformers. However, most work directly models entity features to relation prediction, ignoring the effectiveness of entity pair information as an intermediate state for relation prediction. In this article, we decouple this task into a three-stage process to capture sufficient information for improving relation prediction. RESULTS We propose an innovative framework HTGRS for Bio-DocRE, which constructs a hierarchical tree graph (HTG) to integrate key information sources in the document, achieving relation reasoning based on entity. In addition, inspired by the idea of semantic segmentation, we conceptualize the task as a table-filling problem and develop a relation segmentation (RS) module to enhance relation reasoning based on the entity pair. Extensive experiments on three datasets show that the proposed framework outperforms the state-of-the-art methods and achieves superior performance. AVAILABILITY AND IMPLEMENTATION Our source code is available at https://github.com/passengeryjy/HTGRS.
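A toy illustration of the table-filling view of document-level relation extraction referenced above; it mirrors only the final pair-scoring step, not the paper's hierarchical tree graph or relation segmentation module.

```python
# Build an N x N table of entity-pair representations and score each cell against the relation set.
import torch
import torch.nn as nn

class PairTableScorer(nn.Module):
    def __init__(self, hidden: int, num_relations: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_relations))

    def forward(self, entity_emb: torch.Tensor) -> torch.Tensor:
        n, h = entity_emb.shape
        heads = entity_emb.unsqueeze(1).expand(n, n, h)   # row entity
        tails = entity_emb.unsqueeze(0).expand(n, n, h)   # column entity
        table = torch.cat([heads, tails], dim=-1)         # (N, N, 2h) pair table
        return self.scorer(table)                         # (N, N, num_relations) logits

logits = PairTableScorer(hidden=768, num_relations=8)(torch.randn(5, 768))
print(logits.shape)   # torch.Size([5, 5, 8])
```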
Affiliation(s)
- Jianyuan Yuan: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
- Fengyu Zhang: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
- Yimeng Qiu: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
- Hongfei Lin: School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yijia Zhang: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China

32
Kosonocky CW, Wilke CO, Marcotte EM, Ellington AD. Mining patents with large language models elucidates the chemical function landscape. DIGITAL DISCOVERY 2024; 3:1150-1159. [PMID: 38873033 PMCID: PMC11167698 DOI: 10.1039/d4dd00011k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Accepted: 05/03/2024] [Indexed: 06/15/2024]
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
Affiliation(s)
- Clayton W Kosonocky: Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA
- Claus O Wilke: Department of Integrative Biology, University of Texas at Austin, Austin, TX 78705, USA
- Edward M Marcotte: Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA; Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78705, USA
- Andrew D Ellington: Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA; Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78705, USA

33
Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024; 23:1915-1925. [PMID: 38733346 PMCID: PMC11165580 DOI: 10.1021/acs.jproteome.3c00367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 01/30/2024] [Accepted: 04/29/2024] [Indexed: 05/13/2024]
Abstract
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
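A compact sketch of a dictionary-plus-keyword annotation pass that emits BIO tags; the tiny lexicon and the "-ase" suffix rule are stand-ins for the curated dictionary and rules used in the paper.

```python
# Turn raw sentences into BIO-tagged training data via dictionary and keyword rules.
import re

ENZYME_DICT = {"hexokinase", "dna polymerase", "lysozyme"}
SUFFIX_RULE = re.compile(r"\w+ases?$", re.I)   # crude keyword rule: tokens ending in -ase/-ases

def bio_annotate(tokens: list[str]) -> list[str]:
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # dictionary matching over 1- and 2-token windows, longest match first
    for size in (2, 1):
        for i in range(len(tokens) - size + 1):
            phrase = " ".join(lowered[i:i + size])
            if phrase in ENZYME_DICT and all(t == "O" for t in tags[i:i + size]):
                tags[i] = "B-Enzyme"
                for j in range(i + 1, i + size):
                    tags[j] = "I-Enzyme"
    # rule-based keyword search for enzymes missing from the dictionary
    for i, tok in enumerate(tokens):
        if tags[i] == "O" and SUFFIX_RULE.match(tok):
            tags[i] = "B-Enzyme"
    return tags

print(bio_annotate("Hexokinase and pyruvate kinase act early in glycolysis".split()))
```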
Affiliation(s)
- Meiqi Wang: Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
- Avish Vijayaraghavan: Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.; UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.
- Tim Beck: School of Medicine, University of Nottingham, Biodiscovery Institute, Nottingham NG7 2RD, U.K.; Health Data Research (HDR) U.K., London NW1 2BE, U.K.
- Joram M. Posma: Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.; Health Data Research (HDR) U.K., London NW1 2BE, U.K.

34
Livne M, Miftahutdinov Z, Tutubalina E, Kuznetsov M, Polykovskiy D, Brundyn A, Jhunjhunwala A, Costa A, Aliper A, Aspuru-Guzik A, Zhavoronkov A. nach0: multimodal natural and chemical languages foundation model. Chem Sci 2024; 15:8380-8389. [PMID: 38846388 PMCID: PMC11151847 DOI: 10.1039/d4sc00966e] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2024] [Accepted: 04/26/2024] [Indexed: 06/09/2024] Open
Abstract
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
Affiliation(s)
- Micha Livne: NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Zulfat Miftahutdinov: Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada
- Elena Tutubalina: Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong
- Maksim Kuznetsov: Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada
- Daniil Polykovskiy: Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada
- Annika Brundyn: NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Anthony Costa: NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Alex Aliper: Insilico Medicine AI Ltd., Level 6, Unit 08, Block A, IRENA HQ Building, Masdar City, Abu Dhabi, United Arab Emirates
- Alán Aspuru-Guzik: University of Toronto, Lash Miller Building, 80 St. George Street, Toronto, Ontario, Canada
- Alex Zhavoronkov: Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong

35
Singh A, Krishnamoorthy S, Ortega JE. NeighBERT: Medical Entity Linking Using Relation-Induced Dense Retrieval. JOURNAL OF HEALTHCARE INFORMATICS RESEARCH 2024; 8:353-369. [PMID: 38681752 PMCID: PMC11052986 DOI: 10.1007/s41666-023-00136-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2022] [Revised: 05/08/2023] [Accepted: 07/03/2023] [Indexed: 05/01/2024]
Abstract
One of the common tasks in clinical natural language processing is medical entity linking (MEL) which involves mention detection followed by linking the mention to an entity in a knowledge base. One reason that MEL has not been solved is due to a problem that occurs in language where ambiguous texts can be resolved to several named entities. This problem is exacerbated when processing the text found in electronic health records. Recent work has shown that deep learning models based on transformers outperform previous methods on linking at higher rates of performance. We introduce NeighBERT, a custom pre-training technique which extends BERT (Devlin et al [1]) by encoding how entities are related within a knowledge graph. This technique adds relational context that has been traditionally missing in original BERT, helping resolve the ambiguity found in clinical text. In our experiments, NeighBERT improves the precision, recall, and F1-score of the state of the art by 1-3 points for named entity recognition and 10-15 points for MEL on two widely known clinical datasets. Supplementary Information The online version contains supplementary material available at 10.1007/s41666-023-00136-3.
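A sketch of the dense-retrieval step in entity linking (encode the mention in its context and the candidate knowledge-base names, then link to the nearest neighbour); it does not reproduce NeighBERT's relation-aware pre-training, and the encoder and toy knowledge base are assumptions.

```python
# Dense-retrieval linking: nearest candidate name by cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

kb = {
    "C0011849": "Diabetes mellitus",
    "C0020538": "Hypertensive disease",
    "C0004096": "Asthma",
}
kb_ids = list(kb)
kb_emb = encoder.encode([kb[c] for c in kb_ids], convert_to_tensor=True)

def link(mention: str, sentence: str) -> str:
    query = encoder.encode(f"{mention} [SEP] {sentence}", convert_to_tensor=True)
    scores = util.cos_sim(query, kb_emb)[0]
    return kb_ids[int(scores.argmax())]

print(link("DM", "Patient has a 10-year history of DM managed with insulin."))
```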
Affiliation(s)
- Ayush Singh: inQbator AI, Evernorth Health Services, Saint Louis, MO, USA
- John E. Ortega: inQbator AI, Evernorth Health Services, Saint Louis, MO, USA

36
Molinet B, Marro S, Cabrio E, Villata S. Explanatory argumentation in natural language for correct and incorrect medical diagnoses. J Biomed Semantics 2024; 15:8. [PMID: 38816758 PMCID: PMC11138001 DOI: 10.1186/s13326-024-00306-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Accepted: 04/12/2024] [Indexed: 06/01/2024] Open
Abstract
BACKGROUND A huge amount of research is carried out nowadays in Artificial Intelligence to propose automated ways to analyse medical data with the aim of supporting doctors in delivering medical diagnoses. However, a main issue of these approaches is the lack of transparency and interpretability of the achieved results, making it hard to employ such methods for educational purposes. It is therefore necessary to develop new frameworks to enhance explainability in these solutions. RESULTS In this paper, we present a novel full pipeline to automatically generate natural language explanations for medical diagnoses. The proposed solution starts from a clinical case description associated with a list of correct and incorrect diagnoses and, through the extraction of the relevant symptoms and findings, enriches the information contained in the description with verified medical knowledge from an ontology. Finally, the system returns a pattern-based explanation in natural language which elucidates why the correct (incorrect) diagnosis is the correct (incorrect) one. The main contribution of the paper is twofold: first, we propose two novel linguistic resources for the medical domain (i.e., a dataset of 314 clinical cases annotated with medical entities from UMLS, and a database of biological boundaries for common findings), and second, a full Information Extraction pipeline to extract symptoms and findings from the clinical cases and match them with the terms in a medical ontology and with the biological boundaries. An extensive evaluation of the proposed approach shows that our method outperforms comparable approaches. CONCLUSIONS Our goal is to offer an AI-assisted educational support framework to train clinical residents to formulate sound and exhaustive explanations for their diagnoses to patients.
Affiliation(s)
- Benjamin Molinet, Santiago Marro, Elena Cabrio, Serena Villata: Université Côte d'Azur, CNRS, Inria, I3S, Rte des Lucioles, Sophia Antipolis, 06900, Alpes-Maritimes, France

37
Rouhizadeh H, Nikishina I, Yazdani A, Bornet A, Zhang B, Ehrsam J, Gaudet-Blavignac C, Naderi N, Teodoro D. A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models. Sci Data 2024; 11:455. [PMID: 38704422 PMCID: PMC11069517 DOI: 10.1038/s41597-024-03317-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Accepted: 04/25/2024] [Indexed: 05/06/2024] Open
Abstract
Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20'156 instances, covering over 7'400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.
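A sketch of the Word-in-Context task format that BioWiC adopts: a sentence pair sharing a target term is jointly encoded and classified as same versus different meaning; the model and marker scheme are placeholders, and the classifier here is untrained.

```python
# WiC-style pair classification: does the shared term have the same meaning in both sentences?
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def same_sense(term: str, sent_a: str, sent_b: str) -> bool:
    # mark the target term so the encoder knows which word is being disambiguated
    text_a = sent_a.replace(term, f"<< {term} >>")
    text_b = sent_b.replace(term, f"<< {term} >>")
    inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return bool(logits.argmax(-1).item())   # untrained here; fine-tune on BioWiC pairs first

print(same_sense("cold",
                 "The patient presented with a common cold and mild fever.",
                 "Exposure to cold temperatures triggered hemolysis in the patient."))
```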
Affiliation(s)
- Hossein Rouhizadeh: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Irina Nikishina: Department of Informatics, University of Hamburg, Hamburg, Germany
- Anthony Yazdani: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Alban Bornet: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Boya Zhang: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Julien Ehrsam: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Christophe Gaudet-Blavignac: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Nona Naderi: Laboratoire Interdisciplinaire des Sciences du Numerique, CNRS, Paris-Saclay University, Orsay, France
- Douglas Teodoro: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland

38
Alamro H, Gojobori T, Essack M, Gao X. BioBBC: a multi-feature model that enhances the detection of biomedical entities. Sci Rep 2024; 14:7697. [PMID: 38565624 PMCID: PMC10987643 DOI: 10.1038/s41598-024-58334-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 03/27/2024] [Indexed: 04/04/2024] Open
Abstract
The rapid increase in biomedical publications necessitates efficient systems to automatically handle Biomedical Named Entity Recognition (BioNER) tasks in unstructured text. However, accurately detecting biomedical entities is quite challenging due to the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that utilizes multi-feature embeddings and is constructed based on the BERT-BiLSTM-CRF to address the BioNER task. BioBBC consists of three main layers; an embedding layer, a Long Short-Term Memory (Bi-LSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by learning the text through four types of embeddings: part-of-speech tags (POS tags) embedding, char-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is well-constructed and well-optimized for detecting different types of biomedical entities. Based on experimental results, our model outperformed state-of-the-art (SOTA) models with significant improvements based on six benchmark BioNER datasets.
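A compressed sketch of the BERT-BiLSTM tagging stack; the CRF layer and the additional POS, character, and data-specific embeddings described in the abstract are omitted for brevity.

```python
# BERT encoder followed by a BiLSTM and a linear emission layer for token tagging.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertBiLSTMTagger(nn.Module):
    def __init__(self, encoder_name: str, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        return self.classifier(lstm_out)        # (batch, seq_len, num_tags) emission scores

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BertBiLSTMTagger("bert-base-cased", num_tags=5)
batch = tokenizer(["Aspirin inhibits platelet aggregation."], return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)
```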
Collapse
Affiliation(s)
- Hind Alamro
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia
| | - Takashi Gojobori
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
| | - Magbubah Essack
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
| |
Collapse
|
39
|
Zhou X, Fu Q, Xia Y, Wang Y, Lu Y, Chen Y, Chen J. LoGo-GR: A Local to Global Graphical Reasoning Framework for Extracting Structured Information From Biomedical Literature. IEEE J Biomed Health Inform 2024; 28:2314-2325. [PMID: 38265897 DOI: 10.1109/jbhi.2024.3358169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
In the biomedical literature, entities are often distributed across multiple sentences and exhibit complex interactions. As the volume of literature has increased dramatically, it has become impractical to manually extract and maintain biomedical knowledge, which would entail enormous costs. Fortunately, document-level relation extraction can capture associations between entities in complex text, helping researchers efficiently mine structured knowledge from the vast medical literature. However, effectively synthesizing rich global information from context while accurately capturing local dependencies between entities remains a great challenge. In this paper, we propose a Local to Global Graphical Reasoning framework (LoGo-GR) based on a novel Biased Graph Attention mechanism (B-GAT). It learns global context features from a mention-level interaction graph and local relation-path dependencies from an entity-level path graph, and it combines global and local reasoning to capture complex interactions between entities in document-level text. In particular, B-GAT integrates structural dependencies into the standard graph attention mechanism (GAT) as attention biases to adaptively guide information aggregation during graphical reasoning. We evaluate our method on three public biomedical document-level datasets: Drug-Mutation Interaction (DV), Chemical-induced Disease (CDR), and Gene-Disease Association (GDA). LoGo-GR shows advanced and stable performance compared to other state-of-the-art methods: it achieves state-of-the-art performance with 96.14%-97.39% F1 on the DV dataset, and advanced performance with 68.89% F1 and 84.22% F1 on the CDR and GDA datasets, respectively. In addition, LoGo-GR also shows advanced performance on the general-domain document-level relation extraction dataset DocRED, demonstrating that it is an effective and robust document-level relation extraction framework.
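The core idea of the biased attention mechanism, adding a learned structural bias to the standard GAT attention logit, can be sketched as follows. The shapes, edge-type lookup, and bias table are illustrative assumptions rather than the paper's exact B-GAT formulation; in a full layer the biased logits would be softmax-normalized over each node's neighbours before aggregation.

```python
# A minimal sketch of a biased graph attention score: a learned bias for each
# structural dependency type is added to the standard GAT attention logit.
import torch
import torch.nn.functional as F

def biased_attention(h_i, h_j, a, edge_type, bias_table):
    """h_i, h_j: node features of shape (d,); a: attention vector of shape (2d,);
    edge_type: integer id of the structural dependency; bias_table: tensor of shape (num_types,)."""
    logit = F.leaky_relu(torch.dot(a, torch.cat([h_i, h_j])))  # standard GAT score
    return logit + bias_table[edge_type]                       # structural bias term
```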
Collapse
|
40
|
Keloth VK, Hu Y, Xie Q, Peng X, Wang Y, Zheng A, Selek M, Raja K, Wei CH, Jin Q, Lu Z, Chen Q, Xu H. Advancing entity recognition in biomedicine via instruction tuning of large language models. Bioinformatics 2024; 40:btae163. [PMID: 38514400 PMCID: PMC11001490 DOI: 10.1093/bioinformatics/btae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 02/18/2024] [Accepted: 03/19/2024] [Indexed: 03/23/2024] Open
Abstract
MOTIVATION Large Language Models (LLMs) have the potential to revolutionize the field of Natural Language Processing, excelling not only in text generation and reasoning tasks but also in zero-/few-shot learning, swiftly adapting to new tasks with minimal fine-tuning. LLMs have also demonstrated great promise in biomedical and healthcare applications. However, when it comes to Named Entity Recognition (NER), particularly within the biomedical domain, LLMs fall short of the effectiveness exhibited by fine-tuned domain-specific models. One key reason is that NER is typically conceptualized as a sequence labeling task, whereas LLMs are optimized for text generation and reasoning. RESULTS We developed an instruction-based learning paradigm that transforms biomedical NER from a sequence labeling task into a generation task. This paradigm is end-to-end and streamlines training and evaluation by automatically repurposing pre-existing biomedical NER datasets. We further developed BioNER-LLaMA using the proposed paradigm with LLaMA-7B as the foundational LLM. We conducted extensive testing of BioNER-LLaMA on three widely recognized biomedical NER datasets covering entities related to diseases, chemicals, and genes. The results revealed that BioNER-LLaMA consistently achieved F1-scores 5%-30% higher than those of few-shot GPT-4 on datasets with different biomedical entities. We show that a general-domain LLM can match the performance of rigorously fine-tuned PubMedBERT models and PMC-LLaMA, a biomedical-specific language model. Our findings underscore the potential of the proposed paradigm for developing general-domain LLMs that can rival SOTA performance in multi-task, multi-domain scenarios in biomedical and health applications. AVAILABILITY AND IMPLEMENTATION Datasets and other resources are available at https://github.com/BIDS-Xu-Lab/BioNER-LLaMA.
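The recasting of sequence labeling as generation can be illustrated with a small conversion routine: a BIO-tagged sentence becomes an (instruction, response) pair suitable for instruction tuning. The prompt template and output format below are hypothetical, not the exact ones used to train BioNER-LLaMA.

```python
# A minimal sketch of turning a BIO-tagged NER example into an instruction-tuning pair.
# The prompt wording is an illustrative template, not the BioNER-LLaMA template.
def to_instruction_pair(tokens, bio_tags, entity_type="Disease"):
    spans, current = [], []
    for tok, tag in zip(tokens, bio_tags):
        if tag.startswith("B-"):
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))

    instruction = (f"Extract all {entity_type} mentions from the sentence below. "
                   f"Return them as a semicolon-separated list.\n"
                   f"Sentence: {' '.join(tokens)}")
    response = "; ".join(spans) if spans else "None"
    return instruction, response

pair = to_instruction_pair(
    ["Metformin", "reduced", "fasting", "glucose", "in", "type", "2", "diabetes", "."],
    ["O", "O", "O", "O", "O", "B-Disease", "I-Disease", "I-Disease", "O"],
)  # response: "type 2 diabetes"
```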
Collapse
Affiliation(s)
- Vipina K Keloth
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Yan Hu
- McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX-77030, United States
| | - Qianqian Xie
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Xueqing Peng
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Yan Wang
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Andrew Zheng
- William P. Clements High School, Sugar Land, TX-77479, United States
| | - Melih Selek
- Stephen F. Austin High School, Sugar Land, TX-77498, United States
| | - Kalpana Raja
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Chih Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Qiao Jin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Qingyu Chen
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| |
Collapse
|
41
|
Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024; 25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large annotated corpora, often manually or semi-manually, which are vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially when not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster the creation of annotated corpora, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool for annotating biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations and integrates automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, a functionality often overlooked by off-the-shelf annotation tools. RESULTS We conducted a qualitative analysis comparing MetaTron with a set of manual annotation tools, including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of the time and number of clicks required to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance. CONCLUSIONS MetaTron stands out as one of the few annotation tools targeting the biomedical domain that support the annotation of relations, and it is fully customizable with documents in several formats (PDF included) as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
Collapse
Affiliation(s)
- Ornella Irrera
- Department of Information Engineering, University of Padova, Padua, Italy.
| | - Stefano Marchesin
- Department of Information Engineering, University of Padova, Padua, Italy
| | - Gianmaria Silvello
- Department of Information Engineering, University of Padova, Padua, Italy
| |
Collapse
|
42
|
Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, Kim H, Moxon S, Reese JT, Haendel MA, Robinson PN, Mungall CJ. Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 2024; 40:btae104. [PMID: 38383067 PMCID: PMC10924283 DOI: 10.1093/bioinformatics/btae104] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Revised: 12/16/2023] [Accepted: 02/20/2024] [Indexed: 02/23/2024] Open
Abstract
MOTIVATION Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a knowledge extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and to return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical-to-disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but it greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language-interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION SPIRES is available as part of the open-source OntoGPT package: https://github.com/monarch-initiative/ontogpt.
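A minimal sketch of the recursive, schema-guided interrogation idea is shown below: each field of a user-defined schema triggers a prompt to an LLM, and nested sub-schemas are filled recursively. The schema layout and the `call_llm` stub are assumptions for illustration and do not reflect the OntoGPT implementation, which additionally grounds matched elements to ontology identifiers.

```python
# A minimal sketch of schema-guided prompt interrogation in the spirit of SPIRES.
# The schema and the LLM stub are illustrative assumptions, not OntoGPT code.
from typing import Any

SCHEMA = {
    "treatment": {
        "drug": "str",
        "disease": "str",
        "mechanism": {"target": "str", "effect": "str"},  # nested sub-schema
    }
}

def call_llm(prompt: str) -> str:
    # Placeholder: wire a real LLM client in here; the sketch just returns a dummy answer.
    return "placeholder"

def extract(schema: dict, text: str) -> dict[str, Any]:
    record: dict[str, Any] = {}
    for field, spec in schema.items():
        if isinstance(spec, dict):          # recurse into nested sub-schemas
            record[field] = extract(spec, text)
        else:
            prompt = f"From the text below, state the value of '{field}' only.\n\n{text}"
            record[field] = call_llm(prompt).strip()
    return record

result = extract(SCHEMA, "Metformin treats type 2 diabetes by activating AMPK.")
```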
Collapse
Affiliation(s)
- J Harry Caufield
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Harshad Hegde
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Vincent Emonet
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, 6200 MD Maastricht, The Netherlands
| | - Nomi L Harris
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Marcin P Joachimiak
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | | | - HyeongSik Kim
- Robert Bosch LLC, Sunnyvale, CA 94085, United States
| | - Sierra Moxon
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Justin T Reese
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Melissa A Haendel
- Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO 80217, United States
| | | | - Christopher J Mungall
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| |
Collapse
|
43
|
Dagdelen J, Dunn A, Lee S, Walker N, Rosen AS, Ceder G, Persson KA, Jain A. Structured information extraction from scientific text with large language models. Nat Commun 2024; 15:1418. [PMID: 38360817 PMCID: PMC10869356 DOI: 10.1038/s41467-024-45563-x] [Citation(s) in RCA: 55] [Impact Index Per Article: 55.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2023] [Accepted: 01/22/2024] [Indexed: 02/17/2024] Open
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
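A record for this kind of fine-tuning can be sketched as a simple prompt/completion pair whose completion is a JSON-serialized list of extracted relationships. The field names and example sentence below are illustrative assumptions, not the authors' exact schema.

```python
# A minimal sketch of a prompt/completion record for fine-tuning an LLM to emit
# structured JSON from a sentence, in the spirit of the approach described above.
import json

sentence = "Nb-doped TiO2 thin films showed enhanced electrical conductivity."
record = {
    "prompt": f"Extract dopant/host relationships as JSON.\nText: {sentence}\nJSON:",
    "completion": json.dumps([{"host": "TiO2", "dopant": "Nb"}]),
}
print(record["prompt"])
print(record["completion"])
```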
Collapse
Affiliation(s)
- John Dagdelen
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Alexander Dunn
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Sanghoon Lee
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | | | - Andrew S Rosen
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Gerbrand Ceder
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Kristin A Persson
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Anubhav Jain
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
| |
Collapse
|
44
|
Azam M, Chen Y, Arowolo MO, Liu H, Popescu M, Xu D. A Comprehensive Evaluation of Large Language Models in Mining Gene Interactions and Pathway Knowledge. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.21.576542. [PMID: 38328046 PMCID: PMC10849485 DOI: 10.1101/2024.01.21.576542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/09/2024]
Abstract
Background Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways is useful but cannot keep up with the exponential growth of the literature. Large-scale language models (LLMs), notable for their vast parameter sizes and comprehensive training on extensive text corpora, have great potential for automated text mining of biological pathways. Method This study assesses the effectiveness of 21 LLMs, including both API-based models and open-source models. The evaluation focused on two key aspects: gene regulatory relations (specifically, 'activation', 'inhibition', and 'phosphorylation') and KEGG pathway component recognition. The performance of these models was analyzed using statistical metrics such as precision, recall, F1 scores, and the Jaccard similarity index. Results Our results indicated a significant disparity in model performance. Among the API-based models, ChatGPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; Falcon-180b-chat and llama1-7b led with the highest performance in gene regulatory relations (F1 of 0.2787 and 0.1923, respectively) and KEGG pathway recognition (Jaccard similarity index of 0.2237 and 0.2207, respectively). Conclusion LLMs are valuable in biomedical research, especially in gene network analysis and pathway mapping. However, their effectiveness varies, necessitating careful model selection. This work also provides a case study and insights into using LLMs as knowledge graphs.
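For readers unfamiliar with the two headline metrics, the sketch below shows how a Jaccard similarity index over pathway gene sets and a micro-F1 over relation predictions could be computed; the gene names are made-up examples and the code is not the evaluation script used in the study.

```python
# A minimal sketch of the two metrics named above: Jaccard similarity between a
# predicted and a reference pathway gene set, and micro-F1 from count statistics.
def jaccard(pred: set, gold: set) -> float:
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def micro_f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(jaccard({"TP53", "MDM2", "CDKN1A"}, {"TP53", "MDM2", "ATM"}))  # 0.5
```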
Collapse
Affiliation(s)
- Muhammad Azam
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
| | - Yibo Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
| | - Micheal Olaolu Arowolo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
| | - Haowang Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
| | - Mihail Popescu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
- Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, Missouri, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
| |
Collapse
|
45
|
Shao L, Chen B, Zhang Z, Zhang Z, Chen X. Artificial intelligence generated content (AIGC) in medicine: A narrative review. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:1672-1711. [PMID: 38303483 DOI: 10.3934/mbe.2024073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2024]
Abstract
Recently, artificial intelligence generated content (AIGC) has been receiving increased attention and is growing exponentially. AIGC is produced by generative artificial intelligence (AI) models based on the intentional information extracted from human-provided instructions, and it can quickly and automatically generate large amounts of high-quality content. Medicine currently faces a shortage of resources and involves complex procedures, problems that AIGC can help alleviate owing to these characteristics. As a result, the application of AIGC in medicine has gained increased attention in recent years. Therefore, this paper provides a comprehensive review of recent studies involving AIGC in medicine. First, we present an overview of AIGC. Then, based on recent studies, the application of AIGC in medicine is reviewed from two aspects: medical image processing and medical text generation. The basic generative AI models, tasks, target organs, datasets, and contributions of the studies are considered and summarized. Finally, we discuss the limitations and challenges faced by AIGC and propose possible solutions with reference to relevant studies. We hope this review can help readers understand the potential of AIGC in medicine and obtain innovative ideas in this field.
Collapse
Affiliation(s)
- Liangjing Shao
- Academy for Engineering & Technology, Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China
| | - Benshuang Chen
- Academy for Engineering & Technology, Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China
| | - Ziqun Zhang
- Information office, Fudan University, Shanghai 200032, China
| | - Zhen Zhang
- Baoshan Branch of Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200444, China
| | - Xinrong Chen
- Academy for Engineering & Technology, Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China
| |
Collapse
|
46
|
Le ND, Nguyen NTH. A metric learning-based method for biomedical entity linking. Front Res Metr Anal 2023; 8:1247094. [PMID: 38173988 PMCID: PMC10762861 DOI: 10.3389/frma.2023.1247094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2023] [Accepted: 11/29/2023] [Indexed: 01/05/2024] Open
Abstract
Biomedical entity linking is the task of mapping mention(s) that occur in a particular textual context to a unique concept or entity in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of the entity linking task is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations such that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, however, down-sampling reduces the model's ability to comprehensively learn the representations of mentions in different contexts, which is very important. To tackle this issue, we propose a metric-based learning method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representations of mentions and entities. Through evaluations on two challenging biomedical datasets, MedMentions and BC5CDR, we show that our proposed method is able to address the issue of imbalanced data and perform competitively with other state-of-the-art models. Moreover, our method significantly reduces computational cost in both the training and inference steps. Our source code is publicly available here.
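The triplet objective at the heart of the method can be sketched as follows: a mention embedding is pulled toward its gold entity embedding and pushed away from a negative entity embedding by at least a margin. The margin value and similarity measure are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a triplet objective for metric learning in entity linking.
import torch
import torch.nn.functional as F

def triplet_loss(mention, pos_entity, neg_entity, margin: float = 0.2):
    """mention, pos_entity, neg_entity: embedding tensors of matching shape."""
    d_pos = 1.0 - F.cosine_similarity(mention, pos_entity, dim=-1)  # distance to gold entity
    d_neg = 1.0 - F.cosine_similarity(mention, neg_entity, dim=-1)  # distance to negative entity
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```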
Collapse
Affiliation(s)
- Ngoc D. Le
- Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Nhung T. H. Nguyen
- Department of Computer Science, School of Engineering, University of Manchester, Manchester, United Kingdom
| |
Collapse
|
47
|
Kosonocky CW, Wilke CO, Marcotte EM, Ellington AD. Mining Patents with Large Language Models Elucidates the Chemical Function Landscape. ARXIV 2023:arXiv:2309.08765v2. [PMID: 38196747 PMCID: PMC10775343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/11/2024]
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
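One plausible reading of the embedding-based step is that near-duplicate functional labels are consolidated by similarity of their text embeddings; the sketch below illustrates that idea with a toy placeholder embedding and threshold, which are assumptions and not the procedure actually used to build CheF.

```python
# A minimal, assumption-laden sketch of grouping near-duplicate functional labels
# by embedding similarity. The placeholder embedding and threshold are illustrative.
import numpy as np

def embed(label: str) -> np.ndarray:
    # Placeholder embedding (character histogram); swap in a real text-embedding model.
    vec = np.zeros(128)
    for ch in label.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def group_labels(labels: list[str], threshold: float = 0.9) -> list[list[str]]:
    vectors = [embed(l) for l in labels]
    groups: list[list[int]] = []
    for i, v in enumerate(vectors):
        for group in groups:
            rep = vectors[group[0]]
            sim = float(v @ rep / (np.linalg.norm(v) * np.linalg.norm(rep)))
            if sim >= threshold:          # similar enough: join the existing group
                group.append(i)
                break
        else:                             # no similar group found: start a new one
            groups.append([i])
    return [[labels[i] for i in g] for g in groups]

print(group_labels(["antibacterial agent", "antibacterial agents", "kinase inhibitor"]))
```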
Collapse
|
48
|
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research for generating high-quality labelled data that can be used to develop innovative predictive methods. However, building fully labelled, high-quality bioRE data sets of adequate size for training state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limits on researchers' and curators' time and expertise. We show here how Active Learning (AL) plays an important role in resolving this issue and improving bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model and evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy, and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy called Core-set outperforms all strategies. AL strategies are shown to reduce the annotation needed to reach performance on par with training on all data by 6% to 38%, depending on the data set, with the Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. Our experiments show the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets that lead to optimal performance of deep learning models. The code and data sets to reproduce all results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
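The two best-performing uncertainty-based strategies can be sketched in a few lines: given the model's class probabilities on the unlabelled pool, Least-Confident selects the examples with the lowest top-class probability, and Margin Sampling selects those with the smallest gap between the two most probable classes. The batch size parameter `k` and array layout are illustrative assumptions.

```python
# A minimal sketch of two uncertainty-based active learning strategies named above.
# `probs` is an (n_samples, n_classes) array of model probabilities on the unlabelled pool;
# both functions return indices of the examples to annotate next.
import numpy as np

def least_confident(probs: np.ndarray, k: int) -> np.ndarray:
    confidence = probs.max(axis=1)            # probability of the top class
    return np.argsort(confidence)[:k]         # lowest confidence first

def margin_sampling(probs: np.ndarray, k: int) -> np.ndarray:
    top2 = np.sort(probs, axis=1)[:, -2:]     # two highest class probabilities per example
    margin = top2[:, 1] - top2[:, 0]          # small margin = high uncertainty
    return np.argsort(margin)[:k]
```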
Collapse
Affiliation(s)
- Charlotte Nachtegael
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
| | - Jacopo De Stefani
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
| |
Collapse
|
49
|
Kartchner D, Deng J, Lohiya S, Kopparthi T, Bathala P, Domingo-Fernández D, Mitchell CS. A Comprehensive Evaluation of Biomedical Entity Linking Models. PROCEEDINGS OF THE CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING 2023:14462-14478. [PMID: 38756862 PMCID: PMC11097978 DOI: 10.18653/v1/2023.emnlp-main.893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/18/2024]
Abstract
Biomedical entity linking (BioEL) is the process of connecting entities referenced in documents to entries in biomedical databases such as the Unified Medical Language System (UMLS) or Medical Subject Headings (MeSH). The study objective was to comprehensively evaluate nine recent state-of-the-art biomedical entity linking models under a unified framework. We compare these models along axes of (1) accuracy, (2) speed, (3) ease of use, (4) generalization, and (5) adaptability to new ontologies and datasets. We additionally quantify the impact of various preprocessing choices such as abbreviation detection. Systematic evaluation reveals several notable gaps in current methods. In particular, current methods struggle to correctly link genes and proteins and often have difficulty effectively incorporating context into linking decisions. To expedite future development and baseline testing, we release our unified evaluation framework and all included models on GitHub at https://github.com/davidkartchner/biomedical-entity-linking.
Collapse
|
50
|
Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y. A study of generative large language model for medical research and healthcare. NPJ Digit Med 2023; 6:210. [PMID: 37973919 PMCID: PMC10654385 DOI: 10.1038/s41746-023-00958-w] [Citation(s) in RCA: 86] [Impact Index Per Article: 43.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2023] [Accepted: 11/01/2023] [Indexed: 11/19/2023] Open
Abstract
There is enormous enthusiasm, as well as concern, about applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text, including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical natural language processing. We apply GatorTronGPT to generate 20 billion words of synthetic text. NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human text) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human text), and physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
Collapse
Affiliation(s)
- Cheng Peng
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Xi Yang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | - Aokun Chen
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | | | | | | | | | | | - Ying Zhang
- Research Computing, University of Florida, Gainesville, FL, USA
| | - Tanja Magoc
- Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA
| | - Gloria Lipori
- Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA
- Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
| | - Duane A Mitchell
- Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
| | - Naykky S Ospina
- Division of Endocrinology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Mustafa M Ahmed
- Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
| | - William R Hogan
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Elizabeth A Shenkman
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Yi Guo
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | - Yonghui Wu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
| |
Collapse
|