1. Hong G, Hindle V, Veasley NM, Holscher HD, Kilicoglu H. DiMB-RE: mining the scientific literature for diet-microbiome associations. J Am Med Inform Assoc 2025;32:998-1006. [PMID: 40152137] [PMCID: PMC12089768] [DOI: 10.1093/jamia/ocaf054]
Abstract
OBJECTIVES To develop a corpus annotated for diet-microbiome associations from the biomedical literature and train natural language processing (NLP) models to identify these associations, thereby improving the understanding of their role in health and disease, and supporting personalized nutrition strategies. MATERIALS AND METHODS We constructed DiMB-RE, a comprehensive corpus annotated with 15 entity types (eg, Nutrient, Microorganism) and 13 relation types (eg, increases, improves) capturing diet-microbiome associations. We fine-tuned and evaluated state-of-the-art NLP models for named entity, trigger, and relation extraction as well as factuality detection using DiMB-RE. In addition, we benchmarked 2 generative large language models (GPT-4o-mini and GPT-4o) on a subset of the dataset in zero- and one-shot settings. RESULTS DiMB-RE consists of 14 450 entities and 4206 relationships from 165 publications (including 30 full-text Results sections). Fine-tuned NLP models performed reasonably well for named entity recognition (0.800 F1 score), while end-to-end relation extraction performance was modest (0.445 F1). The use of Results section annotations improved relation extraction. The impact of trigger detection was mixed. Generative models showed lower accuracy compared to fine-tuned models. DISCUSSION To our knowledge, DiMB-RE is the largest and most diverse corpus focusing on diet-microbiome interactions. Natural language processing models fine-tuned on DiMB-RE exhibit lower performance compared to similar corpora, highlighting the complexity of information extraction in this domain. Misclassified entities, missed triggers, and cross-sentence relations are the major sources of relation extraction errors. CONCLUSION DiMB-RE can serve as a benchmark corpus for biomedical literature mining. DiMB-RE and the NLP models are available at https://github.com/ScienceNLP-Lab/DiMB-RE.
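As an illustration of the kind of model fine-tuning described above, the following is a minimal sketch of training a transformer token-classification model for entity recognition on BIO-tagged sentences. The BioBERT checkpoint, the toy sentence, and the label set are illustrative assumptions; this is not the authors' DiMB-RE pipeline or its 15-type entity schema.

```python
# Minimal sketch: fine-tuning a pretrained transformer for named entity recognition
# on BIO-tagged sentences. The checkpoint, labels, and single training sentence are
# illustrative only (not the DiMB-RE corpus or the authors' exact configuration).
from datasets import Dataset
from transformers import (AutoModelForTokenClassification, AutoTokenizer,
                          DataCollatorForTokenClassification, Trainer, TrainingArguments)

labels = ["O", "B-Nutrient", "I-Nutrient", "B-Microorganism", "I-Microorganism"]
label2id = {l: i for i, l in enumerate(labels)}

train = Dataset.from_dict({
    "tokens": [["Inulin", "intake", "increased", "Bifidobacterium", "abundance", "."]],
    "ner_tags": [[1, 0, 0, 3, 0, 0]],
})

tokenizer = AutoTokenizer.from_pretrained("dmis-lab/biobert-base-cased-v1.2")

def encode(batch):
    # Align word-level BIO tags with subword tokens; special tokens get label -100.
    enc = tokenizer(batch["tokens"], is_split_into_words=True, truncation=True)
    enc["labels"] = [
        [-100 if w is None else tags[w] for w in enc.word_ids(batch_index=i)]
        for i, tags in enumerate(batch["ner_tags"])
    ]
    return enc

train = train.map(encode, batched=True, remove_columns=["tokens", "ner_tags"])

model = AutoModelForTokenClassification.from_pretrained(
    "dmis-lab/biobert-base-cased-v1.2",
    num_labels=len(labels),
    id2label={i: l for l, i in label2id.items()},
    label2id=label2id,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ner_out", num_train_epochs=1),
    train_dataset=train,
    data_collator=DataCollatorForTokenClassification(tokenizer),
)
trainer.train()
```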
Affiliations
- Gibong Hong: School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States
- Veronica Hindle: Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Nadine M Veasley: Division of Nutritional Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Hannah D Holscher: Department of Food Science and Human Nutrition, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Division of Nutritional Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Personalized Nutrition Initiative, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
- Halil Kilicoglu: School of Information Sciences, University of Illinois Urbana-Champaign, Champaign, IL 61820, United States; Division of Nutritional Sciences, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States; Personalized Nutrition Initiative, University of Illinois Urbana-Champaign, Urbana, IL 61801, United States
2. He T, Kreimeyer K, Najjar M, Spiker J, Fatteh M, Anagnostou V, Botsis T. Artificial Intelligence-assisted Biomedical Literature Knowledge Synthesis to Support Decision-making in Precision Oncology. AMIA Annu Symp Proc 2025;2024:513-522. [PMID: 40417512] [PMCID: PMC12099343]
Abstract
The delivery of effective targeted therapies requires comprehensive analyses of the molecular profiling of tumors and matching with clinical phenotypes in the context of existing knowledge described in biomedical literature, registries, and knowledge bases. We evaluated the performance of natural language processing (NLP) approaches in supporting knowledge retrieval and synthesis from the biomedical literature. We tested PubTator 3.0, Bidirectional Encoder Representations from Transformers (BERT), and Large Language Models (LLMs) and evaluated their ability to support named entity recognition (NER) and relation extraction (RE) from biomedical texts. PubTator 3.0 and the BioBERT model performed best in the NER task (best F1-scores of 0.93 and 0.89, respectively), while BioBERT outperformed all other solutions in the RE task (best F1-score 0.79) and, in a specific use case to which it was applied, recognized nearly all entity mentions and most of the relations. Our findings support the use of AI-assisted approaches in facilitating precision oncology decision-making.
Affiliations
- Ting He: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
- Kory Kreimeyer: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
- Mimi Najjar: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; The Johns Hopkins Molecular Tumor Board, Johns Hopkins School of Medicine, Baltimore, MD
- Jonathan Spiker: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
- Maria Fatteh: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; The Johns Hopkins Molecular Tumor Board, Johns Hopkins School of Medicine, Baltimore, MD
- Valsamo Anagnostou: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; The Johns Hopkins Molecular Tumor Board, Johns Hopkins School of Medicine, Baltimore, MD
- Taxiarchis Botsis: Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD; Division of Quantitative Sciences, Sidney Kimmel Comprehensive Cancer Center, Johns Hopkins School of Medicine, Baltimore, MD
3. Li W, Wang H, Li W, Zhao J, Sun Y. Generation-Based Few-Shot BioNER via Local Knowledge Index and Dual Prompts. Interdiscip Sci 2025. [PMID: 40347393] [DOI: 10.1007/s12539-025-00709-3]
Abstract
Few-shot Biomedical Named Entity Recognition (BioNER) presents significant challenges due to limited training data and the presence of nested and discontinuous entities. To tackle these issues, a novel approach GKP-BioNER, Generation-based Few-Shot BioNER via Local Knowledge Index and Dual Prompts, is proposed. It redefines BioNER as a generation task by integrating hard and soft prompts. Specifically, GKP-BioNER constructs a localized knowledge index using a Wikipedia dump, facilitating the retrieval of semantically relevant texts to the original sentence. These texts are then reordered to prioritize the most semantically relevant content to the input data, serving as hard prompts. This helps the model to address challenges demanding domain-specific insights. Simultaneously, GKP-BioNER preserves the integrity of the pre-trained models while introducing learnable parameters as soft prompts to guide the self-attention layer, allowing the model to adapt to the context. Moreover, a soft prompt mechanism is designed to support knowledge transfer across domains. Extensive experiments on five datasets demonstrate that GKP-BioNER significantly outperforms eight state-of-the-art methods. It shows robust performance in low-resource and complex scenarios across various domains, highlighting its strength in knowledge transfer and broad applicability.
Affiliations
- Weixin Li: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Hong Wang: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Wei Li: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Jun Zhao: School of Information Science and Engineering, Shandong Normal University, Jinan, 250358, China
- Yanshen Sun: Department of Computer Science, Virginia Tech, Blacksburg, 24061, USA
4. Chen Q, Hu Y, Peng X, Xie Q, Jin Q, Gilson A, Singer MB, Ai X, Lai PT, Wang Z, Keloth VK, Raja K, Huang J, He H, Lin F, Du J, Zhang R, Zheng WJ, Adelman RA, Lu Z, Xu H. Benchmarking large language models for biomedical natural language processing applications and recommendations. Nat Commun 2025;16:3280. [PMID: 40188094] [PMCID: PMC11972378] [DOI: 10.1038/s41467-025-56989-2]
Abstract
The rapid growth of biomedical literature poses challenges for manual knowledge curation and synthesis. Biomedical Natural Language Processing (BioNLP) automates the process. While Large Language Models (LLMs) have shown promise in general domains, their effectiveness in BioNLP tasks remains unclear due to limited benchmarks and practical guidelines. We perform a systematic evaluation of four LLMs (GPT and LLaMA representatives) on 12 BioNLP benchmarks across six applications. We compare their zero-shot, few-shot, and fine-tuning performance with the traditional fine-tuning of BERT or BART models. We examine inconsistencies, missing information, and hallucinations, and perform a cost analysis. Here, we show that traditional fine-tuning outperforms zero- or few-shot LLMs in most tasks. However, closed-source LLMs like GPT-4 excel in reasoning-related tasks such as medical question answering. Open-source LLMs still require fine-tuning to close performance gaps. We find issues like missing information and hallucinations in LLM outputs. These results offer practical insights for applying LLMs in BioNLP.
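To make the zero-shot setting concrete, here is a minimal sketch of prompting a general-purpose LLM for a BioNLP-style extraction task with the OpenAI Python client, for contrast with task-specific fine-tuning. The model name, prompt wording, and output format are assumptions for illustration, not the benchmark's actual protocol.

```python
# Minimal sketch: zero-shot entity extraction with a general-purpose LLM.
# The model name, prompt, and output format are illustrative assumptions,
# not the paper's exact evaluation protocol.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

sentence = "Inulin supplementation increased the relative abundance of Bifidobacterium."

prompt = (
    "Extract all chemical and organism entities from the sentence below. "
    'Return a JSON list of {"text": ..., "type": ...} objects.\n\n'
    f"Sentence: {sentence}"
)

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": prompt}],
    temperature=0,
)
print(response.choices[0].message.content)
```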
Affiliations
- Qingyu Chen: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA; National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Yan Hu: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
- Xueqing Peng: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Qianqian Xie: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Qiao Jin: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Aidan Gilson: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Maxwell B Singer: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Xuguang Ai: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Po-Ting Lai: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Zhizheng Wang: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Vipina K Keloth: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Kalpana Raja: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Jimin Huang: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Huan He: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Fongci Lin: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Jingcheng Du: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
- Rui Zhang: Division of Computational Health Sciences, Department of Surgery, Medical School, University of Minnesota, Minneapolis, MN, USA; Center for Learning Health System Sciences, University of Minnesota, Minneapolis, MN, 55455, USA
- W Jim Zheng: McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX, USA
- Ron A Adelman: Department of Ophthalmology and Visual Science, Yale School of Medicine, Yale University, New Haven, CT, USA
- Zhiyong Lu: National Library of Medicine, National Institutes of Health, Bethesda, MD, USA
- Hua Xu: Department of Biomedical Informatics and Data Science, Yale School of Medicine, Yale University, New Haven, CT, USA
5. Zhao D, Mu W, Jia X, Liu S, Chu Y, Meng J, Lin H. Few-shot biomedical NER empowered by LLMs-assisted data augmentation and multi-scale feature extraction. BioData Min 2025;18:28. [PMID: 40181396] [PMCID: PMC11969866] [DOI: 10.1186/s13040-025-00443-y]
Abstract
Named Entity Recognition (NER) is a fundamental task in processing biomedical text. Due to the limited availability of labeled data, researchers have investigated few-shot learning methods to tackle this challenge. However, replicating the performance of fully supervised methods remains difficult in few-shot scenarios. This paper addresses two main issues. In terms of data augmentation, existing methods primarily focus on replacing content in the original text, which can potentially distort the semantics. Furthermore, current approaches often neglect sentence features at multiple scales. To overcome these challenges, we utilize ChatGPT to generate enriched data with distinct semantics for the same entities, thereby reducing noisy data. Simultaneously, we employ dynamic convolution to capture multi-scale semantic information in sentences and enhance feature representation based on PubMedBERT. We evaluated the experiments on four biomedical NER datasets (BC5CDR-Disease, NCBI, BioNLP11EPI, BioNLP13GE), and the results exceeded the current state-of-the-art models in most few-shot scenarios, including mainstream large language models like ChatGPT. The results confirm the effectiveness of the proposed method in data augmentation and model generalization.
Affiliations
- Di Zhao: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China; School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China; Postdoctoral Workstation of Dalian Yongia Electronic Technology Co., Ltd, Dalian, 116024, Liaoning, China
- Wenxuan Mu: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Xiangxing Jia: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Shuang Liu: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Yonghe Chu: Nantong University, Nantong, 226019, Jiangsu, China
- Jiana Meng: School of Computer Science and Engineering, Dalian Minzu University, Jinshitan Street, Jinzhou District, Dalian, 116650, Liaoning, China
- Hongfei Lin: School of Computer Science and Technology, Dalian University of Technology, Dalian, 116024, Liaoning, China
6. Zhang L, Zhong Y, Zheng Q, Liu J, Wang Q, Wang J, Chang X. TDGI: Translation-Guided Double-Graph Inference for Document-Level Relation Extraction. IEEE Trans Pattern Anal Mach Intell 2025;47:2647-2659. [PMID: 40030986] [DOI: 10.1109/tpami.2025.3528246]
Abstract
Document-level relation extraction (DocRE) aims at predicting relations of all entity pairs in one document, which plays an important role in information extraction. DocRE is more challenging than previous sentence-level relation extraction, as it often requires coreference and logical reasoning across multiple sentences. Graph-based methods are the mainstream solution to this complex reasoning in DocRE. They generally construct heterogeneous graphs with entities, mentions, and sentences as nodes, and co-occurrence and co-reference relations as edges. Their performance is difficult to improve further because the semantics and direction of the relation are not jointly considered in the graph inference process. To this end, we propose a novel translation-guided double-graph inference network named TDGI for DocRE. On one hand, TDGI includes two relation semantics-aware and direction-aware reasoning graphs, i.e., a mention graph and an entity graph, to mine relations among long-distance entities more explicitly. Each graph consists of three elements: vectorized nodes, edges, and direction weights. On the other hand, we devise a translation-based graph updating strategy that guides the embeddings of mention/entity nodes, relation edges, and direction weights to follow a specific translation algebraic structure, thereby enhancing the reasoning ability of TDGI. In the training procedure of TDGI, we minimize the relation multi-classification loss and triple contrastive loss together to guarantee the model's stability and robustness. Comprehensive experiments on three widely used datasets show that TDGI achieves outstanding performance compared with state-of-the-art baselines.
7. Ong JCL, Chen MH, Ng N, Elangovan K, Tan NYT, Jin L, Xie Q, Ting DSW, Rodriguez-Monguio R, Bates DW, Liu N. A scoping review on generative AI and large language models in mitigating medication related harm. NPJ Digit Med 2025;8:182. [PMID: 40155703] [PMCID: PMC11953325] [DOI: 10.1038/s41746-025-01565-7]
Abstract
Medication-related harm has a significant impact on global healthcare costs and patient outcomes. Generative artificial intelligence (GenAI) and large language models (LLM) have emerged as a promising tool in mitigating risks of medication-related harm. This review evaluates the scope and effectiveness of GenAI and LLM in reducing medication-related harm. We screened 4 databases for literature published from 1st January 2012 to 15th October 2024. A total of 3988 articles were identified, and 30 met the criteria for inclusion into the final review. Generative AI and LLMs were applied in three key applications: drug-drug interaction identification and prediction, clinical decision support, and pharmacovigilance. While the performance and utility of these models varied, they generally showed promise in early identification, classification of adverse drug events, and supporting decision-making for medication management. However, no studies tested these models prospectively, suggesting a need for further investigation into integration and real-world application.
Affiliations
- Jasmine Chiat Ling Ong: Division of Pharmacy, Singapore General Hospital, Singapore, Singapore; Department of Pharmacy, University of California, San Francisco, CA, USA; Duke-NUS Medical School, Singapore, Singapore
- Ning Ng: Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore
- Kabilan Elangovan: Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Liyuan Jin: Duke-NUS Medical School, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore
- Qihuang Xie: School of Pharmacy, National University of Singapore, Singapore, Singapore
- Daniel Shu Wei Ting: Artificial Intelligence Office, Singapore Health Services, Singapore, Singapore; Singapore Eye Research Institute, Singapore National Eye Centre, Singapore, Singapore; Byers Eye Institute, Stanford University, California, CA, USA
- Rosa Rodriguez-Monguio: Department of Clinical Pharmacy, School of Pharmacy, University of California, San Francisco, CA, USA; Medication Outcomes Center, University of California, San Francisco, CA, USA
- David W Bates: Harvard T.H. Chan School of Public Health, Boston, MA, USA; Division of General Internal Medicine, Brigham and Women's Hospital, Boston, MA, USA
- Nan Liu: Centre for Quantitative Medicine, Duke-NUS Medical School, Singapore, Singapore; Programme in Health Services and Systems Research, Duke-NUS Medical School, Singapore, Singapore; NUS AI Institute, National University of Singapore, Singapore, Singapore
8. Wang L, Hao H, Yan X, Zhou TH, Ryu KH. From biomedical knowledge graph construction to semantic querying: a comprehensive approach. Sci Rep 2025;15:8523. [PMID: 40074859] [PMCID: PMC11904217] [DOI: 10.1038/s41598-025-93334-5]
Abstract
In the biomedical field, the construction and application of knowledge graphs are becoming increasingly important because they can effectively integrate and manage large amounts of complex medical information. This study provides a whole-process approach for the biomedical field, from constructing knowledge graphs to semantic query based on knowledge graphs. In the knowledge graph construction stage, we propose the BioPLBC model, which incorporates BioBERT context-embedded features, part of speech and lexical morphological features to achieve entity annotation of medical texts. Based on the constructed biomedical knowledge graph, we also propose the Adaptive Locating and Expanding Query (ALEQ) algorithm, which improves the query speed by locating and dynamically expanding the query subregion. The experimental results indicate that the BioPLBC model consistently achieves higher accuracy than the baseline model across all datasets, while the ALEQ algorithm achieves different degrees of improvement in query accuracy and speed.
Affiliations
- Ling Wang: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Haoyu Hao: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Xue Yan: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Tie Hua Zhou: School of Computer Science, Northeast Electric Power University, 169 Changchun Street, Jilin, 132012, China
- Keun Ho Ryu: Data Science Laboratory, Faculty of Information Technology, Ton Duc Thang University, Ho Chi Minh City, 700000, Vietnam; Research Institute, Bigsun System Co., Ltd., Seoul, 06266, Republic of Korea; Database and Bioinformatics Laboratory, College of Electrical and Computer Engineering, Chungbuk National University, Cheongju, 28644, Republic of Korea
9. Ramos MC, Collison CJ, White AD. A review of large language models and autonomous agents in chemistry. Chem Sci 2025;16:2514-2572. [PMID: 39829984] [PMCID: PMC11739813] [DOI: 10.1039/d4sc03921a]
Abstract
Large language models (LLMs) have emerged as powerful tools in chemistry, significantly impacting molecule design, property prediction, and synthesis optimization. This review highlights LLM capabilities in these domains and their potential to accelerate scientific discovery through automation. We also review LLM-based autonomous agents: LLMs with a broader set of tools to interact with their surrounding environment. These agents perform diverse tasks such as paper scraping, interfacing with automated laboratories, and synthesis planning. As agents are an emerging topic, we extend the scope of our review beyond chemistry and discuss agent work across other scientific domains. This review covers the recent history, current capabilities, and design of LLMs and autonomous agents, addressing specific challenges, opportunities, and future directions in chemistry. Key challenges include data quality and integration, model interpretability, and the need for standard benchmarks, while future directions point towards more sophisticated multi-modal agents and enhanced collaboration between agents and experimental methods. Due to the quick pace of this field, a repository has been built to keep track of the latest studies: https://github.com/ur-whitelab/LLMs-in-science.
Affiliations
- Mayk Caldas Ramos: FutureHouse Inc., San Francisco, CA, USA; Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
- Christopher J Collison: School of Chemistry and Materials Science, Rochester Institute of Technology, Rochester, NY, USA
- Andrew D White: FutureHouse Inc., San Francisco, CA, USA; Department of Chemical Engineering, University of Rochester, Rochester, NY, USA
10. Ding L, Colavizza G, Zhang Z. Partial Annotation Learning for Biomedical Entity Recognition. IEEE J Biomed Health Inform 2025;29:1409-1418. [PMID: 39312441] [DOI: 10.1109/jbhi.2024.3466294]
Abstract
Named Entity Recognition (NER) is a key task to support biomedical research. In Biomedical Named Entity Recognition (BioNER), obtaining high-quality expert annotated data is laborious and expensive, leading to the development of automatic approaches such as distant supervision. However, manually and automatically generated data often suffer from the unlabeled entity problem, whereby many entity annotations are missing, degrading the performance of full annotation NER models. To conquer this issue, we undertake a systematic exploration of the efficacy of partial annotation learning methods for BioNER, which encompasses a comprehensive evaluation conducted across a spectrum of distinct simulated scenarios of missing entity annotations. Furthermore, we propose a TS-PubMedBERT-Partial-CRF partial annotation learning model. We standardize a compilation of 16 BioNER corpora, encompassing a range of five distinct entity types, to establish a gold standard. And we compare against the state-of-the-art partial annotation model EER-PubMedBERT, the widely acknowledged partial annotation model BiLSTM-Partial-CRF model, and the state-of-the-art full annotation learning BioNER model PubMedBERT tagger. Results show that partial annotation learning-based methods can effectively learn from biomedical corpora with missing entity annotations. Our proposed model outperforms alternatives and, specifically, the PubMedBERT tagger by 38% in F1-score under high missing entity rates. Moreover, the recall of entity mentions in our model demonstrates a competitive alignment with the upper threshold observed on the fully annotated dataset.
11. Borchert F, Llorca I, Roller R, Arnrich B, Schapranow MP. xMEN: a modular toolkit for cross-lingual medical entity normalization. JAMIA Open 2025;8:ooae147. [PMID: 39735785] [PMCID: PMC11671143] [DOI: 10.1093/jamiaopen/ooae147]
Abstract
Objective To improve performance of medical entity normalization across many languages, especially when fewer language resources are available compared to English. Materials and Methods We propose xMEN, a modular system for cross-lingual (x) medical entity normalization (MEN), accommodating both low- and high-resource scenarios. To account for the scarcity of aliases for many target languages and terminologies, we leverage multilingual aliases via cross-lingual candidate generation. For candidate ranking, we incorporate a trainable cross-encoder (CE) model if annotations for the target task are available. To balance the output of general-purpose candidate generators with subsequent trainable re-rankers, we introduce a novel rank regularization term in the loss function for training CEs. For re-ranking without gold-standard annotations, we introduce multiple new weakly labeled datasets using machine translation and projection of annotations from a high-resource language. Results xMEN improves the state-of-the-art performance across various benchmark datasets for several European languages. Weakly supervised CEs are effective when no training data is available for the target task. Discussion We perform an analysis of normalization errors, revealing that complex entities are still challenging to normalize. New modules and benchmark datasets can be easily integrated in the future. Conclusion xMEN exhibits strong performance for medical entity normalization in many languages, even when no labeled data and few terminology aliases for the target language are available. To enable reproducible benchmarks in the future, we make the system available as an open-source Python toolkit. The pre-trained models and source code are available online: https://github.com/hpi-dhc/xmen.
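The candidate-generation step described above can be pictured with a small sketch: rank terminology aliases by character n-gram TF-IDF similarity to a mention, then hand the top candidates to a trainable re-ranker. This is a generic illustration of the idea only, not xMEN's actual implementation; the concept IDs and aliases are made-up examples.

```python
# Generic sketch of dictionary candidate generation for entity normalization:
# score terminology aliases against a mention with character n-gram TF-IDF.
# Not xMEN's implementation; concept IDs and aliases are invented examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

aliases = {
    "C0011849": "diabetes mellitus",
    "C0020538": "hypertensive disease",
    "C0004096": "asthma",
}

vectorizer = TfidfVectorizer(analyzer="char_wb", ngram_range=(2, 4))
alias_matrix = vectorizer.fit_transform(list(aliases.values()))

def generate_candidates(mention, k=2):
    """Return the k concept IDs whose aliases are most similar to the mention."""
    scores = cosine_similarity(vectorizer.transform([mention]), alias_matrix)[0]
    ranked = sorted(zip(aliases.keys(), scores), key=lambda pair: -pair[1])
    return ranked[:k]

# A trainable cross-encoder would then re-rank these candidates.
print(generate_candidates("type 2 diabetes"))
```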
Affiliations
- Florian Borchert: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Ignacio Llorca: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Roland Roller: Speech and Language Technology Lab, German Research Center for Artificial Intelligence (DFKI), Berlin 10559, Germany
- Bert Arnrich: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
- Matthieu-P Schapranow: Hasso Plattner Institute for Digital Engineering, University of Potsdam, Potsdam 14482, Germany
12. Bhushan RC, Donthi RK, Chilukuri Y, Srinivasarao U, Swetha P. Biomedical named entity recognition using improved green anaconda-assisted Bi-GRU-based hierarchical ResNet model. BMC Bioinformatics 2025;26:34. [PMID: 39885428] [PMCID: PMC11780922] [DOI: 10.1186/s12859-024-06008-w]
Abstract
BACKGROUND Biomedical text mining is a technique that extracts essential information from scientific articles using named entity recognition (NER). Traditional NER methods rely on dictionaries, rules, or curated corpora, which may not always be accessible. To overcome these challenges, deep learning (DL) methods have emerged. However, DL-based NER methods may struggle to identify long-distance relationships within text and require significant annotated datasets. RESULTS This research proposes a novel model to address these challenges in natural language processing: the Improved Green Anaconda-assisted Bi-GRU-based Hierarchical ResNet BNER model (IGa-BiHR BNERM). The IGa-BiHR BNERM has shown promising results in accurately identifying named entities. The MACCROBAT dataset was obtained from Kaggle and underwent several pre-processing steps, such as stop-word filtering, WordNet processing, removal of non-alphanumeric characters, stemming, segmentation, and tokenization, which standardized the text and improved its quality. The pre-processed text was fed into a feature extraction model, the Robustly Optimized BERT Whole Word Masking model, which provides word embeddings with semantic information. The BNER process then utilized the IGa-BiHR BNERM. CONCLUSION To improve the training phase of the IGa-BiHR BNERM, the Improved Green Anaconda Optimization technique was used to select optimal weight parameter coefficients for training the model parameters. When tested on the MACCROBAT dataset, the model outperformed previous models with an accuracy of 99.11%. This model effectively and accurately identifies biomedical names within text, significantly advancing this field.
Affiliations
- Rakesh Kumar Donthi: Department of CSE, GITAM (Deemed to be) University Hyderabad, Rudraram, India
- Yojitha Chilukuri: St. Jude Children's Cancer Research Hospital, 262 Danny Thomas Place, Memphis, TN, 38105, USA
- Polisetty Swetha: Department of Information Technology, Vardhaman College of Engineering, Shamshabad, Hyderabad, India
13. Azam M, Chen Y, Arowolo MO, Liu H, Popescu M, Xu D. A comprehensive evaluation of large language models in mining gene relations and pathway knowledge. Quant Biol 2024;12:360-374. [PMID: 39364206] [PMCID: PMC11446478] [DOI: 10.1002/qub2.57]
Abstract
Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways cannot keep up with the exponential growth of new discoveries in the literature. Large-scale language models (LLMs) trained on extensive text corpora contain rich biological information, and they can be mined as a biological knowledge graph. This study assesses 21 LLMs, including both application programming interface (API)-based models and open-source models, in their capacity to retrieve biological knowledge. The evaluation focuses on predicting gene regulatory relations (activation, inhibition, and phosphorylation) and the Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway components. Results indicated a significant disparity in model performance. The API-based models GPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; among them, Falcon-180b and llama2-7b had the highest F1 scores of 0.2787 and 0.1923 for gene regulatory relations, respectively. The KEGG pathway recognition had a Jaccard similarity index of 0.2237 for Falcon-180b and 0.2207 for llama2-7b. Our study suggests that LLMs are informative in gene network analysis and pathway mapping, but their effectiveness varies, necessitating careful model selection. This work also provides a case study and insight into using LLMs as knowledge graphs. Our code is publicly available on GitHub (Muh-aza).
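A minimal sketch of the two evaluation measures used above may help: micro-F1 over predicted gene regulatory relations and Jaccard similarity between predicted and reference pathway gene sets. The gene names and labels are invented examples, not data from the study.

```python
# Minimal sketch of the two evaluation measures mentioned above: micro-F1 over
# predicted gene regulatory relations and Jaccard similarity between predicted
# and reference KEGG pathway gene sets. Labels and gene sets are invented examples.
from sklearn.metrics import f1_score

# Regulatory relation labels for a handful of gene pairs (gold vs. model output).
gold_relations = ["activation", "inhibition", "phosphorylation", "activation"]
pred_relations = ["activation", "activation", "phosphorylation", "activation"]
print("micro-F1:", f1_score(gold_relations, pred_relations, average="micro"))

# Jaccard similarity between reference and predicted pathway members.
reference_pathway = {"EGFR", "KRAS", "BRAF", "MAP2K1", "MAPK1"}
predicted_pathway = {"EGFR", "KRAS", "BRAF", "TP53"}
jaccard = len(reference_pathway & predicted_pathway) / len(reference_pathway | predicted_pathway)
print("Jaccard index:", round(jaccard, 4))
```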
Affiliations
- Muhammad Azam: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Yibo Chen: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
- Micheal Olaolu Arowolo: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Haowang Liu: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Mihail Popescu: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA; Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, Missouri, USA
- Dong Xu: Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA; Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA; Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
14. Yang Y, Lu Y, Zheng Z, Wu H, Lin Y, Qian F, Yan W. MKG-GC: A multi-task learning-based knowledge graph construction framework with personalized application to gastric cancer. Comput Struct Biotechnol J 2024;23:1339-1347. [PMID: 38585647] [PMCID: PMC10995799] [DOI: 10.1016/j.csbj.2024.03.021]
Abstract
Over the past decade, information for precision disease medicine has accumulated in the form of textual data. To effectively utilize this expanding medical text, we proposed a multi-task learning-based framework based on hard parameter sharing for knowledge graph construction (MKG), and then used it to automatically extract gastric cancer (GC)-related biomedical knowledge from the literature and identify GC drug candidates. In MKG, we designed three separate modules, MT-BGIPN, MT-SGTF and MT-ScBERT, for entity recognition, entity normalization, and relation classification, respectively. To address the challenges posed by the long and irregular naming of medical entities, the MT-BGIPN utilized bidirectional gated recurrent unit and interactive pointer network techniques, significantly improving entity recognition accuracy to an average F1 value of 84.5% across datasets. In MT-SGTF, we employed the term frequency-inverse document frequency and the gated attention unit. These combine both semantic and characteristic features of entities, resulting in an average Hits@ 1 score of 94.5% across five datasets. The MT-ScBERT integrated cross-text, entity, and context features, yielding an average F1 value of 86.9% across 11 relation classification datasets. Based on the MKG, we then developed a specific knowledge graph for GC (MKG-GC), which encompasses a total of 9129 entities and 88,482 triplets. Lastly, the MKG-GC was used to predict potential GC drugs using a pre-trained language model called BioKGE-BERT and a drug-disease discriminant model based on CNN-BiLSTM. Remarkably, nine out of the top ten predicted drugs have been previously reported as effective for gastric cancer treatment. Finally, an online platform was created for exploration and visualization of MKG-GC at https://www.yanglab-mi.org.cn/MKG-GC/.
Affiliations
- Yang Yang: Computing Science and Artificial Intelligence College, Suzhou City University, Suzhou 215004, China; School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Yuwei Lu: School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Zixuan Zheng: School of Computer Science & Technology, Soochow University, Suzhou 215000, China
- Hao Wu: Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China
- Yuxin Lin: Center for Systems Biology, Soochow University, Suzhou 215123, China; Department of Urology, the First Affiliated Hospital of Soochow University, Suzhou 215000, China
- Fuliang Qian: Center for Systems Biology, Soochow University, Suzhou 215123, China; Medical Center of Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
- Wenying Yan: Department of Bioinformatics, School of Biology and Basic Medical Sciences, Suzhou Medical College of Soochow University, Suzhou 215123, China; Center for Systems Biology, Soochow University, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, Suzhou 215123, China
15. Yang Y, Zheng Z, Xu Y, Wei H, Yan W. BioGSF: a graph-driven semantic feature integration framework for biomedical relation extraction. Brief Bioinform 2024;26:bbaf025. [PMID: 39853110] [PMCID: PMC11759886] [DOI: 10.1093/bib/bbaf025]
Abstract
The automatic and accurate extraction of diverse biomedical relations from literature constitutes a core element of medical knowledge graphs, which are indispensable for healthcare artificial intelligence. Currently, fine-tuning through stacking various neural networks on pre-trained language models (PLMs) represents a common framework for end-to-end resolution of the biomedical relation extraction (RE) problem. Nevertheless, sequence-based PLMs, to a certain extent, fail to fully exploit the connections between semantics and the topological features formed by these connections. In this study, we presented a graph-driven framework named BioGSF for RE from the literature by integrating shortest dependency paths (SDP) with an entity-pair graph through the employment of a graph neural network model. Initially, we leveraged dependency relationships to obtain the SDP between entities and incorporated this information into the entity-pair graph. Subsequently, a graph attention network was utilized to acquire the topological information of the entity-pair graph. Ultimately, the obtained topological information was combined with the semantic features of the contextual information for relation classification. Our method was evaluated on two distinct datasets, namely S4 and BioRED. The outcomes reveal that BioGSF not only attains superior performance over previous models, with micro-F1 scores of 96.68% (S4) and 96.03% (BioRED), but also demands the shortest running time. BioGSF emerges as an efficient framework for biomedical RE.
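One ingredient of the framework described above, the shortest dependency path between two entity mentions, can be sketched with a general-purpose dependency parser. The spaCy English model and the example sentence are illustrative assumptions, not the authors' biomedical parser or implementation.

```python
# Minimal sketch of extracting the shortest dependency path (SDP) between two
# entity mentions. Uses spaCy's general-purpose English model purely for
# illustration; not the BioGSF implementation.
import spacy
import networkx as nx

nlp = spacy.load("en_core_web_sm")
doc = nlp("Aspirin significantly reduces the risk of myocardial infarction.")

# Build an undirected graph over token indices using dependency arcs.
graph = nx.Graph()
for token in doc:
    for child in token.children:
        graph.add_edge(token.i, child.i)

def shortest_dependency_path(doc, start_idx, end_idx):
    """Return the tokens on the shortest dependency path between two token indices."""
    path = nx.shortest_path(graph, source=start_idx, target=end_idx)
    return [doc[i].text for i in path]

# Token 0 = "Aspirin", token 7 = "infarction" (indices depend on tokenization).
print(shortest_dependency_path(doc, 0, 7))
```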
Affiliations
- Yang Yang: Computing Science and Artificial Intelligence College, Suzhou City University, No. 1188 Wuzhong Avenue, Wuzhong District Suzhou, Suzhou 215004, China; Suzhou Key Lab of Multi-modal Data Fusion and Intelligent Healthcare, No. 1188 Wuzhong Avenue, Wuzhong District Suzhou, Suzhou 215004, China; School of Computer Science & Technology, Soochow University, No. 1 Shizi Street, Suzhou 215000, China
- Zixuan Zheng: School of Computer Science & Technology, Soochow University, No. 1 Shizi Street, Suzhou 215000, China
- Yuyang Xu: School of Computer Science & Technology, Soochow University, No. 1 Shizi Street, Suzhou 215000, China
- Huifang Wei: School of Basic Medical Sciences, Suzhou Medical College of Soochow University, No. 199 Renai Road, SIP, Suzhou 215123, China
- Wenying Yan: Suzhou Key Lab of Multi-modal Data Fusion and Intelligent Healthcare, No. 1188 Wuzhong Avenue, Wuzhong District Suzhou, Suzhou 215004, China; School of Basic Medical Sciences, Suzhou Medical College of Soochow University, No. 199 Renai Road, SIP, Suzhou 215123, China; Jiangsu Province Engineering Research Center of Precision Diagnostics and Therapeutics Development, Soochow University, No. 199 Renai Road, SIP, Suzhou 215123, China
16. Ding X, Duan S, Zhang Z. Semantic-guided attention and adaptive gating for document-level relation extraction. Sci Rep 2024;14:26628. [PMID: 39496763] [PMCID: PMC11535381] [DOI: 10.1038/s41598-024-78051-9]
Abstract
In natural language processing, document-level relation extraction is a complex task that aims to predict the relationships among entities by capturing contextual interactions from an unstructured document. Existing graph- and transformer-based models capture long-range relational facts across sentences. However, they still cannot fully exploit the semantic information from multiple interactive sentences, resulting in the exclusion of influential sentences for related entities. To address this problem, a novel Semantic-guided Attention and Adaptively Gated (SAAG) model is developed for document-level relation extraction. First, a semantic-guided attention module is designed to guide sentence representation by assigning different attention scores to different words. The multihead attention mechanism is then used to capture the attention of different subspaces further to generate a document context representation. Finally, the SAAG model exploits the semantic information by leveraging a gating mechanism that can dynamically distinguish between local and global contexts. The experimental results demonstrate that the SAAG model outperforms previous models on two public datasets.
Affiliations
- Xiaoyao Ding: Department of Intelligent Culture and Tourism, The Open University of Henan, Zhengzhou, 450046, China
- Shaopeng Duan: Department of Information Engineering, The Open University of Henan, Zhengzhou, 450046, China
- Zheng Zhang: Resource Construction and Management Center, The Open University of Henan, Zhengzhou, 450046, China
17. Mundotiya RK, Priya J, Kuwarbi D, Singh T. Enhancing Generalizability in Biomedical Entity Recognition: Self-Attention PCA-CLS Model. IEEE/ACM Trans Comput Biol Bioinform 2024;21:1934-1941. [PMID: 39012749] [DOI: 10.1109/tcbb.2024.3429234]
Abstract
One of the primary tasks in the early stages of data mining involves the identification of entities from biomedical corpora. Traditional approaches relying on robust feature engineering face challenges when learning from available (un-)annotated data using data-driven models like deep learning-based architectures. Despite leveraging large corpora and advanced deep learning models, domain generalization remains an issue. Attention mechanisms are effective in capturing longer sentence dependencies and extracting semantic and syntactic information from limited annotated datasets. To address out-of-vocabulary challenges in biomedical text, the PCA-CLS (Position and Contextual Attention with CNN-LSTM-Softmax) model combines global self-attention and character-level convolutional neural network techniques. The model's performance is evaluated on eight distinct biomedical domain datasets encompassing entities such as genes, drugs, diseases, and species. The PCA-CLS model outperforms several state-of-the-art models, achieving notable F-scores, including 88.19% on BC2GM, 85.44% on JNLPBA, 90.80% on BC5CDR-chemical, 87.07% on BC5CDR-disease, 89.18% on BC4CHEMD, 88.81% on NCBI, and 91.59% on the s800 dataset.
18. Yin Y, Kim H, Xiao X, Wei CH, Kang J, Lu Z, Xu H, Fang M, Chen Q. Augmenting biomedical named entity recognition with general-domain resources. J Biomed Inform 2024;159:104731. [PMID: 39368529] [DOI: 10.1016/j.jbi.2024.104731]
Abstract
OBJECTIVE Training a neural network-based biomedical named entity recognition (BioNER) model usually requires extensive and costly human annotations. While several studies have employed multi-task learning with multiple BioNER datasets to reduce human effort, this approach does not consistently yield performance improvements and may introduce label ambiguity in different biomedical corpora. We aim to tackle those challenges through transfer learning from easily accessible resources with fewer concept overlaps with biomedical datasets. METHODS We proposed GERBERA, a simple-yet-effective method that utilized general-domain NER datasets for training. We performed multi-task learning to train a pre-trained biomedical language model with both the target BioNER dataset and the general-domain dataset. Subsequently, we fine-tuned the models specifically for the BioNER dataset. RESULTS We systematically evaluated GERBERA on five datasets of eight entity types, collectively consisting of 81,410 instances. Despite using fewer biomedical resources, our models demonstrated superior performance compared to baseline models trained with additional BioNER datasets. Specifically, our models consistently outperformed the baseline models in six out of eight entity types, achieving an average improvement of 0.9% over the best baseline performance across eight entities. Our method was especially effective in amplifying performance on BioNER datasets characterized by limited data, with a 4.7% improvement in F1 scores on the JNLPBA-RNA dataset. CONCLUSION This study introduces a new training method that leverages cost-effective general-domain NER datasets to augment BioNER models. This approach significantly improves BioNER model performance, making it a valuable asset for scenarios with scarce or costly biomedical datasets. We make data, codes, and models publicly available via https://github.com/qingyu-qc/bioner_gerbera.
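The two-stage recipe in the abstract can be sketched as follows: build a mixture of a general-domain NER dataset and the target BioNER dataset for the first training stage, then continue fine-tuning on the biomedical data alone. The tiny in-memory datasets and the 50/50 mixing ratio are illustrative assumptions, not the exact GERBERA configuration.

```python
# Sketch of mixing a general-domain NER dataset with a target BioNER dataset for
# stage-1 multi-task training, followed by stage-2 fine-tuning on the BioNER data
# alone. The toy examples and 50/50 ratio are assumptions, not GERBERA's settings.
from datasets import Dataset, interleave_datasets

general = Dataset.from_dict({
    "tokens": [["Barack", "Obama", "visited", "Berlin", "."]],
    "ner_tags": [["B-PER", "I-PER", "O", "B-LOC", "O"]],
})
biomed = Dataset.from_dict({
    "tokens": [["Mutations", "in", "BRCA1", "cause", "breast", "cancer", "."]],
    "ner_tags": [["O", "O", "B-Gene", "O", "B-Disease", "I-Disease", "O"]],
})

# Stage 1: interleave the two sources (tag sets would be mapped onto a shared scheme
# before feeding a token-classification model, as in the fine-tuning sketch for entry 1).
stage1_mix = interleave_datasets([general, biomed], probabilities=[0.5, 0.5], seed=42)

# Stage 2: continue fine-tuning on the biomedical dataset only.
stage2 = biomed

print(len(stage1_mix), stage1_mix[0]["tokens"])
```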
Affiliations
- Yu Yin: Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Hyunjae Kim: Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Xiao Xiao: Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Chih Hsuan Wei: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Jaewoo Kang: Department of Computer Science, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841, Republic of Korea
- Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD, 0894, United States of America
- Hua Xu: Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
- Meng Fang: Department of Computer Science, University of Liverpool, Liverpool L69 3DR, United Kingdom
- Qingyu Chen: Department of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT, 06510, United States of America
19. Liu S, Wang A, Xiu X, Zhong M, Wu S. Evaluating Medical Entity Recognition in Health Care: Entity Model Quantitative Study. JMIR Med Inform 2024;12:e59782. [PMID: 39419501] [PMCID: PMC11528166] [DOI: 10.2196/59782]
Abstract
BACKGROUND Named entity recognition (NER) models are essential for extracting structured information from unstructured medical texts by identifying entities such as diseases, treatments, and conditions, enhancing clinical decision-making and research. Innovations in machine learning, particularly those involving Bidirectional Encoder Representations From Transformers (BERT)-based deep learning and large language models, have significantly advanced NER capabilities. However, their performance varies across medical datasets due to the complexity and diversity of medical terminology. Previous studies have often focused on overall performance, neglecting specific challenges in medical contexts and the impact of macrofactors like lexical composition on prediction accuracy. These gaps hinder the development of optimized NER models for medical applications. OBJECTIVE This study aims to meticulously evaluate the performance of various NER models in the context of medical text analysis, focusing on how complex medical terminology affects entity recognition accuracy. Additionally, we explored the influence of macrofactors on model performance, seeking to provide insights for refining NER models and enhancing their reliability for medical applications. METHODS This study comprehensively evaluated 7 NER models-hidden Markov models, conditional random fields, BERT for Biomedical Text Mining, Big Transformer Models for Efficient Long-Sequence Attention, Decoding-enhanced BERT with Disentangled Attention, Robustly Optimized BERT Pretraining Approach, and Gemma-across 3 medical datasets: Revised Joint Workshop on Natural Language Processing in Biomedicine and its Applications (JNLPBA), BioCreative V CDR, and Anatomical Entity Mention (AnatEM). The evaluation focused on prediction accuracy, resource use (eg, central processing unit and graphics processing unit use), and the impact of fine-tuning hyperparameters. The macrofactors affecting model performance were also screened using the multilevel factor elimination algorithm. RESULTS The fine-tuned BERT for Biomedical Text Mining, with balanced resource use, generally achieved the highest prediction accuracy across the Revised JNLPBA and AnatEM datasets, with microaverage (AVG_MICRO) scores of 0.932 and 0.8494, respectively, highlighting its superior proficiency in identifying medical entities. Gemma, fine-tuned using the low-rank adaptation technique, achieved the highest accuracy on the BioCreative V CDR dataset with an AVG_MICRO score of 0.9962 but exhibited variability across the other datasets (AVG_MICRO scores of 0.9088 on the Revised JNLPBA and 0.8029 on AnatEM), indicating a need for further optimization. In addition, our analysis revealed that 2 macrofactors, entity phrase length and the number of entity words in each entity phrase, significantly influenced model performance. CONCLUSIONS This study highlights the essential role of NER models in medical informatics, emphasizing the imperative for model optimization via precise data targeting and fine-tuning. The insights from this study will notably improve clinical decision-making and facilitate the creation of more sophisticated and effective medical NER models.
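Since the comparison above leans on micro-averaged scores (AVG_MICRO), a small sketch of how micro-averaging pools counts across entity types may help; the per-type counts below are invented for illustration.

```python
# Minimal sketch of micro-averaged scoring: pool true/false positives and false
# negatives over all entity types before computing precision, recall, and F1.
# The per-type counts are invented examples, not results from the study.
def micro_prf(counts):
    """counts: dict mapping entity type -> (true_pos, false_pos, false_neg)."""
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

counts = {
    "Disease":  (820, 60, 90),
    "Chemical": (640, 55, 70),
    "Gene":     (410, 45, 65),
}
print("micro P/R/F1:", [round(x, 4) for x in micro_prf(counts)])
```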
Affiliation(s)
- Shengyu Liu, Anran Wang, Xiaolei Xiu, Ming Zhong, Sizhu Wu: Department of Medical Data Sharing, Institute of Medical Information & Library, Chinese Academy of Medical Sciences & Peking Union Medical College, Beijing, China

20
Phan CP, Phan B, Chiang JH. Optimized biomedical entity relation extraction method with data augmentation and classification using GPT-4 and Gemini. Database (Oxford) 2024; 2024:baae104. [PMID: 39383312 PMCID: PMC11463225 DOI: 10.1093/database/baae104] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2024] [Revised: 08/21/2024] [Accepted: 09/04/2024] [Indexed: 10/11/2024]
Abstract
Despite numerous research efforts by teams participating in the BioCreative VIII Track 01 employing various techniques to achieve the high accuracy of biomedical relation tasks, the overall performance in this area still has substantial room for improvement. Large language models bring a new opportunity to improve the performance of existing techniques in natural language processing tasks. This paper presents our improved method for relation extraction, which involves integrating two renowned large language models: Gemini and GPT-4. Our new approach utilizes GPT-4 to generate augmented data for training, followed by an ensemble learning technique to combine the outputs of diverse models to create a more precise prediction. We then employ a method using Gemini responses as input to fine-tune the BioNLP-PubMed-Bert classification model, which leads to improved performance as measured by precision, recall, and F1 scores on the same test dataset used in the challenge evaluation. Database URL: https://biocreative.bioinformatics.udel.edu/tasks/biocreative-viii/track-1/.
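A hedged sketch of the data-augmentation step described above, using the OpenAI chat-completions client; the prompt wording, model name, and relation label are illustrative assumptions rather than the authors' released prompts or the BioCreative VIII schema.

```python
# Sketch of LLM-based data augmentation for relation-extraction training data.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def augment(sentence: str, head: str, tail: str, relation: str, n: int = 3) -> list[str]:
    """Ask the model to paraphrase a labeled example, keeping the entities and relation."""
    prompt = (
        f"Paraphrase the sentence below {n} times. Keep the entities "
        f"'{head}' and '{tail}' verbatim and preserve the '{relation}' relation. "
        f"Return a JSON list of strings.\n\nSentence: {sentence}"
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
        temperature=0.8,
    )
    # a sketch only: assumes the model returns a bare JSON list
    return json.loads(resp.choices[0].message.content)

# Each paraphrase inherits the original (head, tail, relation) label and is added
# to the training set before fine-tuning the relation classifier.
augmented = augment("Aspirin inhibits COX-1 activity.", "Aspirin", "COX-1", "inhibitor")
```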
Affiliation(s)
- Cong-Phuoc Phan, Ben Phan, Jung-Hsien Chiang: Department of Computer Science and Information Engineering, National Cheng Kung University, No. 1, University Road, Tainan City 701, Taiwan

21
Dafrallah S, Akhloufi MA. Hospital Re-Admission Prediction Using Named Entity Recognition and Explainable Machine Learning. Diagnostics (Basel) 2024; 14:2151. [PMID: 39410555 PMCID: PMC11475863 DOI: 10.3390/diagnostics14192151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/27/2024] [Revised: 09/15/2024] [Accepted: 09/25/2024] [Indexed: 10/20/2024] Open
Abstract
Early hospital readmission refers to unplanned emergency admission of patients within 30 days of discharge. Predicting early readmission risk before discharge can help to reduce the cost of readmissions for hospitals and decrease the death rate for Intensive Care Unit patients. In this paper, we propose a novel approach for prediction of unplanned hospital readmissions using discharge notes from the MIMIC-III database. This approach is based on first extracting relevant information from clinical reports using a pretrained Named Entity Recognition model called BioMedical-NER, which is built on Bidirectional Encoder Representations from Transformers architecture, with the extracted features then used to train machine learning models to predict unplanned readmissions. Our proposed approach achieves better results on clinical reports compared to the state-of-the-art methods, with an average precision of 88.4% achieved by the Gradient Boosting algorithm. In addition, explainable Artificial Intelligence techniques are applied to provide deeper comprehension of the predictive results.
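A rough sketch of the two-stage recipe (pretrained NER features feeding a gradient boosting classifier); the checkpoint name, feature design, and toy labels are assumptions for illustration only.

```python
# Sketch: extract entity mentions from discharge notes with a pretrained NER pipeline,
# turn entity-type counts into features, and fit a gradient boosting classifier.
from collections import Counter
from transformers import pipeline
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.feature_extraction import DictVectorizer

ner = pipeline("token-classification", model="d4data/biomedical-ner-all",
               aggregation_strategy="simple")   # a public biomedical NER checkpoint (assumption)

def note_to_features(note: str) -> dict:
    """Count detected entity types (e.g., Disease, Medication) in one note."""
    return Counter(ent["entity_group"] for ent in ner(note))

notes = [
    "Admitted with CHF exacerbation; discharged on furosemide and lisinopril.",
    "Elective knee arthroscopy; uneventful recovery, discharged home same day.",
]
labels = [1, 0]  # 1 = readmitted within 30 days (toy labels)

vec = DictVectorizer()
X = vec.fit_transform([note_to_features(n) for n in notes])
clf = GradientBoostingClassifier().fit(X, labels)
```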
Affiliation(s)
- Moulay A. Akhloufi: Perception, Robotics and Intelligent Machines (PRIME), Department of Computer Science, Université de Moncton, Moncton, NB E1A 3E9, Canada

22
Lai PT, Coudert E, Aimo L, Axelsen K, Breuza L, de Castro E, Feuermann M, Morgat A, Pourcel L, Pedruzzi I, Poux S, Redaschi N, Rivoire C, Sveshnikova A, Wei CH, Leaman R, Luo L, Lu Z, Bridge A. EnzChemRED, a rich enzyme chemistry relation extraction dataset. Sci Data 2024; 11:982. [PMID: 39251610 PMCID: PMC11384730 DOI: 10.1038/s41597-024-03835-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/01/2024] [Accepted: 08/23/2024] [Indexed: 09/11/2024] Open
Abstract
Expert curation is essential to capture knowledge of enzyme functions from the scientific literature in FAIR open knowledgebases but cannot keep pace with the rate of new discoveries and new publications. In this work we present EnzChemRED, for Enzyme Chemistry Relation Extraction Dataset, a new training and benchmarking dataset to support the development of Natural Language Processing (NLP) methods such as (large) language models that can assist enzyme curation. EnzChemRED consists of 1,210 expert curated PubMed abstracts where enzymes and the chemical reactions they catalyze are annotated using identifiers from the protein knowledgebase UniProtKB and the chemical ontology ChEBI. We show that fine-tuning language models with EnzChemRED significantly boosts their ability to identify proteins and chemicals in text (86.30% F1 score) and to extract the chemical conversions (86.66% F1 score) and the enzymes that catalyze those conversions (83.79% F1 score). We apply our methods to abstracts at PubMed scale to create a draft map of enzyme functions in literature to guide curation efforts in UniProtKB and the reaction knowledgebase Rhea.
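For context, a minimal sketch of the common entity-marker recipe for relation classification of the kind a dataset like EnzChemRED is designed to train; the label set and checkpoint are placeholders, not the dataset's actual schema.

```python
# Sketch of the "entity marker" recipe: wrap the two entities in special tokens and
# train a sequence classifier over the marked sentence. Training loop omitted.
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

LABELS = ["NO_RELATION", "SUBSTRATE_OF", "PRODUCT_OF"]   # illustrative labels only
model_name = "dmis-lab/biobert-base-cased-v1.1"

tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.add_special_tokens({"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=len(LABELS))
model.resize_token_embeddings(len(tokenizer))

def mark(sentence, e1_span, e2_span):
    """Insert entity markers around (start, end) character spans; e1 must precede e2."""
    (s1, t1), (s2, t2) = e1_span, e2_span
    return (sentence[:s1] + "[E1]" + sentence[s1:t1] + "[/E1]" +
            sentence[t1:s2] + "[E2]" + sentence[s2:t2] + "[/E2]" + sentence[t2:])

text = mark("Hexokinase phosphorylates glucose to glucose-6-phosphate.", (26, 33), (37, 56))
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    pred = model(**inputs).logits.argmax(-1).item()   # untrained here
print(LABELS[pred])
```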
Grants
- U24 HG007822 NHGRI NIH HHS
- U41 HG007822 NHGRI NIH HHS
- NIH Intramural Research Program, National Library of Medicine
- Expert curation and evaluation of EnzChemRED at Swiss-Prot were supported by the Swiss Federal Government through the State Secretariat for Education, Research and Innovation (SERI) and the National Human Genome Research Institute (NHGRI), Office of Director [OD/DPCPSI/ODSS], National Institute of Allergy and Infectious Diseases (NIAID), National Institute on Aging (NIA), National Institute of General Medical Sciences (NIGMS), National Institute of Diabetes and Digestive and Kidney Diseases (NIDDK), National Eye Institute (NEI), National Cancer Institute (NCI), National Heart, Lung, and Blood Institute (NHLBI) of the National Institutes of Health [U24HG007822], and by the European Union's Horizon Europe Framework Programme (grant number 101080997), supported in Switzerland through the State Secretariat for Education, Research and Innovation (SERI).
- Fundamental Research Funds for the Central Universities [DUT23RC(3)014 to L.L.]
Affiliation(s)
- Po-Ting Lai, Chih-Hsuan Wei, Robert Leaman, Zhiyong Lu: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA
- Elisabeth Coudert, Lucila Aimo, Kristian Axelsen, Lionel Breuza, Edouard de Castro, Marc Feuermann, Anne Morgat, Lucille Pourcel, Ivo Pedruzzi, Sylvain Poux, Nicole Redaschi, Catherine Rivoire, Anastasia Sveshnikova, Alan Bridge: Swiss-Prot group, SIB Swiss Institute of Bioinformatics, Centre Medical Universitaire, CH-1211 Geneva 4, Switzerland
- Ling Luo: School of Computer Science and Technology, Dalian University of Technology, 116024 Dalian, China

23
Lu Z, Peng Y, Cohen T, Ghassemi M, Weng C, Tian S. Large language models in biomedicine and health: current research landscape and future directions. J Am Med Inform Assoc 2024; 31:1801-1811. [PMID: 39169867 PMCID: PMC11339542 DOI: 10.1093/jamia/ocae202] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/15/2024] [Indexed: 08/23/2024] Open
Affiliation(s)
- Zhiyong Lu: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States
- Yifan Peng: Department of Population Health Sciences, Weill Cornell Medicine, New York, NY 10065, United States
- Trevor Cohen: Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA 98195, United States
- Marzyeh Ghassemi: Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA 02139, United States
- Chunhua Weng: Department of Biomedical Informatics, Columbia University, New York, NY 10032, United States
- Shubo Tian: National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD 20894, United States

24
Li M, Zhou H, Yang H, Zhang R. RT: a Retrieving and Chain-of-Thought framework for few-shot medical named entity recognition. J Am Med Inform Assoc 2024; 31:1929-1938. [PMID: 38708849 PMCID: PMC11339512 DOI: 10.1093/jamia/ocae095] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/18/2023] [Revised: 04/10/2024] [Accepted: 04/15/2024] [Indexed: 05/07/2024] Open
Abstract
OBJECTIVES This article aims to enhance the performance of large language models (LLMs) on the few-shot biomedical named entity recognition (NER) task by developing a simple and effective method called the Retrieving and Chain-of-Thought (RT) framework, and to evaluate the improvement after applying the RT framework. MATERIALS AND METHODS Given the remarkable advancements in retrieval-based language models and Chain-of-Thought across various natural language processing tasks, we propose a pioneering RT framework designed to amalgamate both approaches. The RT approach encompasses dedicated modules for information retrieval and Chain-of-Thought processes. In the retrieval module, RT discerns pertinent examples from demonstrations during instructional tuning for each input sentence. Subsequently, the Chain-of-Thought module employs a systematic reasoning process to identify entities. We conducted a comprehensive comparative analysis of our RT framework against 16 other models for few-shot NER tasks on the BC5CDR and NCBI corpora. Additionally, we explored the impacts of negative samples, output formats, and missing data on performance. RESULTS Our proposed RT framework outperforms other LMs for few-shot NER tasks, with micro-F1 scores of 93.50 and 91.76 on the BC5CDR and NCBI corpora, respectively. We found that using both positive and negative samples and Chain-of-Thought (vs Tree-of-Thought) performed better. Additionally, utilization of a partially annotated dataset has a marginal effect on model performance. DISCUSSION This is the first investigation to combine a retrieval-based LLM and Chain-of-Thought methodology to enhance performance in biomedical few-shot NER. The retrieval-based LLM aids in retrieving the most relevant examples for the input sentence, offering crucial knowledge to predict the entity in the sentence. We also conducted a meticulous examination of our methodology, incorporating an ablation study. CONCLUSION The RT framework with LLM has demonstrated state-of-the-art performance on few-shot NER tasks.
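A simplified sketch of the retrieve-then-reason idea (not the authors' exact RT implementation): embed the input sentence, pull the nearest annotated demonstrations, and build a chain-of-thought prompt. The encoder, demonstrations, and prompt wording are assumptions.

```python
# Retrieve similar demonstrations, then prompt an LLM to reason before listing entities.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

demos = [
    ("Cisplatin caused severe nephrotoxicity.", "Chemical: Cisplatin; Disease: nephrotoxicity"),
    ("Mutations in BRCA1 increase breast cancer risk.", "Gene: BRCA1; Disease: breast cancer"),
]
demo_emb = encoder.encode([d[0] for d in demos], convert_to_tensor=True)

def build_prompt(sentence: str, k: int = 1) -> str:
    query = encoder.encode(sentence, convert_to_tensor=True)
    top = util.cos_sim(query, demo_emb)[0].topk(k).indices.tolist()
    shots = "\n".join(f"Sentence: {demos[i][0]}\nEntities: {demos[i][1]}" for i in top)
    return (f"{shots}\n\nSentence: {sentence}\n"
            "Let's think step by step about which spans are biomedical entities, "
            "then output them as 'Type: span' pairs.")

print(build_prompt("Tamoxifen is used to treat estrogen receptor-positive tumors."))
```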
Affiliation(s)
- Mingchen Li: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States
- Huixue Zhou: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States; Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
- Han Yang: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States; Institute for Health Informatics, University of Minnesota, Minneapolis, MN 55455, United States
- Rui Zhang: Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN 55455, United States

25
Luo L, Ning J, Zhao Y, Wang Z, Ding Z, Chen P, Fu W, Han Q, Xu G, Qiu Y, Pan D, Li J, Li H, Feng W, Tu S, Liu Y, Yang Z, Wang J, Sun Y, Lin H. Taiyi: a bilingual fine-tuned large language model for diverse biomedical tasks. J Am Med Inform Assoc 2024; 31:1865-1874. [PMID: 38422367 PMCID: PMC11339499 DOI: 10.1093/jamia/ocae037] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/12/2023] [Revised: 01/08/2024] [Accepted: 02/16/2024] [Indexed: 03/02/2024] Open
Abstract
OBJECTIVE Most existing fine-tuned biomedical large language models (LLMs) focus on enhancing performance in monolingual biomedical question answering and conversation tasks. To investigate the effectiveness of the fine-tuned LLMs on diverse biomedical natural language processing (NLP) tasks in different languages, we present Taiyi, a bilingual fine-tuned LLM for diverse biomedical NLP tasks. MATERIALS AND METHODS We first curated a comprehensive collection of 140 existing biomedical text mining datasets (102 English and 38 Chinese datasets) across over 10 task types. Subsequently, these corpora were converted to the instruction data used to fine-tune the general LLM. During the supervised fine-tuning phase, a 2-stage strategy is proposed to optimize the model performance across various tasks. RESULTS Experimental results on 13 test sets, which include named entity recognition, relation extraction, text classification, and question answering tasks, demonstrate that Taiyi achieves superior performance compared to general LLMs. The case study involving additional biomedical NLP tasks further shows Taiyi's considerable potential for bilingual biomedical multitasking. CONCLUSION Leveraging rich high-quality biomedical corpora and developing effective fine-tuning strategies can significantly improve the performance of LLMs within the biomedical domain. Taiyi shows the bilingual multitasking capability through supervised fine-tuning. However, those tasks such as information extraction that are not generation tasks in nature remain challenging for LLM-based generative approaches, and they still underperform the conventional discriminative approaches using smaller language models.
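A small sketch of converting a BIO-tagged NER example into an instruction/response record for supervised fine-tuning; the instruction template is an assumption, not Taiyi's released format.

```python
# Convert word-level BIO tags into an instruction-tuning record.
import json

def to_instruction(tokens: list[str], tags: list[str]) -> dict:
    entities, current, cur_type = [], [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [tok], tag[2:]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                entities.append((" ".join(current), cur_type))
            current, cur_type = [], None
    if current:
        entities.append((" ".join(current), cur_type))
    return {
        "instruction": "Extract all biomedical entities from the sentence and give their types.",
        "input": " ".join(tokens),
        "output": "; ".join(f"{text} ({etype})" for text, etype in entities) or "None",
    }

record = to_instruction(["Metformin", "lowers", "blood", "glucose", "."],
                        ["B-Chemical", "O", "O", "O", "O"])
print(json.dumps(record, indent=2))
```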
Affiliation(s)
- Ling Luo, Jinzhong Ning, Yingwen Zhao, Zhijun Wang, Zeyuan Ding, Peng Chen, Weiru Fu, Qinyu Han, Guangtao Xu, Yunzhi Qiu, Dinghao Pan, Jiru Li, Hao Li, Wenduo Feng, Senbo Tu, Yuqi Liu, Zhihao Yang, Jian Wang, Yuanyuan Sun, Hongfei Lin: School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China

26
Nastou K, Koutrouli M, Pyysalo S, Jensen LJ. CoNECo: a Corpus for Named Entity recognition and normalization of protein Complexes. BIOINFORMATICS ADVANCES 2024; 4:vbae116. [PMID: 39411448 PMCID: PMC11474106 DOI: 10.1093/bioadv/vbae116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 05/29/2024] [Revised: 07/10/2024] [Accepted: 08/04/2024] [Indexed: 10/19/2024]
Abstract
Motivation Despite significant progress in biomedical information extraction, there is a lack of resources for Named Entity Recognition (NER) and Named Entity Normalization (NEN) of protein-containing complexes. Current resources inadequately address the recognition of protein-containing complex names across different organisms, underscoring the crucial need for a dedicated corpus. Results We introduce the Complex Named Entity Corpus (CoNECo), an annotated corpus for NER and NEN of complexes. CoNECo comprises 1621 documents with 2052 entities, 1976 of which are normalized to Gene Ontology. We divided the corpus into training, development, and test sets and trained both a transformer-based and dictionary-based tagger on them. Evaluation on the test set demonstrated robust performance, with F-scores of 73.7% and 61.2%, respectively. Subsequently, we applied the best taggers for comprehensive tagging of the entire openly accessible biomedical literature. Availability and implementation All resources, including the annotated corpus, training data, and code, are available to the community through Zenodo https://zenodo.org/records/11263147 and GitHub https://zenodo.org/records/10693653.
Affiliation(s)
- Katerina Nastou: Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Mikaela Koutrouli: Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark
- Sampo Pyysalo: TurkuNLP Group, Department of Computing, University of Turku, Turku, Finland
- Lars Juhl Jensen: Novo Nordisk Foundation Center for Protein Research, University of Copenhagen, Copenhagen 2200, Denmark

27
Islamaj R, Wei CH, Lai PT, Luo L, Coss C, Gokal Kochar P, Miliaras N, Rodionov O, Sekiya K, Trinh D, Whitman D, Lu Z. The biomedical relationship corpus of the BioRED track at the BioCreative VIII challenge and workshop. Database (Oxford) 2024; 2024:baae071. [PMID: 39126204 PMCID: PMC11315767 DOI: 10.1093/database/baae071] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/19/2024] [Revised: 06/03/2024] [Accepted: 07/09/2024] [Indexed: 08/12/2024]
Abstract
The automatic recognition of biomedical relationships is an important step in the semantic understanding of the information contained in the unstructured text of the published literature. The BioRED track at BioCreative VIII aimed to foster the development of such methods by providing the participants the BioRED-BC8 corpus, a collection of 1000 PubMed documents manually curated for diseases, gene/proteins, chemicals, cell lines, gene variants, and species, as well as pairwise relationships between them which are disease-gene, chemical-gene, disease-variant, gene-gene, chemical-disease, chemical-chemical, chemical-variant, and variant-variant. Furthermore, relationships are categorized into the following semantic categories: positive correlation, negative correlation, binding, conversion, drug interaction, comparison, cotreatment, and association. Unlike most of the previous publicly available corpora, all relationships are expressed at the document level as opposed to the sentence level, and as such, the entities are normalized to the corresponding concept identifiers of the standardized vocabularies, namely, diseases and chemicals are normalized to MeSH, genes (and proteins) to National Center for Biotechnology Information (NCBI) Gene, species to NCBI Taxonomy, cell lines to Cellosaurus, and gene/protein variants to Single Nucleotide Polymorphism Database. Finally, each annotated relationship is categorized as 'novel' depending on whether it is a novel finding or experimental verification in the publication it is expressed in. This distinction helps differentiate novel findings from other relationships in the same text that provides known facts and/or background knowledge. The BioRED-BC8 corpus uses the previous BioRED corpus of 600 PubMed articles as the training dataset and includes a set of newly published 400 articles to serve as the test data for the challenge. All test articles were manually annotated for the BioCreative VIII challenge by expert biocurators at the National Library of Medicine, using the original annotation guidelines, where each article is doubly annotated in a three-round annotation process until full agreement is reached between all curators. This manuscript details the characteristics of the BioRED-BC8 corpus as a critical resource for biomedical named entity recognition and relation extraction. Using this new resource, we have demonstrated advancements in biomedical text-mining algorithm development. Database URL: https://codalab.lisn.upsaclay.fr/competitions/16381.
Affiliation(s)
- Rezarta Islamaj, Chih-Hsuan Wei, Po-Ting Lai, Cathleen Coss, Preeti Gokal Kochar, Nicholas Miliaras, Oleg Rodionov, Keiko Sekiya, Dorothy Trinh, Deborah Whitman, Zhiyong Lu: National Library of Medicine (NLM), National Institutes of Health (NIH), 8600 Rockville Pike, Bethesda, MD 20894, United States
- Ling Luo: School of Computer Science and Technology, Dalian University of Technology, No.2 Linggong Road, Ganjingzi District, Dalian, Liaoning 116024, China

28
Aldahdooh J, Tanoli Z, Tang J. Mining drug-target interactions from biomedical literature using chemical and gene descriptions-based ensemble transformer model. BIOINFORMATICS ADVANCES 2024; 4:vbae106. [PMID: 39092007 PMCID: PMC11293871 DOI: 10.1093/bioadv/vbae106] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/12/2024] [Revised: 06/30/2024] [Accepted: 07/17/2024] [Indexed: 08/04/2024]
Abstract
Motivation Drug-target interactions (DTIs) play a pivotal role in drug discovery, as it aims to identify potential drug targets and elucidate their mechanism of action. In recent years, the application of natural language processing (NLP), particularly when combined with pre-trained language models, has gained considerable momentum in the biomedical domain, with the potential to mine vast amounts of texts to facilitate the efficient extraction of DTIs from the literature. Results In this article, we approach the task of DTIs as an entity-relationship extraction problem, utilizing different pre-trained transformer language models, such as BERT, to extract DTIs. Our results indicate that an ensemble approach, by combining gene descriptions from the Entrez Gene database with chemical descriptions from the Comparative Toxicogenomics Database (CTD), is critical for achieving optimal performance. The proposed model achieves an F1 score of 80.6 on the hidden DrugProt test set, which is the top-ranked performance among all the submitted models in the official evaluation. Furthermore, we conduct a comparative analysis to evaluate the effectiveness of various gene textual descriptions sourced from Entrez Gene and UniProt databases to gain insights into their impact on the performance. Our findings highlight the potential of NLP-based text mining using gene and chemical descriptions to improve drug-target extraction tasks. Availability and implementation Datasets utilized in this study are accessible at https://dtis.drugtargetcommons.org/.
Affiliation(s)
- Jehad Aldahdooh: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki 00290, Finland; Doctoral Programme in Computer Science, University of Helsinki, Helsinki 00290, Finland
- Ziaurrehman Tanoli: Institute for Molecular Medicine Finland, University of Helsinki, Helsinki 00290, Finland; BioICAWtech, Helsinki 00290, Finland
- Jing Tang: Research Program in Systems Oncology, Faculty of Medicine, University of Helsinki, Helsinki 00290, Finland

29
Taub-Tabib H, Shamay Y, Shlain M, Pinhasov M, Polak M, Tiktinsky A, Rahamimov S, Bareket D, Eyal B, Kassis M, Goldberg Y, Kaminski Rosenberg T, Vulfsons S, Ben Sasson M. Identifying symptom etiologies using syntactic patterns and large language models. Sci Rep 2024; 14:16190. [PMID: 39003296 PMCID: PMC11246441 DOI: 10.1038/s41598-024-65645-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2023] [Accepted: 06/21/2024] [Indexed: 07/15/2024] Open
Abstract
Differential diagnosis is a crucial aspect of medical practice, as it guides clinicians to accurate diagnoses and effective treatment plans. Traditional resources, such as medical books and services like UpToDate, are constrained by manual curation, potentially missing out on novel or less common findings. This paper introduces and analyzes two novel methods to mine etiologies from the scientific literature. The first method employs a traditional Natural Language Processing (NLP) approach based on syntactic patterns. Through a novel application of human-guided pattern bootstrapping, patterns are derived quickly, and symptom etiologies are extracted with significant coverage. The second method utilizes generative models, specifically GPT-4, coupled with a fact verification pipeline, marking a pioneering application of generative techniques in etiology extraction. Analysis of this second method shows that while it is highly precise, it offers lower coverage than the syntactic approach. Importantly, combining both methodologies yields synergistic outcomes, enhancing the depth and reliability of etiology mining.
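A deliberately simplified, regex-level stand-in for the syntactic-pattern idea; the real system operates over dependency parses with human-guided bootstrapping, so the seed patterns below only illustrate the extraction target.

```python
# Seed patterns linking a symptom to a candidate etiology (toy lexical patterns only).
import re

PATTERNS = [
    re.compile(r"(?P<symptom>[A-Za-z ]+?) (?:is|are) (?:a )?(?:common|frequent)? ?"
               r"(?:symptom|manifestation)s? of (?P<etiology>[A-Za-z ]+)", re.I),
    re.compile(r"(?P<etiology>[A-Za-z ]+?) (?:can cause|causes|leads to) "
               r"(?P<symptom>[A-Za-z ]+)", re.I),
]

def extract(sentence: str):
    for pat in PATTERNS:
        m = pat.search(sentence)
        if m:
            yield m.group("symptom").strip(), m.group("etiology").strip()

for pair in extract("Peripheral neuropathy is a common manifestation of diabetes mellitus"):
    print(pair)   # ('Peripheral neuropathy', 'diabetes mellitus')
```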
Affiliation(s)
- Yosi Shamay: Faculty of Biomedical Engineering, Technion, Haifa, Israel
- Ben Eyal: Allen Institute for AI, Seattle, USA
- Yoav Goldberg: Allen Institute for AI, Seattle, USA; Computer Science Department, Bar Ilan University, Ramat Gan, Israel
- Simon Vulfsons: Institute for Pain Medicine, Rambam Health Campus, Haifa, Israel
- Maayan Ben Sasson: Institute for Pain Medicine, Rambam Health Campus, Haifa, Israel; Alan Edwards Pain Management Unit, McGill University Health Centre, Montreal, QC, Canada

30
Wei CH, Allot A, Lai PT, Leaman R, Tian S, Luo L, Jin Q, Wang Z, Chen Q, Lu Z. PubTator 3.0: an AI-powered literature resource for unlocking biomedical knowledge. Nucleic Acids Res 2024; 52:W540-W546. [PMID: 38572754 PMCID: PMC11223843 DOI: 10.1093/nar/gkae235] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Revised: 03/02/2024] [Accepted: 03/21/2024] [Indexed: 04/05/2024] Open
Abstract
PubTator 3.0 (https://www.ncbi.nlm.nih.gov/research/pubtator3/) is a biomedical literature resource using state-of-the-art AI techniques to offer semantic and relation searches for key concepts like proteins, genetic variants, diseases and chemicals. It currently provides over one billion entity and relation annotations across approximately 36 million PubMed abstracts and 6 million full-text articles from the PMC open access subset, updated weekly. PubTator 3.0's online interface and API utilize these precomputed entity relations and synonyms to provide advanced search capabilities and enable large-scale analyses, streamlining many complex information needs. We showcase the retrieval quality of PubTator 3.0 using a series of entity pair queries, demonstrating that PubTator 3.0 retrieves a greater number of articles than either PubMed or Google Scholar, with higher precision in the top 20 results. We further show that integrating ChatGPT (GPT-4) with PubTator APIs dramatically improves the factuality and verifiability of its responses. In summary, PubTator 3.0 offers a comprehensive set of features and tools that allow researchers to navigate the ever-expanding wealth of biomedical literature, expediting research and unlocking valuable insights for scientific discovery.
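A minimal client sketch for pulling precomputed annotations through the PubTator 3.0 API; the endpoint path and the response layout are assumptions based on the public documentation and should be verified against the current NCBI docs before use.

```python
# Fetch precomputed entity annotations for one PubMed article (BioC-JSON export).
import requests

BASE = "https://www.ncbi.nlm.nih.gov/research/pubtator3-api"

def get_annotations(pmid: str) -> dict:
    resp = requests.get(f"{BASE}/publications/export/biocjson",
                        params={"pmids": pmid}, timeout=30)
    resp.raise_for_status()
    return resp.json()

doc = get_annotations("36528622")   # arbitrary example PMID
print(list(doc)[:5])                # inspect the BioC-JSON structure before parsing further
```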
Affiliation(s)
- Chih-Hsuan Wei, Alexis Allot, Po-Ting Lai, Robert Leaman, Shubo Tian, Ling Luo, Qiao Jin, Zhizheng Wang, Qingyu Chen, Zhiyong Lu: National Center for Biotechnology Information (NCBI), National Library of Medicine (NLM), National Institutes of Health (NIH), Bethesda, MD 20894, USA

31
Yuan J, Zhang F, Qiu Y, Lin H, Zhang Y. Document-level biomedical relation extraction via hierarchical tree graph and relation segmentation module. Bioinformatics 2024; 40:btae418. [PMID: 38917409 PMCID: PMC11629692 DOI: 10.1093/bioinformatics/btae418] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/23/2024] [Revised: 05/27/2024] [Accepted: 06/24/2024] [Indexed: 06/27/2024] Open
Abstract
MOTIVATION Biomedical relation extraction at the document level (Bio-DocRE) involves extracting relation instances from biomedical texts that span multiple sentences, often containing various entity concepts such as genes, diseases, chemicals, variants, etc. Currently, this task is usually implemented based on graphs or transformers. However, most work directly models entity features to relation prediction, ignoring the effectiveness of entity pair information as an intermediate state for relation prediction. In this article, we decouple this task into a three-stage process to capture sufficient information for improving relation prediction. RESULTS We propose an innovative framework HTGRS for Bio-DocRE, which constructs a hierarchical tree graph (HTG) to integrate key information sources in the document, achieving relation reasoning based on entity. In addition, inspired by the idea of semantic segmentation, we conceptualize the task as a table-filling problem and develop a relation segmentation (RS) module to enhance relation reasoning based on the entity pair. Extensive experiments on three datasets show that the proposed framework outperforms the state-of-the-art methods and achieves superior performance. AVAILABILITY AND IMPLEMENTATION Our source code is available at https://github.com/passengeryjy/HTGRS.
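A toy illustration of the table-filling view of document-level relation extraction referenced above; it mirrors only the final pair-scoring step, not the paper's hierarchical tree graph or relation segmentation module.

```python
# Build an N x N table of entity-pair representations and score each cell against the relation set.
import torch
import torch.nn as nn

class PairTableScorer(nn.Module):
    def __init__(self, hidden: int, num_relations: int):
        super().__init__()
        self.scorer = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.ReLU(),
                                    nn.Linear(hidden, num_relations))

    def forward(self, entity_emb: torch.Tensor) -> torch.Tensor:
        n, h = entity_emb.shape
        heads = entity_emb.unsqueeze(1).expand(n, n, h)   # row entity
        tails = entity_emb.unsqueeze(0).expand(n, n, h)   # column entity
        table = torch.cat([heads, tails], dim=-1)         # (N, N, 2h) pair table
        return self.scorer(table)                         # (N, N, num_relations) logits

logits = PairTableScorer(hidden=768, num_relations=8)(torch.randn(5, 768))
print(logits.shape)   # torch.Size([5, 5, 8])
```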
Affiliation(s)
- Jianyuan Yuan: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
- Fengyu Zhang: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
- Yimeng Qiu: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China
- Hongfei Lin: School of Computer Science and Technology, Dalian University of Technology, Dalian 116024, China
- Yijia Zhang: School of Information Science and Technology, Dalian Maritime University, Dalian 116026, China

32
Kosonocky CW, Wilke CO, Marcotte EM, Ellington AD. Mining patents with large language models elucidates the chemical function landscape. DIGITAL DISCOVERY 2024; 3:1150-1159. [PMID: 38873033 PMCID: PMC11167698 DOI: 10.1039/d4dd00011k] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 01/18/2024] [Accepted: 05/03/2024] [Indexed: 06/15/2024]
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631 K molecule-function pairs, was created using an LLM- and embedding-based method to obtain 1.5 K unique functional labels for approximately 100 K randomly selected molecules from their corresponding 188 K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate through several examples that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an alternative approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
Affiliation(s)
- Clayton W Kosonocky: Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA
- Claus O Wilke: Department of Integrative Biology, University of Texas at Austin, Austin, TX 78705, USA
- Edward M Marcotte: Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA; Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78705, USA
- Andrew D Ellington: Department of Molecular Biosciences, University of Texas at Austin, Austin, TX 78705, USA; Center for Systems and Synthetic Biology, University of Texas at Austin, Austin, TX 78705, USA

33
Wang M, Vijayaraghavan A, Beck T, Posma JM. Vocabulary Matters: An Annotation Pipeline and Four Deep Learning Algorithms for Enzyme Named Entity Recognition. J Proteome Res 2024; 23:1915-1925. [PMID: 38733346 PMCID: PMC11165580 DOI: 10.1021/acs.jproteome.3c00367] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/20/2023] [Revised: 01/30/2024] [Accepted: 04/29/2024] [Indexed: 05/13/2024]
Abstract
Enzymes are indispensable in many biological processes, and with biomedical literature growing exponentially, effective literature review becomes increasingly challenging. Natural language processing methods offer solutions to streamline this process. This study aims to develop an annotated enzyme corpus for training and evaluating enzyme named entity recognition (NER) models. A novel pipeline, combining dictionary matching and rule-based keyword searching, automatically annotated enzyme entities in >4800 full-text publications. Four deep learning NER models were created with different vocabularies (BioBERT/SciBERT) and architectures (BiLSTM/transformer) and evaluated on 526 manually annotated full-text publications. The annotation pipeline achieved an F1-score of 0.86 (precision = 1.00, recall = 0.76), surpassed by fine-tuned transformers for F1-score (BioBERT: 0.89, SciBERT: 0.88) and recall (0.86) with BiLSTM models having higher precision (0.94) than transformers (0.92). The annotation pipeline runs in seconds on standard laptops with almost perfect precision, but was outperformed by fine-tuned transformers in terms of F1-score and recall, demonstrating generalizability beyond the training data. In comparison, SciBERT-based models exhibited higher precision, and BioBERT-based models exhibited higher recall, highlighting the importance of vocabulary and architecture. These models, representing the first enzyme NER algorithms, enable more effective enzyme text mining and information extraction. Codes for automated annotation and model generation are available from https://github.com/omicsNLP/enzymeNER and https://zenodo.org/doi/10.5281/zenodo.10581586.
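A compact sketch of a dictionary-plus-keyword annotation pass that emits BIO tags; the tiny lexicon and the "-ase" suffix rule are stand-ins for the curated dictionary and rules used in the paper.

```python
# Turn raw sentences into BIO-tagged training data via dictionary and keyword rules.
import re

ENZYME_DICT = {"hexokinase", "dna polymerase", "lysozyme"}
SUFFIX_RULE = re.compile(r"\w+ases?$", re.I)   # crude keyword rule: tokens ending in -ase/-ases

def bio_annotate(tokens: list[str]) -> list[str]:
    tags = ["O"] * len(tokens)
    lowered = [t.lower() for t in tokens]
    # dictionary matching over 1- and 2-token windows, longest match first
    for size in (2, 1):
        for i in range(len(tokens) - size + 1):
            phrase = " ".join(lowered[i:i + size])
            if phrase in ENZYME_DICT and all(t == "O" for t in tags[i:i + size]):
                tags[i] = "B-Enzyme"
                for j in range(i + 1, i + size):
                    tags[j] = "I-Enzyme"
    # rule-based keyword search for enzymes missing from the dictionary
    for i, tok in enumerate(tokens):
        if tags[i] == "O" and SUFFIX_RULE.match(tok):
            tags[i] = "B-Enzyme"
    return tags

print(bio_annotate("Hexokinase and pyruvate kinase act early in glycolysis".split()))
```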
Affiliation(s)
- Meiqi Wang: Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.
- Avish Vijayaraghavan: Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.; UKRI Centre for Doctoral Training in AI for Healthcare, Department of Computing, Imperial College London, London SW7 2AZ, U.K.
- Tim Beck: School of Medicine, University of Nottingham, Biodiscovery Institute, Nottingham NG7 2RD, U.K.; Health Data Research (HDR) U.K., London NW1 2BE, U.K.
- Joram M. Posma: Section of Bioinformatics, Division of Systems Medicine, Department of Metabolism, Digestion and Reproduction, Imperial College London, London W12 0NN, U.K.; Health Data Research (HDR) U.K., London NW1 2BE, U.K.

34
Livne M, Miftahutdinov Z, Tutubalina E, Kuznetsov M, Polykovskiy D, Brundyn A, Jhunjhunwala A, Costa A, Aliper A, Aspuru-Guzik A, Zhavoronkov A. nach0: multimodal natural and chemical languages foundation model. Chem Sci 2024; 15:8380-8389. [PMID: 38846388 PMCID: PMC11151847 DOI: 10.1039/d4sc00966e] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/08/2024] [Accepted: 04/26/2024] [Indexed: 06/09/2024] Open
Abstract
Large Language Models (LLMs) have substantially driven scientific progress in various domains, and many papers have demonstrated their ability to tackle complex problems with creative solutions. Our paper introduces a new foundation model, nach0, capable of solving various chemical and biological tasks: biomedical question answering, named entity recognition, molecular generation, molecular synthesis, attributes prediction, and others. nach0 is a multi-domain and multi-task encoder-decoder LLM pre-trained on unlabeled text from scientific literature, patents, and molecule strings to incorporate a range of chemical and linguistic knowledge. We employed instruction tuning, where specific task-related instructions are utilized to fine-tune nach0 for the final set of tasks. To train nach0 effectively, we leverage the NeMo framework, enabling efficient parallel optimization of both base and large model versions. Extensive experiments demonstrate that our model outperforms state-of-the-art baselines on single-domain and cross-domain tasks. Furthermore, it can generate high-quality outputs in molecular and textual formats, showcasing its effectiveness in multi-domain setups.
Affiliation(s)
- Micha Livne: NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Zulfat Miftahutdinov: Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada
- Elena Tutubalina: Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong
- Maksim Kuznetsov: Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada
- Daniil Polykovskiy: Insilico Medicine Canada Inc., 3710-1250 René-Lévesque West, Montreal, Quebec, Canada
- Annika Brundyn: NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Anthony Costa: NVIDIA, 2788 San Tomas Expressway, Santa Clara, CA 95051, USA
- Alex Aliper: Insilico Medicine AI Ltd., Level 6, Unit 08, Block A, IRENA HQ Building, Masdar City, Abu Dhabi, United Arab Emirates
- Alán Aspuru-Guzik: University of Toronto, Lash Miller Building, 80 St. George Street, Toronto, Ontario, Canada
- Alex Zhavoronkov: Insilico Medicine Hong Kong Ltd., Unit 310, 3/F, Building 8W, Phase 2, Hong Kong Science Park, Pak Shek Kok, New Territories, Hong Kong

35
Singh A, Krishnamoorthy S, Ortega JE. NeighBERT: Medical Entity Linking Using Relation-Induced Dense Retrieval. JOURNAL OF HEALTHCARE INFORMATICS RESEARCH 2024; 8:353-369. [PMID: 38681752 PMCID: PMC11052986 DOI: 10.1007/s41666-023-00136-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/09/2022] [Revised: 05/08/2023] [Accepted: 07/03/2023] [Indexed: 05/01/2024]
Abstract
One of the common tasks in clinical natural language processing is medical entity linking (MEL) which involves mention detection followed by linking the mention to an entity in a knowledge base. One reason that MEL has not been solved is due to a problem that occurs in language where ambiguous texts can be resolved to several named entities. This problem is exacerbated when processing the text found in electronic health records. Recent work has shown that deep learning models based on transformers outperform previous methods on linking at higher rates of performance. We introduce NeighBERT, a custom pre-training technique which extends BERT (Devlin et al [1]) by encoding how entities are related within a knowledge graph. This technique adds relational context that has been traditionally missing in original BERT, helping resolve the ambiguity found in clinical text. In our experiments, NeighBERT improves the precision, recall, and F1-score of the state of the art by 1-3 points for named entity recognition and 10-15 points for MEL on two widely known clinical datasets. Supplementary Information The online version contains supplementary material available at 10.1007/s41666-023-00136-3.
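A sketch of the dense-retrieval step in entity linking (encode the mention in its context and the candidate knowledge-base names, then link to the nearest neighbour); it does not reproduce NeighBERT's relation-aware pre-training, and the encoder and toy knowledge base are assumptions.

```python
# Dense-retrieval linking: nearest candidate name by cosine similarity.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

kb = {
    "C0011849": "Diabetes mellitus",
    "C0020538": "Hypertensive disease",
    "C0004096": "Asthma",
}
kb_ids = list(kb)
kb_emb = encoder.encode([kb[c] for c in kb_ids], convert_to_tensor=True)

def link(mention: str, sentence: str) -> str:
    query = encoder.encode(f"{mention} [SEP] {sentence}", convert_to_tensor=True)
    scores = util.cos_sim(query, kb_emb)[0]
    return kb_ids[int(scores.argmax())]

print(link("DM", "Patient has a 10-year history of DM managed with insulin."))
```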
Affiliation(s)
- Ayush Singh: inQbator AI, Evernorth Health Services, Saint Louis, MO, USA
- John E. Ortega: inQbator AI, Evernorth Health Services, Saint Louis, MO, USA

36
Molinet B, Marro S, Cabrio E, Villata S. Explanatory argumentation in natural language for correct and incorrect medical diagnoses. J Biomed Semantics 2024; 15:8. [PMID: 38816758 PMCID: PMC11138001 DOI: 10.1186/s13326-024-00306-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/04/2023] [Accepted: 04/12/2024] [Indexed: 06/01/2024] Open
Abstract
BACKGROUND A huge amount of research is carried out nowadays in Artificial Intelligence to propose automated ways to analyse medical data with the aim of supporting doctors in delivering medical diagnoses. However, a main issue of these approaches is the lack of transparency and interpretability of the achieved results, making it hard to employ such methods for educational purposes. It is therefore necessary to develop new frameworks to enhance explainability in these solutions. RESULTS In this paper, we present a novel full pipeline to automatically generate natural language explanations for medical diagnoses. The proposed solution starts from a clinical case description associated with a list of correct and incorrect diagnoses and, through the extraction of the relevant symptoms and findings, enriches the information contained in the description with verified medical knowledge from an ontology. Finally, the system returns a pattern-based explanation in natural language which elucidates why the correct (incorrect) diagnosis is the correct (incorrect) one. The main contribution of the paper is twofold: first, we propose two novel linguistic resources for the medical domain (i.e., a dataset of 314 clinical cases annotated with medical entities from UMLS, and a database of biological boundaries for common findings), and second, a full Information Extraction pipeline to extract symptoms and findings from the clinical cases and match them with the terms in a medical ontology and with the biological boundaries. An extensive evaluation of the proposed approach shows that our method outperforms comparable approaches. CONCLUSIONS Our goal is to offer an AI-assisted educational support framework to train clinical residents to formulate sound and exhaustive explanations for their diagnoses to patients.
Affiliation(s)
- Benjamin Molinet, Santiago Marro, Elena Cabrio, Serena Villata: Université Côte d'Azur, CNRS, Inria, I3S, Rte des Lucioles, Sophia Antipolis, 06900, Alpes-Maritimes, France

37
Rouhizadeh H, Nikishina I, Yazdani A, Bornet A, Zhang B, Ehrsam J, Gaudet-Blavignac C, Naderi N, Teodoro D. A Dataset for Evaluating Contextualized Representation of Biomedical Concepts in Language Models. Sci Data 2024; 11:455. [PMID: 38704422 PMCID: PMC11069517 DOI: 10.1038/s41597-024-03317-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2023] [Accepted: 04/25/2024] [Indexed: 05/06/2024] Open
Abstract
Due to the complexity of the biomedical domain, the ability to capture semantically meaningful representations of terms in context is a long-standing challenge. Despite important progress in the past years, no evaluation benchmark has been developed to evaluate how well language models represent biomedical concepts according to their corresponding context. Inspired by the Word-in-Context (WiC) benchmark, in which word sense disambiguation is reformulated as a binary classification task, we propose a novel dataset, BioWiC, to evaluate the ability of language models to encode biomedical terms in context. BioWiC comprises 20'156 instances, covering over 7'400 unique biomedical terms, making it the largest WiC dataset in the biomedical domain. We evaluate BioWiC both intrinsically and extrinsically and show that it could be used as a reliable benchmark for evaluating context-dependent embeddings in biomedical corpora. In addition, we conduct several experiments using a variety of discriminative and generative large language models to establish robust baselines that can serve as a foundation for future research.
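A sketch of the Word-in-Context task format that BioWiC adopts: a sentence pair sharing a target term is jointly encoded and classified as same versus different meaning; the model and marker scheme are placeholders, and the classifier here is untrained.

```python
# WiC-style pair classification: does the shared term have the same meaning in both sentences?
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

model_name = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

def same_sense(term: str, sent_a: str, sent_b: str) -> bool:
    # mark the target term so the encoder knows which word is being disambiguated
    text_a = sent_a.replace(term, f"<< {term} >>")
    text_b = sent_b.replace(term, f"<< {term} >>")
    inputs = tokenizer(text_a, text_b, return_tensors="pt", truncation=True)
    with torch.no_grad():
        logits = model(**inputs).logits
    return bool(logits.argmax(-1).item())   # untrained here; fine-tune on BioWiC pairs first

print(same_sense("cold",
                 "The patient presented with a common cold and mild fever.",
                 "Exposure to cold temperatures triggered hemolysis in the patient."))
```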
Affiliation(s)
- Hossein Rouhizadeh: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Irina Nikishina: Department of Informatics, University of Hamburg, Hamburg, Germany
- Anthony Yazdani: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Alban Bornet: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Boya Zhang: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Julien Ehrsam: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Christophe Gaudet-Blavignac: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland; Division of Medical Information Sciences, Diagnostic Department, Geneva University Hospitals, Geneva, Switzerland
- Nona Naderi: Laboratoire Interdisciplinaire des Sciences du Numerique, CNRS, Paris-Saclay University, Orsay, France
- Douglas Teodoro: Department of Radiology and Medical Informatics, Faculty of Medicine, University of Geneva, Geneva, Switzerland

38
Alamro H, Gojobori T, Essack M, Gao X. BioBBC: a multi-feature model that enhances the detection of biomedical entities. Sci Rep 2024; 14:7697. [PMID: 38565624 PMCID: PMC10987643 DOI: 10.1038/s41598-024-58334-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/31/2023] [Accepted: 03/27/2024] [Indexed: 04/04/2024] Open
Abstract
The rapid increase in biomedical publications necessitates efficient systems to automatically handle Biomedical Named Entity Recognition (BioNER) tasks in unstructured text. However, accurately detecting biomedical entities is quite challenging due to the complexity of their names and the frequent use of abbreviations. In this paper, we propose BioBBC, a deep learning (DL) model that utilizes multi-feature embeddings and is constructed based on the BERT-BiLSTM-CRF to address the BioNER task. BioBBC consists of three main layers; an embedding layer, a Long Short-Term Memory (Bi-LSTM) layer, and a Conditional Random Fields (CRF) layer. BioBBC takes sentences from the biomedical domain as input and identifies the biomedical entities mentioned within the text. The embedding layer generates enriched contextual representation vectors of the input by learning the text through four types of embeddings: part-of-speech tags (POS tags) embedding, char-level embedding, BERT embedding, and data-specific embedding. The BiLSTM layer produces additional syntactic and semantic feature representations. Finally, the CRF layer identifies the best possible tag sequence for the input sentence. Our model is well-constructed and well-optimized for detecting different types of biomedical entities. Based on experimental results, our model outperformed state-of-the-art (SOTA) models with significant improvements based on six benchmark BioNER datasets.
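A compressed sketch of the BERT-BiLSTM tagging stack; the CRF layer and the additional POS, character, and data-specific embeddings described in the abstract are omitted for brevity.

```python
# BERT encoder followed by a BiLSTM and a linear emission layer for token tagging.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BertBiLSTMTagger(nn.Module):
    def __init__(self, encoder_name: str, num_tags: int, lstm_hidden: int = 256):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        self.bilstm = nn.LSTM(self.encoder.config.hidden_size, lstm_hidden,
                              batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * lstm_hidden, num_tags)

    def forward(self, input_ids, attention_mask):
        hidden = self.encoder(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        lstm_out, _ = self.bilstm(hidden)
        return self.classifier(lstm_out)        # (batch, seq_len, num_tags) emission scores

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
model = BertBiLSTMTagger("bert-base-cased", num_tags=5)
batch = tokenizer(["Aspirin inhibits platelet aggregation."], return_tensors="pt")
print(model(batch["input_ids"], batch["attention_mask"]).shape)
```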
Collapse
Affiliation(s)
- Hind Alamro
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- College of Computing, Umm Al-Qura University, Mecca, Saudi Arabia
| | - Takashi Gojobori
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia
| | - Magbubah Essack
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
| | - Xin Gao
- Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
- Computational Bioscience Research Center (CBRC), King Abdullah University of Science and Technology (KAUST), Thuwal, Saudi Arabia.
| |
Collapse
|
39
|
Zhou X, Fu Q, Xia Y, Wang Y, Lu Y, Chen Y, Chen J. LoGo-GR: A Local to Global Graphical Reasoning Framework for Extracting Structured Information From Biomedical Literature. IEEE J Biomed Health Inform 2024; 28:2314-2325. [PMID: 38265897 DOI: 10.1109/jbhi.2024.3358169] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/26/2024]
Abstract
In the biomedical literature, entities are often distributed across multiple sentences and exhibit complex interactions. As the volume of literature has increased dramatically, it has become impractical to manually extract and maintain biomedical knowledge, which would entail enormous costs. Fortunately, document-level relation extraction can capture associations between entities in complex text, helping researchers efficiently mine structured knowledge from the vast medical literature. However, effectively synthesizing rich global information from context while accurately capturing local dependencies between entities remains a great challenge. In this paper, we propose a Local to Global Graphical Reasoning framework (LoGo-GR) based on a novel Biased Graph Attention mechanism (B-GAT). It learns global context features from a mention-level interaction graph and local relation-path dependencies from an entity-level path graph, and it combines global and local reasoning to capture complex interactions between entities in document-level text. In particular, B-GAT integrates structural dependencies into the standard graph attention mechanism (GAT) as attention biases to adaptively guide information aggregation during graphical reasoning. We evaluate our method on three public biomedical document-level datasets: Drug-Mutation Interaction (DV), Chemical-induced Disease (CDR), and Gene-Disease Association (GDA). LoGo-GR shows advanced and stable performance compared to other state-of-the-art methods: it achieves state-of-the-art performance with 96.14%-97.39% F1 on the DV dataset, and advanced performance with 68.89% F1 and 84.22% F1 on the CDR and GDA datasets, respectively. In addition, LoGo-GR also shows advanced performance on the general-domain document-level relation extraction dataset DocRED, demonstrating that it is an effective and robust document-level relation extraction framework.
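The core idea of the biased attention mechanism, adding a learned structural bias to the standard GAT attention logit, can be sketched as follows. The shapes, edge-type lookup, and bias table are illustrative assumptions rather than the paper's exact B-GAT formulation; in a full layer the biased logits would be softmax-normalized over each node's neighbours before aggregation.

```python
# A minimal sketch of a biased graph attention score: a learned bias for each
# structural dependency type is added to the standard GAT attention logit.
import torch
import torch.nn.functional as F

def biased_attention(h_i, h_j, a, edge_type, bias_table):
    """h_i, h_j: node features of shape (d,); a: attention vector of shape (2d,);
    edge_type: integer id of the structural dependency; bias_table: tensor of shape (num_types,)."""
    logit = F.leaky_relu(torch.dot(a, torch.cat([h_i, h_j])))  # standard GAT score
    return logit + bias_table[edge_type]                       # structural bias term
```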
Collapse
|
40
|
Keloth VK, Hu Y, Xie Q, Peng X, Wang Y, Zheng A, Selek M, Raja K, Wei CH, Jin Q, Lu Z, Chen Q, Xu H. Advancing entity recognition in biomedicine via instruction tuning of large language models. Bioinformatics 2024; 40:btae163. [PMID: 38514400 PMCID: PMC11001490 DOI: 10.1093/bioinformatics/btae163] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2023] [Revised: 02/18/2024] [Accepted: 03/19/2024] [Indexed: 03/23/2024] Open
Abstract
MOTIVATION Large Language Models (LLMs) have the potential to revolutionize the field of Natural Language Processing, excelling not only in text generation and reasoning tasks but also in zero-/few-shot learning, swiftly adapting to new tasks with minimal fine-tuning. LLMs have also demonstrated great promise in biomedical and healthcare applications. However, when it comes to Named Entity Recognition (NER), particularly within the biomedical domain, LLMs fall short of the effectiveness exhibited by fine-tuned domain-specific models. One key reason is that NER is typically conceptualized as a sequence labeling task, whereas LLMs are optimized for text generation and reasoning. RESULTS We developed an instruction-based learning paradigm that transforms biomedical NER from a sequence labeling task into a generation task. This paradigm is end-to-end and streamlines training and evaluation by automatically repurposing pre-existing biomedical NER datasets. We further developed BioNER-LLaMA using the proposed paradigm with LLaMA-7B as the foundational LLM. We conducted extensive testing of BioNER-LLaMA on three widely recognized biomedical NER datasets covering entities related to diseases, chemicals, and genes. The results revealed that BioNER-LLaMA consistently achieved F1-scores 5%-30% higher than those of few-shot GPT-4 on datasets with different biomedical entities. We show that a general-domain LLM can match the performance of rigorously fine-tuned PubMedBERT models and PMC-LLaMA, a biomedical-specific language model. Our findings underscore the potential of the proposed paradigm for developing general-domain LLMs that can rival SOTA performance in multi-task, multi-domain scenarios in biomedical and health applications. AVAILABILITY AND IMPLEMENTATION Datasets and other resources are available at https://github.com/BIDS-Xu-Lab/BioNER-LLaMA.
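The recasting of sequence labeling as generation can be illustrated with a small conversion routine: a BIO-tagged sentence becomes an (instruction, response) pair suitable for instruction tuning. The prompt template and output format below are hypothetical, not the exact ones used to train BioNER-LLaMA.

```python
# A minimal sketch of turning a BIO-tagged NER example into an instruction-tuning pair.
# The prompt wording is an illustrative template, not the BioNER-LLaMA template.
def to_instruction_pair(tokens, bio_tags, entity_type="Disease"):
    spans, current = [], []
    for tok, tag in zip(tokens, bio_tags):
        if tag.startswith("B-"):
            if current:
                spans.append(" ".join(current))
            current = [tok]
        elif tag.startswith("I-") and current:
            current.append(tok)
        else:
            if current:
                spans.append(" ".join(current))
            current = []
    if current:
        spans.append(" ".join(current))

    instruction = (f"Extract all {entity_type} mentions from the sentence below. "
                   f"Return them as a semicolon-separated list.\n"
                   f"Sentence: {' '.join(tokens)}")
    response = "; ".join(spans) if spans else "None"
    return instruction, response

pair = to_instruction_pair(
    ["Metformin", "reduced", "fasting", "glucose", "in", "type", "2", "diabetes", "."],
    ["O", "O", "O", "O", "O", "B-Disease", "I-Disease", "I-Disease", "O"],
)  # response: "type 2 diabetes"
```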
Collapse
Affiliation(s)
- Vipina K Keloth
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Yan Hu
- McWilliams School of Biomedical Informatics, University of Texas Health Science at Houston, Houston, TX-77030, United States
| | - Qianqian Xie
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Xueqing Peng
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Yan Wang
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Andrew Zheng
- William P. Clements High School, Sugar Land, TX-77479, United States
| | - Melih Selek
- Stephen F. Austin High School, Sugar Land, TX-77498, United States
| | - Kalpana Raja
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| | - Chih Hsuan Wei
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Qiao Jin
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Zhiyong Lu
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Qingyu Chen
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
- National Center for Biotechnology Information, National Library of Medicine, National Institutes of Health, Bethesda, MD-20894, United States
| | - Hua Xu
- Section of Biomedical Informatics and Data Science, School of Medicine, Yale University, New Haven, CT-06510, United States
| |
Collapse
|
41
|
Irrera O, Marchesin S, Silvello G. MetaTron: advancing biomedical annotation empowering relation annotation and collaboration. BMC Bioinformatics 2024; 25:112. [PMID: 38486137 PMCID: PMC10941452 DOI: 10.1186/s12859-024-05730-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2023] [Accepted: 03/04/2024] [Indexed: 03/17/2024] Open
Abstract
BACKGROUND The constant growth of biomedical data is accompanied by the need for new methodologies to effectively and efficiently extract machine-readable knowledge for training and testing purposes. A crucial aspect in this regard is creating large annotated corpora, often manually or semi-manually, which are vital for developing effective and efficient methods for tasks like relation extraction, topic recognition, and entity linking. However, manual annotation is expensive and time-consuming, especially when not assisted by interactive, intuitive, and collaborative computer-aided tools. To support healthcare experts in the annotation process and foster the creation of annotated corpora, we present MetaTron. MetaTron is an open-source and free-to-use web-based annotation tool for annotating biomedical data interactively and collaboratively; it supports both mention-level and document-level annotations and integrates automatic built-in predictions. Moreover, MetaTron enables relation annotation with the support of ontologies, a functionality often overlooked by off-the-shelf annotation tools. RESULTS We conducted a qualitative analysis comparing MetaTron with a set of manual annotation tools, including TeamTat, INCEpTION, LightTag, MedTAG, and brat, on three sets of criteria: technical, data, and functional. A quantitative evaluation allowed us to assess MetaTron's performance in terms of the time and number of clicks required to annotate a set of documents. The results indicated that MetaTron fulfills almost all the selected criteria and achieves the best performance. CONCLUSIONS MetaTron stands out as one of the few annotation tools targeting the biomedical domain that support the annotation of relations, and it is fully customizable with documents in several formats (PDF included) as well as abstracts retrieved from PubMed, Semantic Scholar, and OpenAIRE. To meet any user need, we released MetaTron both as an online instance and as a locally deployable Docker image.
Collapse
Affiliation(s)
- Ornella Irrera
- Department of Information Engineering, University of Padova, Padua, Italy.
| | - Stefano Marchesin
- Department of Information Engineering, University of Padova, Padua, Italy
| | - Gianmaria Silvello
- Department of Information Engineering, University of Padova, Padua, Italy
| |
Collapse
|
42
|
Caufield JH, Hegde H, Emonet V, Harris NL, Joachimiak MP, Matentzoglu N, Kim H, Moxon S, Reese JT, Haendel MA, Robinson PN, Mungall CJ. Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES): a method for populating knowledge bases using zero-shot learning. Bioinformatics 2024; 40:btae104. [PMID: 38383067 PMCID: PMC10924283 DOI: 10.1093/bioinformatics/btae104] [Citation(s) in RCA: 16] [Impact Index Per Article: 16.0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2023] [Revised: 12/16/2023] [Accepted: 02/20/2024] [Indexed: 02/23/2024] Open
Abstract
MOTIVATION Creating knowledge bases and ontologies is a time-consuming task that relies on manual curation. AI/NLP approaches can assist expert curators in populating these knowledge bases, but current approaches rely on extensive training data and are not able to populate arbitrarily complex nested knowledge schemas. RESULTS Here we present Structured Prompt Interrogation and Recursive Extraction of Semantics (SPIRES), a knowledge extraction approach that relies on the ability of Large Language Models (LLMs) to perform zero-shot learning and general-purpose query answering from flexible prompts and to return information conforming to a specified schema. Given a detailed, user-defined knowledge schema and an input text, SPIRES recursively performs prompt interrogation against an LLM to obtain a set of responses matching the provided schema. SPIRES uses existing ontologies and vocabularies to provide identifiers for matched elements. We present examples of applying SPIRES in different domains, including extraction of food recipes, multi-species cellular signaling pathways, disease treatments, multi-step drug mechanisms, and chemical-to-disease relationships. Current SPIRES accuracy is comparable to the mid-range of existing Relation Extraction methods, but it greatly surpasses an LLM's native capability of grounding entities with unique identifiers. SPIRES has the advantage of easy customization, flexibility, and, crucially, the ability to perform new tasks in the absence of any new training data. This method supports a general strategy of leveraging the language-interpreting capabilities of LLMs to assemble knowledge bases, assisting manual knowledge curation and acquisition while supporting validation with publicly available databases and ontologies external to the LLM. AVAILABILITY AND IMPLEMENTATION SPIRES is available as part of the open-source OntoGPT package: https://github.com/monarch-initiative/ontogpt.
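A minimal sketch of the recursive, schema-guided interrogation idea is shown below: each field of a user-defined schema triggers a prompt to an LLM, and nested sub-schemas are filled recursively. The schema layout and the `call_llm` stub are assumptions for illustration and do not reflect the OntoGPT implementation, which additionally grounds matched elements to ontology identifiers.

```python
# A minimal sketch of schema-guided prompt interrogation in the spirit of SPIRES.
# The schema and the LLM stub are illustrative assumptions, not OntoGPT code.
from typing import Any

SCHEMA = {
    "treatment": {
        "drug": "str",
        "disease": "str",
        "mechanism": {"target": "str", "effect": "str"},  # nested sub-schema
    }
}

def call_llm(prompt: str) -> str:
    # Placeholder: wire a real LLM client in here; the sketch just returns a dummy answer.
    return "placeholder"

def extract(schema: dict, text: str) -> dict[str, Any]:
    record: dict[str, Any] = {}
    for field, spec in schema.items():
        if isinstance(spec, dict):          # recurse into nested sub-schemas
            record[field] = extract(spec, text)
        else:
            prompt = f"From the text below, state the value of '{field}' only.\n\n{text}"
            record[field] = call_llm(prompt).strip()
    return record

result = extract(SCHEMA, "Metformin treats type 2 diabetes by activating AMPK.")
```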
Collapse
Affiliation(s)
- J Harry Caufield
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Harshad Hegde
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Vincent Emonet
- Institute of Data Science, Faculty of Science and Engineering, Maastricht University, 6200 MD Maastricht, The Netherlands
| | - Nomi L Harris
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Marcin P Joachimiak
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | | | - HyeongSik Kim
- Robert Bosch LLC, Sunnyvale, CA 94085, United States
| | - Sierra Moxon
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Justin T Reese
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| | - Melissa A Haendel
- Department of Biomedical Informatics, University of Colorado, Anschutz Medical Campus, Aurora, CO 80217, United States
| | | | - Christopher J Mungall
- Biosystems Data Science, Division of Environmental Genomics and Systems Biology, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, United States
| |
Collapse
|
43
|
Dagdelen J, Dunn A, Lee S, Walker N, Rosen AS, Ceder G, Persson KA, Jain A. Structured information extraction from scientific text with large language models. Nat Commun 2024; 15:1418. [PMID: 38360817 PMCID: PMC10869356 DOI: 10.1038/s41467-024-45563-x] [Citation(s) in RCA: 55] [Impact Index Per Article: 55.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2023] [Accepted: 01/22/2024] [Indexed: 02/17/2024] Open
Abstract
Extracting structured knowledge from scientific text remains a challenging task for machine learning models. Here, we present a simple approach to joint named entity recognition and relation extraction and demonstrate how pretrained large language models (GPT-3, Llama-2) can be fine-tuned to extract useful records of complex scientific knowledge. We test three representative tasks in materials chemistry: linking dopants and host materials, cataloging metal-organic frameworks, and general composition/phase/morphology/application information extraction. Records are extracted from single sentences or entire paragraphs, and the output can be returned as simple English sentences or a more structured format such as a list of JSON objects. This approach represents a simple, accessible, and highly flexible route to obtaining large databases of structured specialized scientific knowledge extracted from research papers.
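A record for this kind of fine-tuning can be sketched as a simple prompt/completion pair whose completion is a JSON-serialized list of extracted relationships. The field names and example sentence below are illustrative assumptions, not the authors' exact schema.

```python
# A minimal sketch of a prompt/completion record for fine-tuning an LLM to emit
# structured JSON from a sentence, in the spirit of the approach described above.
import json

sentence = "Nb-doped TiO2 thin films showed enhanced electrical conductivity."
record = {
    "prompt": f"Extract dopant/host relationships as JSON.\nText: {sentence}\nJSON:",
    "completion": json.dumps([{"host": "TiO2", "dopant": "Nb"}]),
}
print(record["prompt"])
print(record["completion"])
```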
Collapse
Affiliation(s)
- John Dagdelen
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Alexander Dunn
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Sanghoon Lee
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | | | - Andrew S Rosen
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Gerbrand Ceder
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Kristin A Persson
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA
- Materials Science and Engineering Department, University of California, Berkeley, CA, USA
| | - Anubhav Jain
- Lawrence Berkeley National Laboratory, Berkeley, CA, USA.
| |
Collapse
|
44
|
Azam M, Chen Y, Arowolo MO, Liu H, Popescu M, Xu D. A Comprehensive Evaluation of Large Language Models in Mining Gene Interactions and Pathway Knowledge. BIORXIV : THE PREPRINT SERVER FOR BIOLOGY 2024:2024.01.21.576542. [PMID: 38328046 PMCID: PMC10849485 DOI: 10.1101/2024.01.21.576542] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/09/2024]
Abstract
Background Understanding complex biological pathways, including gene-gene interactions and gene regulatory networks, is critical for exploring disease mechanisms and drug development. Manual literature curation of biological pathways is useful but cannot keep up with the exponential growth of the literature. Large-scale language models (LLMs), notable for their vast parameter sizes and comprehensive training on extensive text corpora, have great potential for automated text mining of biological pathways. Method This study assesses the effectiveness of 21 LLMs, including both API-based models and open-source models. The evaluation focused on two key aspects: gene regulatory relations (specifically, 'activation', 'inhibition', and 'phosphorylation') and KEGG pathway component recognition. The performance of these models was analyzed using statistical metrics such as precision, recall, F1 scores, and the Jaccard similarity index. Results Our results indicated a significant disparity in model performance. Among the API-based models, ChatGPT-4 and Claude-Pro showed superior performance, with F1 scores of 0.4448 and 0.4386 for gene regulatory relation prediction, and Jaccard similarity indices of 0.2778 and 0.2657 for KEGG pathway prediction, respectively. Open-source models lagged behind their API-based counterparts; Falcon-180b-chat and llama1-7b led with the highest performance in gene regulatory relations (F1 of 0.2787 and 0.1923, respectively) and KEGG pathway recognition (Jaccard similarity index of 0.2237 and 0.2207, respectively). Conclusion LLMs are valuable in biomedical research, especially in gene network analysis and pathway mapping. However, their effectiveness varies, necessitating careful model selection. This work also provides a case study and insights into using LLMs as knowledge graphs.
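For readers unfamiliar with the two headline metrics, the sketch below shows how a Jaccard similarity index over pathway gene sets and a micro-F1 over relation predictions could be computed; the gene names are made-up examples and the code is not the evaluation script used in the study.

```python
# A minimal sketch of the two metrics named above: Jaccard similarity between a
# predicted and a reference pathway gene set, and micro-F1 from count statistics.
def jaccard(pred: set, gold: set) -> float:
    return len(pred & gold) / len(pred | gold) if pred | gold else 1.0

def micro_f1(tp: int, fp: int, fn: int) -> float:
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

print(jaccard({"TP53", "MDM2", "CDKN1A"}, {"TP53", "MDM2", "ATM"}))  # 0.5
```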
Collapse
Affiliation(s)
- Muhammad Azam
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
| | - Yibo Chen
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
| | - Micheal Olaolu Arowolo
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
| | - Haowang Liu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
| | - Mihail Popescu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
- Department of Biomedical Informatics, Biostatistics and Medical Epidemiology, University of Missouri, Columbia, Missouri, USA
| | - Dong Xu
- Department of Electrical Engineering and Computer Science, University of Missouri, Columbia, Missouri, USA
- Bond Life Sciences Center, University of Missouri, Columbia, Missouri, USA
- Institute for Data Science and Informatics, University of Missouri, Columbia, Missouri, USA
| |
Collapse
|
45
|
Shao L, Chen B, Zhang Z, Zhang Z, Chen X. Artificial intelligence generated content (AIGC) in medicine: A narrative review. MATHEMATICAL BIOSCIENCES AND ENGINEERING : MBE 2024; 21:1672-1711. [PMID: 38303483 DOI: 10.3934/mbe.2024073] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/03/2024]
Abstract
Recently, artificial intelligence generated content (AIGC) has been receiving increased attention and is growing exponentially. AIGC is produced by generative artificial intelligence (AI) models based on the intentional information extracted from human-provided instructions, and it can quickly and automatically generate large amounts of high-quality content. Medicine currently faces a shortage of resources and involves complex procedures, problems that AIGC can help alleviate owing to these characteristics. As a result, the application of AIGC in medicine has gained increased attention in recent years. Therefore, this paper provides a comprehensive review of recent studies involving AIGC in medicine. First, we present an overview of AIGC. Then, based on recent studies, the application of AIGC in medicine is reviewed from two aspects: medical image processing and medical text generation. The basic generative AI models, tasks, target organs, datasets, and contributions of the studies are considered and summarized. Finally, we discuss the limitations and challenges faced by AIGC and propose possible solutions with reference to relevant studies. We hope this review can help readers understand the potential of AIGC in medicine and obtain innovative ideas in this field.
Collapse
Affiliation(s)
- Liangjing Shao
- Academy for Engineering & Technology, Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China
| | - Benshuang Chen
- Academy for Engineering & Technology, Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China
| | - Ziqun Zhang
- Information office, Fudan University, Shanghai 200032, China
| | - Zhen Zhang
- Baoshan Branch of Ren Ji Hospital, Shanghai Jiao Tong University School of Medicine, Shanghai 200444, China
| | - Xinrong Chen
- Academy for Engineering & Technology, Fudan University, Shanghai 200433, China
- Shanghai Key Laboratory of Medical Image Computing and Computer Assisted Intervention, Fudan University, Shanghai 200032, China
| |
Collapse
|
46
|
Le ND, Nguyen NTH. A metric learning-based method for biomedical entity linking. Front Res Metr Anal 2023; 8:1247094. [PMID: 38173988 PMCID: PMC10762861 DOI: 10.3389/frma.2023.1247094] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2023] [Accepted: 11/29/2023] [Indexed: 01/05/2024] Open
Abstract
Biomedical entity linking is the task of mapping mention(s) that occur in a particular textual context to a unique concept or entity in a knowledge base, e.g., the Unified Medical Language System (UMLS). One of the most challenging aspects of the entity linking task is the ambiguity of mentions, i.e., (1) mentions whose surface forms are very similar but which map to different entities in different contexts, and (2) entities that can be expressed using diverse types of mentions. Recent studies have used BERT-based encoders to encode mentions and entities into distinguishable representations such that their similarity can be measured using distance metrics. However, most real-world biomedical datasets suffer from severe imbalance, i.e., some classes have many instances while others appear only once or are completely absent from the training data. A common way to address this issue is to down-sample the dataset, i.e., to reduce the number of instances of the majority classes to make the dataset more balanced. In the context of entity linking, however, down-sampling reduces the model's ability to comprehensively learn the representations of mentions in different contexts, which is very important. To tackle this issue, we propose a metric-based learning method that treats a given entity and its mentions as a whole, regardless of the number of mentions in the training set. Specifically, our method uses a triplet loss-based function in conjunction with a clustering technique to learn the representations of mentions and entities. Through evaluations on two challenging biomedical datasets, MedMentions and BC5CDR, we show that our proposed method is able to address the issue of imbalanced data and perform competitively with other state-of-the-art models. Moreover, our method significantly reduces computational cost in both the training and inference steps. Our source code is publicly available here.
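The triplet objective at the heart of the method can be sketched as follows: a mention embedding is pulled toward its gold entity embedding and pushed away from a negative entity embedding by at least a margin. The margin value and similarity measure are illustrative assumptions, not the paper's exact configuration.

```python
# A minimal sketch of a triplet objective for metric learning in entity linking.
import torch
import torch.nn.functional as F

def triplet_loss(mention, pos_entity, neg_entity, margin: float = 0.2):
    """mention, pos_entity, neg_entity: embedding tensors of matching shape."""
    d_pos = 1.0 - F.cosine_similarity(mention, pos_entity, dim=-1)  # distance to gold entity
    d_neg = 1.0 - F.cosine_similarity(mention, neg_entity, dim=-1)  # distance to negative entity
    return torch.clamp(d_pos - d_neg + margin, min=0.0).mean()
```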
Collapse
Affiliation(s)
- Ngoc D. Le
- Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
- Vietnam National University, Ho Chi Minh City, Vietnam
| | - Nhung T. H. Nguyen
- Department of Computer Science, School of Engineering, University of Manchester, Manchester, United Kingdom
| |
Collapse
|
47
|
Kosonocky CW, Wilke CO, Marcotte EM, Ellington AD. Mining Patents with Large Language Models Elucidates the Chemical Function Landscape. ARXIV 2023:arXiv:2309.08765v2. [PMID: 38196747 PMCID: PMC10775343] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Subscribe] [Scholar Register] [Indexed: 01/11/2024]
Abstract
The fundamental goal of small molecule discovery is to generate chemicals with target functionality. While this often proceeds through structure-based methods, we set out to investigate the practicality of orthogonal methods that leverage the extensive corpus of chemical literature. We hypothesize that a sufficiently large text-derived chemical function dataset would mirror the actual landscape of chemical functionality. Such a landscape would implicitly capture complex physical and biological interactions given that chemical function arises from both a molecule's structure and its interacting partners. To evaluate this hypothesis, we built a Chemical Function (CheF) dataset of patent-derived functional labels. This dataset, comprising 631K molecule-function pairs, was created using an LLM- and embedding-based method to obtain functional labels for approximately 100K molecules from their corresponding 188K unique patents. We carry out a series of analyses demonstrating that the CheF dataset contains a semantically coherent textual representation of the functional landscape congruent with chemical structural relationships, thus approximating the actual chemical function landscape. We then demonstrate that this text-based functional landscape can be leveraged to identify drugs with target functionality using a model able to predict functional profiles from structure alone. We believe that functional label-guided molecular discovery may serve as an orthogonal approach to traditional structure-based methods in the pursuit of designing novel functional molecules.
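One plausible reading of the embedding-based step is that near-duplicate functional labels are consolidated by similarity of their text embeddings; the sketch below illustrates that idea with a toy placeholder embedding and threshold, which are assumptions and not the procedure actually used to build CheF.

```python
# A minimal, assumption-laden sketch of grouping near-duplicate functional labels
# by embedding similarity. The placeholder embedding and threshold are illustrative.
import numpy as np

def embed(label: str) -> np.ndarray:
    # Placeholder embedding (character histogram); swap in a real text-embedding model.
    vec = np.zeros(128)
    for ch in label.lower():
        vec[ord(ch) % 128] += 1.0
    return vec

def group_labels(labels: list[str], threshold: float = 0.9) -> list[list[str]]:
    vectors = [embed(l) for l in labels]
    groups: list[list[int]] = []
    for i, v in enumerate(vectors):
        for group in groups:
            rep = vectors[group[0]]
            sim = float(v @ rep / (np.linalg.norm(v) * np.linalg.norm(rep)))
            if sim >= threshold:          # similar enough: join the existing group
                group.append(i)
                break
        else:                             # no similar group found: start a new one
            groups.append([i])
    return [[labels[i] for i in g] for g in groups]

print(group_labels(["antibacterial agent", "antibacterial agents", "kinase inhibitor"]))
```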
Collapse
|
48
|
Nachtegael C, De Stefani J, Lenaerts T. A study of deep active learning methods to reduce labelling efforts in biomedical relation extraction. PLoS One 2023; 18:e0292356. [PMID: 38100453 PMCID: PMC10723703 DOI: 10.1371/journal.pone.0292356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2023] [Accepted: 09/19/2023] [Indexed: 12/17/2023] Open
Abstract
Automatic biomedical relation extraction (bioRE) is an essential task in biomedical research for generating high-quality labelled data that can be used to develop innovative predictive methods. However, building fully labelled, high-quality bioRE data sets of adequate size for training state-of-the-art relation extraction models is hindered by an annotation bottleneck due to limits on researchers' and curators' time and expertise. We show here how Active Learning (AL) plays an important role in resolving this issue and improving bioRE tasks, effectively overcoming the labelling limits inherent to a data set. Six different AL strategies are benchmarked on seven bioRE data sets, using PubMedBERT as the base model and evaluating their area under the learning curve (AULC) as well as intermediate results measurements. The results demonstrate that uncertainty-based strategies, such as Least-Confident or Margin Sampling, perform statistically better in terms of F1-score, accuracy, and precision than other types of AL strategies. However, in terms of recall, a diversity-based strategy called Core-set outperforms all strategies. AL strategies are shown to reduce the annotation needed to reach performance on par with training on all data by 6% to 38%, depending on the data set, with the Margin Sampling and Least-Confident Sampling strategies moreover obtaining the best AULCs compared to the Random Sampling baseline. Our experiments show the importance of using AL methods to reduce the amount of labelling needed to construct high-quality data sets that lead to optimal performance of deep learning models. The code and data sets to reproduce all results presented in the article are available at https://github.com/oligogenic/Deep_active_learning_bioRE.
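The two best-performing uncertainty-based strategies can be sketched in a few lines: given the model's class probabilities on the unlabelled pool, Least-Confident selects the examples with the lowest top-class probability, and Margin Sampling selects those with the smallest gap between the two most probable classes. The batch size parameter `k` and array layout are illustrative assumptions.

```python
# A minimal sketch of two uncertainty-based active learning strategies named above.
# `probs` is an (n_samples, n_classes) array of model probabilities on the unlabelled pool;
# both functions return indices of the examples to annotate next.
import numpy as np

def least_confident(probs: np.ndarray, k: int) -> np.ndarray:
    confidence = probs.max(axis=1)            # probability of the top class
    return np.argsort(confidence)[:k]         # lowest confidence first

def margin_sampling(probs: np.ndarray, k: int) -> np.ndarray:
    top2 = np.sort(probs, axis=1)[:, -2:]     # two highest class probabilities per example
    margin = top2[:, 1] - top2[:, 0]          # small margin = high uncertainty
    return np.argsort(margin)[:k]
```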
Collapse
Affiliation(s)
- Charlotte Nachtegael
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
| | - Jacopo De Stefani
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Technology, Policy and Management Faculty, Technische Universiteit Delft, Delft, Netherlands
| | - Tom Lenaerts
- Interuniversity Institute of Bioinformatics in Brussels, Université Libre de Bruxelles-Vrije Universiteit Brussel, Bruxelles, Belgium
- Machine Learning Group, Université Libre de Bruxelles, Bruxelles, Belgium
- Artificial Intelligence Laboratory, Vrije Universiteit Brussel, Bruxelles, Belgium
| |
Collapse
|
49
|
Kartchner D, Deng J, Lohiya S, Kopparthi T, Bathala P, Domingo-Fernández D, Mitchell CS. A Comprehensive Evaluation of Biomedical Entity Linking Models. PROCEEDINGS OF THE CONFERENCE ON EMPIRICAL METHODS IN NATURAL LANGUAGE PROCESSING 2023:14462-14478. [PMID: 38756862 PMCID: PMC11097978 DOI: 10.18653/v1/2023.emnlp-main.893] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/18/2024]
Abstract
Biomedical entity linking (BioEL) is the process of connecting entities referenced in documents to entries in biomedical databases such as the Unified Medical Language System (UMLS) or Medical Subject Headings (MeSH). The study objective was to comprehensively evaluate nine recent state-of-the-art biomedical entity linking models under a unified framework. We compare these models along axes of (1) accuracy, (2) speed, (3) ease of use, (4) generalization, and (5) adaptability to new ontologies and datasets. We additionally quantify the impact of various preprocessing choices such as abbreviation detection. Systematic evaluation reveals several notable gaps in current methods. In particular, current methods struggle to correctly link genes and proteins and often have difficulty effectively incorporating context into linking decisions. To expedite future development and baseline testing, we release our unified evaluation framework and all included models on GitHub at https://github.com/davidkartchner/biomedical-entity-linking.
Collapse
|
50
|
Peng C, Yang X, Chen A, Smith KE, PourNejatian N, Costa AB, Martin C, Flores MG, Zhang Y, Magoc T, Lipori G, Mitchell DA, Ospina NS, Ahmed MM, Hogan WR, Shenkman EA, Guo Y, Bian J, Wu Y. A study of generative large language model for medical research and healthcare. NPJ Digit Med 2023; 6:210. [PMID: 37973919 PMCID: PMC10654385 DOI: 10.1038/s41746-023-00958-w] [Citation(s) in RCA: 86] [Impact Index Per Article: 43.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/05/2023] [Accepted: 11/01/2023] [Indexed: 11/19/2023] Open
Abstract
There is enormous enthusiasm, as well as concern, about applying large language models (LLMs) to healthcare. Yet current assumptions are based on general-purpose LLMs such as ChatGPT, which were not developed for medical use. This study develops a generative clinical LLM, GatorTronGPT, using 277 billion words of text, including (1) 82 billion words of clinical text from 126 clinical departments and approximately 2 million patients at University of Florida Health and (2) 195 billion words of diverse general English text. We train GatorTronGPT using a GPT-3 architecture with up to 20 billion parameters and evaluate its utility for biomedical natural language processing (NLP) and healthcare text generation. GatorTronGPT improves biomedical natural language processing. We apply GatorTronGPT to generate 20 billion words of synthetic text. NLP models trained using synthetic text generated by GatorTronGPT outperform models trained using real-world clinical text. A physicians' Turing test using a 1 (worst) to 9 (best) scale shows no significant differences in linguistic readability (p = 0.22; 6.57 for GatorTronGPT compared with 6.93 for human text) or clinical relevance (p = 0.91; 7.0 for GatorTronGPT compared with 6.97 for human text), and physicians cannot differentiate them (p < 0.001). This study provides insights into the opportunities and challenges of LLMs for medical research and healthcare.
Collapse
Affiliation(s)
- Cheng Peng
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Xi Yang
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | - Aokun Chen
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | | | | | | | | | | | - Ying Zhang
- Research Computing, University of Florida, Gainesville, FL, USA
| | - Tanja Magoc
- Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA
| | - Gloria Lipori
- Integrated Data Repository Research Services, University of Florida, Gainesville, FL, USA
- Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
| | - Duane A Mitchell
- Lillian S. Wells Department of Neurosurgery, Clinical and Translational Science Institute, University of Florida, Gainesville, FL, USA
| | - Naykky S Ospina
- Division of Endocrinology, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Mustafa M Ahmed
- Division of Cardiovascular Medicine, Department of Medicine, College of Medicine, University of Florida, Gainesville, FL, USA
| | - William R Hogan
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Elizabeth A Shenkman
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
| | - Yi Guo
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | - Jiang Bian
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA
| | - Yonghui Wu
- Department of Health Outcomes and Biomedical Informatics, College of Medicine, University of Florida, Gainesville, FL, USA.
- Cancer Informatics Shared Resource, University of Florida Health Cancer Center, Gainesville, FL, USA.
| |
Collapse
|