1
Crema C, Verde F, Tiraboschi P, Marra C, Arighi A, Fostinelli S, Giuffre GM, Maschio VPD, L'Abbate F, Solca F, Poletti B, Silani V, Rotondo E, Borracci V, Vimercati R, Crepaldi V, Inguscio E, Filippi M, Caso F, Rosati AM, Quaranta D, Binetti G, Pagnoni I, Morreale M, Burgio F, Maserati MS, Capellari S, Pardini M, Girtler N, Piras F, Piras F, Lalli S, Perdixi E, Lombardi G, Tella SD, Costa A, Capelli M, Fundaro C, Manera M, Muscio C, Pellencin E, Lodi R, Tagliavini F, Redolfi A. Medical Information Extraction With NLP-Powered QABots: A Real-World Scenario. IEEE J Biomed Health Inform 2024; 28:6906-6917. [PMID: 39190519 DOI: 10.1109/jbhi.2024.3450118] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/29/2024]
Abstract
The advent of computerized medical recording systems in healthcare facilities has made data retrieval tasks easier, compared to manual recording. Nevertheless, the potential of the information contained within medical records remains largely untapped, mostly due to the time and effort required to extract data from unstructured documents. Natural Language Processing (NLP) represents a promising solution to this challenge, as it enables the use of automated text-mining tools for clinical practitioners. In this work, we present the architecture of the Virtual Dementia Institute (IVD), a consortium of sixteen Italian hospitals, using the NLP Extraction and Management Tool (NEMT), a (semi-) automated end-to-end pipeline that extracts relevant information from clinical documents and stores it in a centralized REDCap database. After defining a common Case Report Form (CRF) across the IVD hospitals, we implemented NEMT, the core of which is a Question Answering Bot (QABot) based on a modern NLP model. This QABot is fine-tuned on thousands of examples from IVD centers. Detailed descriptions of the process to define a common minimum dataset, Inter-Annotator Agreement calculated on clinical documents, and NEMT results are provided. The best QABot performance shows an Exact Match score (EM) of 78.1%, an F1-score of 84.7%, a Lenient Accuracy (LAcc) of 0.834, and a Mean Reciprocal Rank (MRR) of 0.810. EM and F1 scores outperform the same metrics obtained with ChatGPTv3.5 (68.9% and 52.5%, respectively). With NEMT, the IVD has been able to populate a database that will contain data from thousands of Italian patients, all screened with the same procedure. NEMT represents an efficient tool that paves the way for medical information extraction and exploitation for new research studies.
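The evaluation metrics named in this abstract can be illustrated with a minimal sketch. It assumes the common SQuAD-style conventions (Exact Match after text normalization, token-overlap F1) and the standard reciprocal-rank definition. The abstract does not state the paper's own normalization rules or how Lenient Accuracy is defined, so the function names and the normalization below are hypothetical and LAcc is left out.

```python
from collections import Counter
import string


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def mean_reciprocal_rank(ranked_candidates, golds) -> float:
    """Average of 1/rank of the first correct candidate (0 if none is correct)."""
    total = 0.0
    for candidates, gold in zip(ranked_candidates, golds):
        for rank, candidate in enumerate(candidates, start=1):
            if normalize(candidate) == normalize(gold):
                total += 1.0 / rank
                break
    return total / len(golds)
```

Corpus-level EM and F1 percentages such as those reported above would then be the mean of the per-question scores over the test set.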
2
Murugan M, Yuan B, Venner E, Ballantyne CM, Robinson KM, Coons JC, Wang L, Empey PE, Gibbs RA. Empowering personalized pharmacogenomics with generative AI solutions. J Am Med Inform Assoc 2024; 31:1356-1366. [PMID: 38447590 PMCID: PMC11105140 DOI: 10.1093/jamia/ocae039] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/29/2023] [Revised: 02/06/2024] [Accepted: 02/19/2024] [Indexed: 03/08/2024] Open
Abstract
OBJECTIVE This study evaluates an AI assistant developed using OpenAI's GPT-4 for interpreting pharmacogenomic (PGx) testing results, aiming to improve decision-making and knowledge sharing in clinical genetics and to enhance patient care with equitable access. MATERIALS AND METHODS The AI assistant employs retrieval-augmented generation (RAG), which combines retrieval and generative techniques, by harnessing a knowledge base (KB) that comprises data from the Clinical Pharmacogenetics Implementation Consortium (CPIC). It uses context-aware GPT-4 to generate tailored responses to user queries from this KB, further refined through prompt engineering and guardrails. RESULTS Evaluated against a specialized PGx question catalog, the AI assistant showed high efficacy in addressing user queries. Compared with OpenAI's ChatGPT 3.5, it demonstrated better performance, especially in provider-specific queries requiring specialized data and citations. Key areas for improvement include enhancing accuracy, relevancy, and representative language in responses. DISCUSSION The integration of context-aware GPT-4 with RAG significantly enhanced the AI assistant's utility. RAG's ability to incorporate domain-specific CPIC data, including recent literature, proved beneficial. Challenges persist, such as the need for specialized genetic/PGx models to improve accuracy and relevancy and addressing ethical, regulatory, and safety concerns. CONCLUSION This study underscores generative AI's potential for transforming healthcare provider support and patient accessibility to complex pharmacogenomic information. While careful implementation of large language models like GPT-4 is necessary, it is clear that they can substantially improve understanding of pharmacogenomic data. With further development, these tools could augment healthcare expertise, provider productivity, and the delivery of equitable, patient-centered healthcare services.
Affiliation(s)
- Mullai Murugan
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Bo Yuan
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
- Eric Venner
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
- Christie M Ballantyne
  - Sections of Cardiology and Cardiovascular Research, Department of Medicine, Baylor College of Medicine, Houston, TX, United States
- James C Coons
  - School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, United States
  - Department of Pharmacy, UPMC Presbyterian-Shadyside Hospital, Pittsburgh, PA, United States
- Liwen Wang
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
- Philip E Empey
  - School of Pharmacy, University of Pittsburgh, Pittsburgh, PA, United States
  - Institute for Precision Medicine, UPMC/University of Pittsburgh, Pittsburgh, PA, United States
- Richard A Gibbs
  - Human Genome Sequencing Center, Baylor College of Medicine, Houston, TX, United States
  - Department of Molecular and Human Genetics, Baylor College of Medicine, Houston, TX, United States
3
Zhang Z, Liang X, Zuo Y, Lin C. Improving unsupervised keyphrase extraction by modeling hierarchical multi-granularity features. Inf Process Manag 2023. [DOI: 10.1016/j.ipm.2023.103356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/05/2023]
4
Bai J, Yin C, Wu Z, Zhang J, Wang Y, Jia G, Rong W, Xiong Z. Improving Biomedical ReQA With Consistent NLI-Transfer and Post-Whitening. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:1864-1875. [PMID: 36331640 DOI: 10.1109/tcbb.2022.3219375] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 06/06/2023]
Abstract
Retrieval Question Answering (ReQA) is an essential mechanism of information sharing that aims to find the answer to a posed question among large-scale candidates. Currently, the most efficient solution is the Dual-Encoder, which has shown great potential in the general domain but remains under-studied for biomedical ReQA. Obtaining a robust Dual-Encoder from biomedical datasets is challenging, as the scarce annotated data are not enough to train the model sufficiently, which results in over-fitting. In this work, we first build ReQA BioASQ datasets for retrieving answers to biomedical questions, which can facilitate the corresponding research. On that basis, we propose a framework to solve the over-fitting issue for robust biomedical answer retrieval. Under the proposed framework, we first pre-train the Dual-Encoder on a natural language inference (NLI) task before training on biomedical ReQA, changing the NLI pre-training objective to improve the consistency between NLI and biomedical ReQA, which significantly improves transferability. Moreover, to eliminate the feature redundancies of the Dual-Encoder, consistent post-whitening is proposed to decorrelate the training and trained sentence embeddings. With extensive experiments, the proposed framework achieves promising results and exhibits significant improvement compared with various competitive methods.
5
Yang Y, Lin H, Yang Z, Zhang Y, Zhao D, Huai S. ADPG: Biomedical entity recognition based on Automatic Dependency Parsing Graph. J Biomed Inform 2023; 140:104317. [PMID: 36804374 DOI: 10.1016/j.jbi.2023.104317] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/26/2022] [Revised: 01/19/2023] [Accepted: 02/08/2023] [Indexed: 02/19/2023]
Abstract
Named entity recognition is a key task in text mining. In the biomedical field, entity recognition focuses on extracting key information from large-scale biomedical texts for the downstream information extraction task. Biomedical literature contains a large amount of long-dependent text, and previous studies use external syntactic parsing tools to capture word dependencies in sentences to achieve nested biomedical entity recognition. However, the addition of external parsing tools often introduces unnecessary noise to the current auxiliary task and cannot improve the performance of entity recognition in an end-to-end way. Therefore, we propose a novel automatic dependency parsing approach, namely the ADPG model, to fuse syntactic structure information in an end-to-end way to recognize biomedical entities. Specifically, the method is based on a multilayer Tree-Transformer structure to automatically extract the semantic representation and syntactic structure in long-dependent sentences, and then combines a multilayer graph attention neural network (GAT) to extract the dependency paths between words in the syntactic structure to improve the performance of biomedical entity recognition. We evaluated our ADPG model on three biomedical domain and one news domain datasets, and the experimental results demonstrate that our model achieves state-of-the-art results on these four datasets with certain generalization performance. Our model is released on GitHub: https://github.com/Yumeng-Y/ADPG.
Affiliation(s)
- Yumeng Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
- Hongfei Lin
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
- Zhihao Yang
  - School of Computer Science and Technology, Dalian University of Technology, Dalian, China.
- Yijia Zhang
  - School of Information Science and Technology, Dalian Maritime University, Dalian, China.
- Di Zhao
  - School of Computer Science and Engineering, Dalian Minzu University, Dalian, China.
- Shuaiheng Huai
  - School of Information Science and Technology, Dalian Maritime University, Dalian, China.
6
Chai Z, Jin H, Shi S, Zhan S, Zhuo L, Yang Y, Lian Q. Noise Reduction Learning Based on XLNet-CRF for Biomedical Named Entity Recognition. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:595-605. [PMID: 35259113 DOI: 10.1109/tcbb.2022.3157630] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 05/04/2023]
Abstract
In recent years, Biomedical Named Entity Recognition (BioNER) systems have mainly been based on deep neural networks, which are used to extract information from the rapidly expanding biomedical literature. Long-distance context autoencoding language models based on transformers have recently been employed for BioNER with great success. However, noise interference exists in the process of pre-training and fine-tuning, and there is no effective decoder for label dependency. Current models have many aspects in need of improvement for better performance. We propose two kinds of noise reduction models, Shared Labels and Dynamic Splicing, based on XLNet encoding which is a permutation language pre-training model and decoding by Conditional Random Field (CRF). By testing 15 biomedical named entity recognition datasets, the two models improved the average F1-score by 1.504 and 1.48, respectively, and state-of-the-art performance was achieved on 7 of them. Further analysis proves the effectiveness of the two models and the improvement of the recognition effect of CRF, and suggests the applicable scope of the models according to different data characteristics.
7
Bai J, Yin C, Zhang J, Wang Y, Dong Y, Rong W, Xiong Z. Adversarial Knowledge Distillation Based Biomedical Factoid Question Answering. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2023; 20:106-118. [PMID: 35316189 DOI: 10.1109/tcbb.2022.3161032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/04/2023]
Abstract
Biomedical factoid question answering is an essential application for biomedical information sharing. Recently, neural network based approaches have shown remarkable performance for this task. However, because annotating data requires intensive expert knowledge, annotated data are scarce, and training a robust model on limited-scale biomedical datasets remains a challenge. Previous works address this problem by introducing useful knowledge. It is found that the interaction between question and answer (QA-interaction) is also a kind of knowledge that could help extract answers accurately. This research develops a knowledge distillation framework for biomedical factoid question answering, in which a teacher model, serving as the knowledge source of QA-interaction, is designed to enhance the student model. In addition, to further alleviate the problem of limited-scale datasets, a novel adversarial knowledge distillation technique is proposed to robustly distill the knowledge from the teacher model to the student model by constructing perturbed examples as additional training data. By forcing the student model to mimic the predicted distributions of the teacher model on both original examples and perturbed examples, the knowledge of QA-interaction can be learned by the student model. We evaluate the proposed framework on the widely used BioASQ datasets, and experimental results have shown the proposed method's promising potential.
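The idea of a student mimicking the teacher's predicted distributions on both original and perturbed examples can be sketched as follows. This is a minimal illustration that assumes a temperature-softened KL divergence, a common choice in knowledge distillation. The abstract does not specify the paper's loss, its temperature, or how perturbed examples are constructed, so every name and parameter here is hypothetical.

```python
import math


def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """How far the student's softened distribution is from the teacher's."""
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    return kl_divergence(p_teacher, p_student)


def combined_loss(teacher_orig, student_orig, teacher_pert, student_pert, temperature=2.0):
    """Average the mimicry loss over an original example and its perturbed copy.

    Equal weighting of the two terms is an assumption made for illustration.
    """
    loss_orig = distillation_loss(teacher_orig, student_orig, temperature)
    loss_pert = distillation_loss(teacher_pert, student_pert, temperature)
    return 0.5 * (loss_orig + loss_pert)
```

In an answer-extraction setting, the logits would typically be scores over candidate answer positions, and the loss is zero only when the student reproduces the teacher's distribution on both inputs.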
8
Rashid J, Kim J, Hussain A, Naseem U, Juneja S. A novel multiple kernel fuzzy topic modeling technique for biomedical data. BMC Bioinformatics 2022; 23:275. [PMID: 35820793 PMCID: PMC9277941 DOI: 10.1186/s12859-022-04780-1] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 03/24/2022] [Accepted: 06/08/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Text mining in the biomedical field has received much attention and is regarded as an important research area, since a lot of biomedical data is in text format. Topic modeling is one of the popular text mining methods used to discover hidden semantic structures, so-called topics. However, discovering topics from biomedical data is a challenging task due to its sparsity, redundancy, and unstructured format. METHODS In this paper, we proposed a novel multiple kernel fuzzy topic modeling (MKFTM) technique using fusion probabilistic inverse document frequency and a multiple kernel fuzzy c-means clustering algorithm for biomedical text mining. In detail, the proposed fusion probabilistic inverse document frequency method is used to estimate the weights of global terms, while MKFTM generates frequencies of local and global terms with bag-of-words. In addition, principal component analysis is applied to eliminate higher-order negative effects for term weights. RESULTS Extensive experiments are conducted on six biomedical datasets. MKFTM achieved the highest classification accuracy of 99.04%, 99.62%, 99.69%, 99.61% in the Muchmore Springer dataset and 94.10%, 89.45%, 92.91%, 90.35% in the Ohsumed dataset. The CH index value of MKFTM is higher, which shows that its clustering performance is better than state-of-the-art topic models. CONCLUSION The results confirm that the proposed MKFTM approach handles the sparsity and redundancy problems in biomedical text documents very efficiently. MKFTM discovers semantically relevant topics with high accuracy for biomedical documents. It gives better results for classification and clustering in biomedical documents. MKFTM is a new approach to topic modeling, which has the flexibility to work with a variety of clustering methods.
Affiliation(s)
- Junaid Rashid
  - Department of Computer Science and Engineering, Kongju National University, Cheonan, 31080 Korea
- Jungeun Kim
  - Department of Software, Department of Computer Science and Engineering, Kongju National University, Cheonan, 31080 Korea
- Amir Hussain
  - Data Science and Cyber Analytics Research Group, Edinburgh Napier University, Edinburgh, EH11 4DY UK
- Usman Naseem
  - School of Computer Science, University of Sydney, Sydney, Australia
- Sapna Juneja
  - Department of Computer Science, KIET Group of Institutions, Delhi NCR, Ghaziabad, India