1
Yusuf A, Boyne DJ, O'Sullivan DE, Brenner DR, Cheung WY, Mirza I, Jarada TN. Text analysis framework for identifying mutations among non-small cell lung cancer patients from laboratory data. BMC Med Res Methodol 2024; 24:63. PMID: 38468224. PMCID: PMC10926579. DOI: 10.1186/s12874-024-02192-8.
Abstract
BACKGROUND Laboratory data can provide great value to support research aimed at reducing the incidence, prolonging survival, and enhancing outcomes of cancer. Data is characterized by the information it carries and the format it holds. Data captured in Alberta's biomarker laboratory repository is free text that is cluttered and irregular. Such a format limits the data's utility and prohibits broader adoption and research development. Text analysis for information extraction from unstructured data can change this and lead to more complete analyses. Previous work on extracting relevant information from free-text, unstructured data has employed Natural Language Processing (NLP), Machine Learning (ML), rule-based Information Extraction (IE) methods, or a hybrid combination of them. METHODS In our study, text analysis was performed on Alberta Precision Laboratories data, which consisted of 95,854 entries from the Southern Alberta Dataset (SAD) and 6944 entries from the Northern Alberta Dataset (NAD). The data covers all of Alberta and is completely population-based. Our proposed framework is built around rule-based IE methods. It incorporates lexical and syntax analyses to achieve deterministic extraction of data from biomarker laboratory records (i.e., Epidermal Growth Factor Receptor (EGFR) test results). Lexical analysis comprises data cleaning and pre-processing, conversion of Rich Text Format into readable plain text, and normalization and tokenization of the text. The framework then passes the text into the syntax analysis stage, which includes the rule-based method of extracting relevant data. Rule-based patterns of the test result are identified, and a Context-Free Grammar then generates the rules of information extraction. Finally, the results are linked with the Alberta Cancer Registry to support real-world cancer research studies. RESULTS Of the 5512 SAD entries and 5017 NAD entries that were filtered for EGFR, the framework yielded 5129 and 3388 extracted EGFR test results, respectively. An accuracy of 97.5% was achieved on a random sample of 362 tests. CONCLUSIONS We presented a text analysis framework to extract specific information from unstructured clinical data. Our proposed framework has shown that it can successfully extract relevant information from EGFR test results.
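To make the two-stage design concrete, here is a minimal Python sketch in the spirit of the framework: a lexical stage that normalizes and tokenizes the report, followed by ordered syntax rules that classify the EGFR result. The token sets and result labels are illustrative assumptions, not the authors' actual grammar.

```python
import re

NEGATION = {"no", "not", "negative"}
MUTATION_TERMS = {"mutation", "deletion", "insertion", "substitution", "detected"}

def normalize(text: str) -> list[str]:
    """Lexical stage: lowercase, strip punctuation, tokenize."""
    return re.findall(r"[a-z0-9]+", text.lower())

def extract_egfr_result(report: str) -> str:
    """Syntax stage: ordered rules deterministically classify the result."""
    tokens = set(normalize(report))
    if "egfr" not in tokens:
        return "not tested"
    if NEGATION & tokens:          # negation rule fires first
        return "negative"
    if MUTATION_TERMS & tokens:
        return "positive"
    return "indeterminate"

print(extract_egfr_result("EGFR exon 19 deletion detected."))   # positive
print(extract_egfr_result("No EGFR mutation was identified."))  # negative
```

The real framework generates such rules from a Context-Free Grammar rather than hard-coding them, which is what keeps the extraction deterministic and auditable.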
Affiliation(s)
- Amman Yusuf
  - Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada
- Devon J Boyne
  - Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada
- Dylan E O'Sullivan
  - Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada
- Darren R Brenner
  - Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada
- Winson Y Cheung
  - Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada
- Imran Mirza
  - Alberta Precision Laboratories, Calgary, AB, T2L 2K8, Canada
- Tamer N Jarada
  - Department of Oncology, University of Calgary, Calgary, AB, T2N 4N2, Canada
  - Department of Community Health Sciences, University of Calgary, Calgary, AB, T2N 4Z6, Canada
2
Hu D, Liu B, Zhu X, Lu X, Wu N. Zero-shot information extraction from radiological reports using ChatGPT. Int J Med Inform 2024; 183:105321. PMID: 38157785. DOI: 10.1016/j.ijmedinf.2023.105321.
Abstract
INTRODUCTION Electronic health records contain an enormous amount of valuable information recorded in free text. Information extraction is the strategy for transforming free text into structured data, but some of its components require annotated data to tune, which has become a bottleneck. Large language models achieve good performance on various downstream NLP tasks without parameter tuning, making them a possible way to extract information in a zero-shot manner. METHODS In this study, we aim to explore whether the most popular large language model, ChatGPT, can extract information from radiological reports. We first design a prompt template for the information of interest in the CT reports. Then, we generate prompts by combining the prompt template with the CT reports as the inputs to ChatGPT and obtain the responses. A post-processing module is developed to transform the responses into structured extraction results. In addition, we add prior medical knowledge to the prompt template to reduce erroneous extractions, and we examine the consistency of the extraction results. RESULTS We conducted experiments with 847 real CT reports. The results indicate that ChatGPT can achieve competitive performance on some extraction tasks, such as tumor location and tumor long and short diameters, compared with the baseline information extraction system. Adding prior medical knowledge to the prompt template yields significant improvements on the tumor spiculation and lobulation tasks, but the tumor density and lymph node status tasks do not improve. CONCLUSION ChatGPT can achieve competitive information extraction for radiological reports in a zero-shot manner. Adding prior medical knowledge as instructions can further improve performance on some extraction tasks but may degrade performance on some complex extraction tasks.
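A rough sketch of this prompting setup: fill a template with the report text and ask for a structured answer. The template wording, JSON field names, and model identifier below are assumptions for illustration, not the authors' actual prompts; the call uses the openai Python client (v1 interface).

```python
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

PROMPT_TEMPLATE = (
    "Below is a chest CT report. Extract the requested findings.\n"
    "Report: {report}\n"
    "Answer in JSON with keys tumor_location, long_diameter_mm, "
    "short_diameter_mm; use null for anything not mentioned."
)

def extract(report: str) -> str:
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": PROMPT_TEMPLATE.format(report=report)}],
        temperature=0,  # reduce run-to-run variation, cf. the consistency analysis
    )
    return response.choices[0].message.content  # a post-processing module parses this

print(extract("There is a 23 x 15 mm spiculated nodule in the right upper lobe."))
```

Injecting prior medical knowledge then amounts to appending extra instructions (for example, definitions of spiculation and lobulation) to the template.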
Affiliation(s)
- Danqing Hu
  - Zhejiang Lab, Hangzhou, 311121, Zhejiang, China
- Bing Liu
  - Department of Thoracic Surgery II, Peking University Cancer Hospital and Institute, Beijing, 100142, China
- Xiaofeng Zhu
  - Zhejiang Lab, Hangzhou, 311121, Zhejiang, China
- Xudong Lu
  - College of Biomedical Engineering and Instrumental Science, Zhejiang University, Hangzhou, 310027, Zhejiang, China
- Nan Wu
  - Department of Thoracic Surgery II, Peking University Cancer Hospital and Institute, Beijing, 100142, China
3
C Pereira S, Mendonça AM, Campilho A, Sousa P, Teixeira Lopes C. Automated image label extraction from radiology reports - A review. Artif Intell Med 2024; 149:102814. PMID: 38462277. DOI: 10.1016/j.artmed.2024.102814.
Abstract
Machine Learning models need large amounts of annotated data for training. In the field of medical imaging, labeled data is especially difficult to obtain because the annotations have to be performed by qualified physicians. Natural Language Processing (NLP) tools can be applied to radiology reports to extract labels for medical images automatically. Compared to manual labeling, this approach requires smaller annotation efforts and can therefore facilitate the creation of labeled medical image datasets. In this article, we summarize the literature on this topic spanning from 2013 to 2023, starting with a meta-analysis of the included articles, followed by a qualitative and quantitative systematization of the results. Overall, we found four types of studies on the extraction of labels from radiology reports: those describing systems based on symbolic NLP, those based on statistical NLP, those based on neural NLP, and those combining or comparing two or more of these approaches. Despite the large variety of existing approaches, there is still room for further improvement. This work can contribute to the development of new techniques or the improvement of existing ones.
Affiliation(s)
- Sofia C Pereira
  - Institute for Systems and Computer Engineering, Technology and Science (INESC-TEC), Portugal; Faculty of Engineering of the University of Porto, Portugal
- Ana Maria Mendonça
  - Institute for Systems and Computer Engineering, Technology and Science (INESC-TEC), Portugal; Faculty of Engineering of the University of Porto, Portugal
- Aurélio Campilho
  - Institute for Systems and Computer Engineering, Technology and Science (INESC-TEC), Portugal; Faculty of Engineering of the University of Porto, Portugal
- Pedro Sousa
  - Hospital Center of Vila Nova de Gaia/Espinho, Portugal
- Carla Teixeira Lopes
  - Institute for Systems and Computer Engineering, Technology and Science (INESC-TEC), Portugal; Faculty of Engineering of the University of Porto, Portugal
4
Weissenbacher D, Courtright K, Rawal S, Crane-Droesch A, O'Connor K, Kuhl N, Merlino C, Foxwell A, Haines L, Puhl J, Gonzalez-Hernandez G. Detecting goals of care conversations in clinical notes with active learning. J Biomed Inform 2024; 151:104618. PMID: 38431151. DOI: 10.1016/j.jbi.2024.104618.
Abstract
OBJECTIVE Goals of care (GOC) discussions are an increasingly used quality metric in serious illness care and research. Wide variation in documentation practices within the Electronic Health Record (EHR) presents challenges for reliable measurement of GOC discussions. Novel natural language processing approaches are needed to capture GOC discussions documented in real-world samples of seriously ill hospitalized patients' EHR notes, a corpus with a very low event prevalence. METHODS To automatically detect sentences documenting GOC discussions outside of dedicated GOC note types, we proposed an ensemble of classifiers aggregating the predictions of rule-based, feature-based, and three transformer-based classifiers. We trained our classifier on 600 manually annotated EHR notes from patients with serious illnesses. Our corpus exhibited an extremely imbalanced ratio between sentences discussing GOC and sentences that do not, a ratio that challenges standard supervised training. Therefore, we trained our classifier with active learning. RESULTS Using active learning, we reduced the annotation cost of fine-tuning our ensemble by 70% while improving its performance on our test set of 176 EHR notes, reaching an F1-score of 0.557 for sentence classification and 0.629 for note classification. CONCLUSION When classifying notes, with a true positive rate of 72% (13/18) and a false positive rate of 8% (13/158), our performance may be sufficient for deploying our classifier in the EHR to facilitate bedside clinicians' access to GOC conversations documented outside of dedicated note types, without overburdening clinicians with false positives. Improvements are needed before using it to enrich trial populations or as an outcome measure.
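The core of the active-learning loop is uncertainty sampling over the unlabeled pool. The sketch below shows that loop with a single TF-IDF classifier standing in for the paper's rule-based, feature-based, and transformer ensemble; the sentences and labels are toy stand-ins.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

def most_uncertain(model, X_pool, batch_size):
    """Return pool indices whose predicted probability sits closest to 0.5."""
    proba = model.predict_proba(X_pool)[:, 1]
    return np.argsort(np.abs(proba - 0.5))[:batch_size]

# toy stand-ins: 1 = sentence documents a goals-of-care discussion
labeled = ["discussed goals of care with family", "vitals stable overnight"]
labels = [1, 0]
pool = ["code status reviewed with patient", "chest x-ray unremarkable"]

vectorizer = TfidfVectorizer().fit(labeled + pool)
model = LogisticRegression().fit(vectorizer.transform(labeled), labels)

# one round: route the most ambiguous pool sentences to annotators,
# add the new labels to the training set, refit, and repeat
for i in most_uncertain(model, vectorizer.transform(pool), batch_size=1):
    print(pool[i])
```

With a very low event prevalence, this targeting is what cuts annotation cost: most randomly sampled sentences would be easy negatives that teach the model little.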
Affiliation(s)
- Davy Weissenbacher
  - Department of Computational Biomedicine, Cedars-Sinai Medical Center, West Hollywood, CA, USA
- Katherine Courtright
  - Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Siddharth Rawal
  - DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Andrew Crane-Droesch
  - Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Karen O'Connor
  - DBEI, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Nicholas Kuhl
  - The Department of Medicine, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Corinne Merlino
  - Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Anessa Foxwell
  - NewCourtland Center for Transitions and Health, School of Nursing, University of Pennsylvania, Philadelphia, PA, USA
- Lindsay Haines
  - Hospice & Palliative Care, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Joseph Puhl
  - Palliative and Advanced Illness Research Center, Department of Medicine, University of Pennsylvania, Philadelphia, PA, USA
5
Dada A, Ufer TL, Kim M, Hasin M, Spieker N, Forsting M, Nensa F, Egger J, Kleesiek J. Information extraction from weakly structured radiological reports with natural language queries. Eur Radiol 2024; 34:330-337. PMID: 37505252. DOI: 10.1007/s00330-023-09977-3.
Abstract
OBJECTIVES Provide physicians and researchers an efficient way to extract information from weakly structured radiology reports with natural language processing (NLP) machine learning models. METHODS We evaluate seven different German bidirectional encoder representations from transformers (BERT) models on a dataset of 857,783 unlabeled radiology reports and an annotated reading comprehension dataset in the format of SQuAD 2.0 based on 1223 additional reports. RESULTS Continued pre-training of a BERT model on the radiology dataset and a medical online encyclopedia resulted in the most accurate model with an F1-score of 83.97% and an exact match score of 71.63% for answerable questions and 96.01% accuracy in detecting unanswerable questions. Fine-tuning a non-medical model without further pre-training led to the lowest-performing model. The final model proved stable against variation in the formulations of questions and in dealing with questions on topics excluded from the training set. CONCLUSIONS General domain BERT models further pre-trained on radiological data achieve high accuracy in answering questions on radiology reports. We propose to integrate our approach into the workflow of medical practitioners and researchers to extract information from radiology reports. CLINICAL RELEVANCE STATEMENT By reducing the need for manual searches of radiology reports, radiologists' resources are freed up, which indirectly benefits patients. KEY POINTS • BERT models pre-trained on general domain datasets and radiology reports achieve high accuracy (83.97% F1-score) on question-answering for radiology reports. • The best performing model achieves an F1-score of 83.97% for answerable questions and 96.01% accuracy for questions without an answer. • Additional radiology-specific pretraining of all investigated BERT models improves their performance.
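The question-answering setup can be sketched with the Hugging Face transformers pipeline. The checkpoint below is a generic public SQuAD 2.0 model, not the authors' German radiology-adapted BERT, and the report text is invented.

```python
from transformers import pipeline

qa = pipeline(
    "question-answering",
    model="deepset/roberta-base-squad2",  # assumption: any SQuAD 2.0 checkpoint fits this pattern
)

report = (
    "CT thorax: A 12 mm nodule in the left lower lobe. "
    "No mediastinal lymphadenopathy. No pleural effusion."
)

result = qa(
    question="How large is the nodule?",
    context=report,
    handle_impossible_answer=True,  # SQuAD 2.0 style: permit 'no answer'
)
print(result["answer"])  # expected span: '12 mm'
```

Detecting unanswerable questions (the 96.01% accuracy reported above) corresponds to the model returning an empty span instead of forcing an answer.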
Affiliation(s)
- Amin Dada
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
- Tim Leon Ufer
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
- Moon Kim
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
- Max Hasin
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
- Michael Forsting
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
  - Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Essen, Germany
- Felix Nensa
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
  - Institute of Diagnostic and Interventional Radiology and Neuroradiology, University Hospital Essen, Essen, Germany
- Jan Egger
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
  - Cancer Research Center Cologne Essen (CCCE), University Medicine Essen, Essen, Germany
- Jens Kleesiek
  - Institute of AI in Medicine (IKIM), University Hospital Essen, Girardetstraße 2, 45131, Essen, Germany
  - Dr. Krüger MVZ GmbH, Bocholt, Germany
  - German Cancer Consortium (DKTK), Partner Site Essen, Essen, Germany
6
Fuenteslópez CV, McKitrick A, Corvi J, Ginebra MP, Hakimi O. Biomaterials text mining: A hands-on comparative study of methods on polydioxanone biocompatibility. N Biotechnol 2023; 77:161-175. PMID: 37673372. DOI: 10.1016/j.nbt.2023.09.001.
Abstract
Scientific information extraction is fundamental for research and innovation, but is currently mostly a manual, time-consuming process. Text Mining tools (TMTs) enable automated, accurate and quick information extraction from text, but there is little precedent of their use in the biomaterials field. Here, we compare the ability of various TMTs to extract useful information from biomaterials abstracts. Focusing on the biocompatibility of polydioxanone, a biodegradable polymer for which there are relatively few scientific publications, we tested several tools ranging from machine learning approaches and statistical text analysis to MeSH indexing and domain-specific semantic tools for Named Entity Recognition. We also evaluated their output alongside a manual review of systematic reviews and meta-analyses. The findings show that TMTs can be highly efficient and powerful for mapping biomaterials texts and rapidly yield up-to-date information. Here, TMTs enable one to identify dominating themes, see the evolution of specific terms and topics, and learn about key medical applications in biomaterials literature over the years. The analysis also shows that ambiguity around biomaterials nomenclature is a significant challenge in mining biomedical literature that is yet to be tackled. This research showcases the potential value of using Natural Language Processing and domain-specific tools to extract and organize biomaterials data.
Affiliation(s)
- Carla V Fuenteslópez
  - Institute of Biomedical Engineering, Botnar Research Centre, Nuffield Orthopaedic Centre, University of Oxford, Oxford OX3 7LD, UK
- Austin McKitrick
  - Institute of Social Research, University of Michigan, MI 48104, USA
- Javier Corvi
  - Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain
- Maria-Pau Ginebra
  - Department of Materials Science and Engineering, Universitat Politècnica de Catalunya, Barcelona 08019, Spain
- Osnat Hakimi
  - Barcelona Supercomputing Center (BSC), Barcelona 08034, Spain; Department of Materials Science and Engineering, Universitat Politècnica de Catalunya, Barcelona 08019, Spain; Faculty of Medicine and Health Sciences, Universitat Internacional de Catalunya, Barcelona 08017, Spain
7
Ma MW, Gao XS, Zhang ZY, Shang SY, Jin L, Liu PL, Lv F, Ni W, Han YC, Zong H. Extracting laboratory test information from paper-based reports. BMC Med Inform Decis Mak 2023; 23:251. PMID: 37932733. PMCID: PMC10629084. DOI: 10.1186/s12911-023-02346-6.
Abstract
BACKGROUND In the healthcare domain today, despite the substantial adoption of electronic health information systems, a significant proportion of medical reports still exist in paper-based formats. As a result, there is significant demand for digitizing the information in these paper-based reports. However, digitizing paper-based laboratory reports into a structured data format can be challenging due to their non-standard layouts, which include various data types such as text, numeric values, reference ranges, and units. Therefore, it is crucial to develop a highly scalable and lightweight technique that can effectively identify and extract information from laboratory test reports and convert it into a structured format for downstream tasks. METHODS We developed an end-to-end Natural Language Processing (NLP)-based pipeline for extracting information from paper-based laboratory test reports. Our pipeline consists of two main modules: an optical character recognition (OCR) module and an information extraction (IE) module. The OCR module locates and identifies text in scanned laboratory test reports using state-of-the-art OCR algorithms. The IE module then extracts meaningful information from the OCR results to form digitized tables of the test reports. The IE module consists of five sub-modules: time detection, headline position, line normalization, Named Entity Recognition (NER) with a Conditional Random Fields (CRF)-based method, and step detection for multi-column layouts. Finally, we evaluated the performance of the proposed pipeline on 153 laboratory test reports collected from Peking University First Hospital (PKU1). RESULTS In the OCR module, we evaluated the accuracy of text detection and recognition at three different levels and achieved an average accuracy of 0.93. In the IE module, we extracted four laboratory test entities: test item name, test result, test unit, and reference value range. The overall F1 score is 0.86 on the 153 laboratory test reports collected from PKU1. With a single CPU, the average inference time per report is only 0.78 s. CONCLUSION In this study, we developed a practical, lightweight pipeline to digitize and extract information from paper-based laboratory test reports of diverse types and layouts that can be adopted in real clinical environments with minimal computing resources. The high evaluation performance on the real-world hospital dataset validated the feasibility of the proposed pipeline.
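A toy sketch of the two modules: pytesseract stands in for the OCR stage, and a single line-level pattern stands in for the IE stage (the actual pipeline uses CRF-based NER plus headline, time, and multi-column step detection rather than one pattern). The file name and report layout are assumptions.

```python
import re
import pytesseract
from PIL import Image

def ocr_report(path: str) -> list[str]:
    """OCR module: scanned report image -> non-empty text lines."""
    text = pytesseract.image_to_string(Image.open(path))
    return [line.strip() for line in text.splitlines() if line.strip()]

# 'item result unit low-high', e.g. 'Hemoglobin 135 g/L 115-150'
LINE = re.compile(
    r"(?P<item>[A-Za-z ()]+?)\s+(?P<result>[\d.]+)\s+(?P<unit>\S+)\s+(?P<ref>[\d.]+\s*-\s*[\d.]+)"
)

def parse_line(line: str) -> dict | None:
    """IE module (simplified): map one OCR line to the four target entities."""
    m = LINE.match(line)
    return m.groupdict() if m else None

for line in ocr_report("lab_report.png"):  # hypothetical scanned report
    row = parse_line(line)
    if row:
        print(row)  # e.g. {'item': 'Hemoglobin', 'result': '135', 'unit': 'g/L', 'ref': '115-150'}
```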
Affiliation(s)
- Ming-Wei Ma
  - Department of Radiation Oncology, Peking University First Hospital, No.7 Xishiku Street, Beijing, 100034, China
- Xian-Shu Gao
  - Department of Radiation Oncology, Peking University First Hospital, No.7 Xishiku Street, Beijing, 100034, China
- Ze-Yu Zhang
  - Philips Research China, Shanghai, 200072, China
- Shi-Yu Shang
  - Department of Radiation Oncology, Peking University First Hospital, No.7 Xishiku Street, Beijing, 100034, China
- Ling Jin
  - Philips Research China, Shanghai, 200072, China
- Pei-Lin Liu
  - Department of Radiation Oncology, Peking University First Hospital, No.7 Xishiku Street, Beijing, 100034, China
- Feng Lv
  - Department of Radiation Oncology, Peking University First Hospital, No.7 Xishiku Street, Beijing, 100034, China
- Wei Ni
  - Philips Research China, Shanghai, 200072, China
- Yu-Chen Han
  - Philips Research China, Shanghai, 200072, China
- Hui Zong
  - Philips Research China, Shanghai, 200072, China
8
Hu Y, Chen Y, Qin Y, Huang R. Learning entity-oriented representation for biomedical relation extraction. J Biomed Inform 2023; 147:104527. PMID: 37852347. DOI: 10.1016/j.jbi.2023.104527.
Abstract
Biomedical Relation Extraction (BioRE) aims to automatically extract semantic relations for given entity pairs and is of great significance in biomedical research. Current popular methods often utilize pretrained language models to extract semantic features from individual input instances, which frequently suffer from overlapping semantics. Overlapping semantics refers to the situation in which a sentence contains multiple entity pairs that share the same context, leading to highly similar information between these entity pairs. In this study, we propose a model for learning Entity-oriented Representation (EoR) that aims to improve the performance of the model by enhancing the discriminability between entity pairs. It contains three modules: sentence representation, entity-oriented representation, and output. The first module learns the global semantic information of the input instance; the second module focuses on extracting the semantic information of the sentence from the target entities; and the third module enhances distinguishability among entity pairs and classifies the relation type. We evaluated our approach on four BioRE tasks with eight datasets, and the experiments showed that our EoR achieved state-of-the-art performance for PPI, DDI, CPI, and DPI tasks. Further analysis demonstrated the benefits of entity-oriented semantic information in handling multiple entity pairs in the BioRE task.
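The general idea can be illustrated with the widely used entity-marker technique: wrap the target pair in marker tokens so that two pairs sharing one sentence get different encodings, then pool the encoder states at the markers. This sketch shows that generic technique with a vanilla BERT; it is not the paper's exact EoR architecture.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

# marker tokens let the encoder distinguish this pair from other pairs
# that share the same sentence context
tokenizer.add_special_tokens({"additional_special_tokens": ["[E1]", "[/E1]", "[E2]", "[/E2]"]})
model.resize_token_embeddings(len(tokenizer))

sentence = "[E1] aspirin [/E1] increases the bleeding risk of [E2] warfarin [/E2] ."
inputs = tokenizer(sentence, return_tensors="pt")
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state  # shape (1, seq_len, 768)

ids = inputs["input_ids"][0]
e1 = (ids == tokenizer.convert_tokens_to_ids("[E1]")).nonzero()[0, 0]
e2 = (ids == tokenizer.convert_tokens_to_ids("[E2]")).nonzero()[0, 0]
pair_repr = torch.cat([hidden[0, e1], hidden[0, e2]])  # input to a relation classifier
print(pair_repr.shape)  # torch.Size([1536])
```

Swapping which entities are marked changes pair_repr even though the sentence is identical, which is exactly the discriminability that shared-context entity pairs need.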
Affiliation(s)
- Ying Hu
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
- Yanping Chen
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
- Yongbin Qin
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
- Ruizhang Huang
  - Text Computing and Cognitive Intelligence Engineering Research Center of National Education Ministry, State Key Laboratory of Public Big Data, College of Computer Science and Technology, Guizhou University, Guiyang, 550025, China
9
Whitton J, Hunter A. Automated tabulation of clinical trial results: A joint entity and relation extraction approach with transformer-based language representations. Artif Intell Med 2023; 144:102661. PMID: 37783549. DOI: 10.1016/j.artmed.2023.102661.
Abstract
Evidence-based medicine, the practice in which healthcare professionals refer to the best available evidence when making decisions, forms the foundation of modern healthcare. However, it relies on labour-intensive systematic reviews, where domain specialists must aggregate and extract information from thousands of publications, primarily of randomised controlled trial (RCT) results, into evidence tables. This paper investigates automating evidence table generation by decomposing the problem across two language processing tasks: named entity recognition, which identifies key entities within text, such as drug names, and relation extraction, which maps their relationships for separating them into ordered tuples. We focus on the automatic tabulation of sentences from published RCT abstracts that report the results of the study outcomes. Two deep neural net models were developed as part of a joint extraction pipeline, using the principles of transfer learning and transformer-based language representations. To train and test these models, a new gold-standard corpus was developed, comprising over 550 result sentences from six disease areas. This approach demonstrated significant advantages, with our system performing well across multiple natural language processing tasks and disease areas, as well as in generalising to disease domains unseen during training. Furthermore, we show these results were achievable through training our models on as few as 170 example sentences. The final system is a proof of concept that the generation of evidence tables can be semi-automated, representing a step towards fully automating systematic reviews.
Affiliation(s)
- Jetsun Whitton
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
- Anthony Hunter
  - Department of Computer Science, University College London, Gower Street, London, WC1E 6BT, UK
10
Aletaha A, Nemati-Anaraki L, Keshtkar A, Sedghi S, Keramatfar A, Korolyova A. A Scoping Review of Adopted Information Extraction Methods for RCTs. Med J Islam Repub Iran 2023; 37:95. PMID: 38021383. PMCID: PMC10657257. DOI: 10.47176/mjiri.37.95.
Abstract
Background Randomized controlled trials (RCTs) provide the strongest evidence for therapeutic interventions and their effects on groups of subjects. However, the large amount of unstructured information in these trials makes it challenging and time-consuming to make decisions and identify important concepts and valid evidence. This study aims to explore methods for automating or semi-automating information extraction from reports of RCT studies. Methods We conducted a systematic search of PubMed, ACM Digital Library, and Web of Science to identify relevant articles published between January 1, 2010 and 2022. We focused on published Natural Language Processing (NLP), machine learning, and deep learning methods that automate or semi-automate the extraction of key elements of information in the context of RCTs. Results A total of 26 publications were included, which discussed the automatic extraction of key characteristics of RCTs using various PICO frameworks (PIBOSO and PECODR). Among these publications, 14 (53.8%) extracted key characteristics based on PICO, PIBOSO, and PECODR, while 12 (46.1%) discussed information extraction methods in RCT studies. Commonly mentioned approaches included word/phrase matching, machine learning algorithms such as binary classification with the Naïve Bayes algorithm, the BERT network for feature extraction, support vector machines for data classification, conditional random fields, non-machine-dependent automation, and other machine learning or deep learning approaches. Conclusion The lack of publicly available software and limited access to existing software make it difficult to determine the most powerful information extraction system. However, deep learning models such as Transformers and BERT language models have shown better performance in natural language processing.
Affiliation(s)
- Azadeh Aletaha
  - Department of Medical Library and Information Science, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
  - Evidence-Based Medicine Research Center, Endocrinology and Metabolism Clinical Sciences Institute, Tehran University of Medical Sciences, Tehran, Iran
- Leila Nemati-Anaraki
  - Department of Medical Library and Information Science, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
  - Health Management and Economics Research Center, Health Management Research Institute, Iran University of Medical Sciences, Tehran, Iran
- AbbasAli Keshtkar
  - Department of Health Science Educational Development, School of Public Health, Tehran University of Medical Sciences, Tehran, Iran
- Shahram Sedghi
  - Department of Medical Library and Information Science, School of Health Management and Information Sciences, Iran University of Medical Sciences, Tehran, Iran
  - Economics Research Center, Iran University of Medical Sciences, PO Box 14665-354, Tehran, Iran
- Anna Korolyova
  - Computer Science Laboratory for Mechanics and Engineering Sciences (LIMSI), CNRS, Université Paris-Saclay, F-91405 Orsay, France
  - School of Life Sciences and Facility Management, Zurich University of Applied Sciences (ZHAW)
  - Fraser House, White Cross Business Park, Lancaster, LA1 4XQ
11
Elmarakeby HA, Trukhanov PS, Arroyo VM, Riaz IB, Schrag D, Van Allen EM, Kehl KL. Empirical evaluation of language modeling to ascertain cancer outcomes from clinical text reports. BMC Bioinformatics 2023; 24:328. PMID: 37658330. PMCID: PMC10474750. DOI: 10.1186/s12859-023-05439-1.
Abstract
BACKGROUND Longitudinal data on key cancer outcomes for clinical research, such as response to treatment and disease progression, are not captured in standard cancer registry reporting. Manual extraction of such outcomes from unstructured electronic health records is a slow, resource-intensive process. Natural language processing (NLP) methods can accelerate outcome annotation, but they require substantial labeled data. Transfer learning based on language modeling, particularly using the Transformer architecture, has achieved improvements in NLP performance. However, there has been no systematic evaluation of NLP model training strategies on the extraction of cancer outcomes from unstructured text. RESULTS We evaluated the performance of nine NLP models at the two tasks of identifying cancer response and cancer progression within imaging reports at a single academic center among patients with non-small cell lung cancer. We trained the classification models under different conditions, including training sample size, classification architecture, and language model pre-training. The training involved a labeled dataset of 14,218 imaging reports for 1112 patients with lung cancer. A subset of models was based on a pre-trained language model, DFCI-ImagingBERT, created by further pre-training a BERT-based model using an unlabeled dataset of 662,579 reports from 27,483 patients with cancer from our center. A classifier based on our DFCI-ImagingBERT, trained on more than 200 patients, achieved the best results in most experiments; however, these results were marginally better than simpler "bag of words" or convolutional neural network models. CONCLUSION When developing AI models to extract outcomes from imaging reports for clinical cancer research, if computational resources are plentiful but labeled training data are limited, large language models can be used for zero- or few-shot learning to achieve reasonable performance. When computational resources are more limited but labeled training data are readily available, even simple machine learning architectures can achieve good performance for such tasks.
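For reference, the kind of simple bag-of-words classifier the study found competitive can be written in a few lines of scikit-learn. The reports and labels here are invented stand-ins for the annotated imaging-report corpus.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# toy stand-ins for labeled imaging reports (1 = documents progression)
reports = [
    "Interval decrease in size of the left lower lobe mass.",
    "New liver lesions concerning for disease progression.",
    "Stable appearance of the pulmonary nodules.",
    "Enlarging mediastinal lymphadenopathy, consistent with progression.",
]
labels = [0, 1, 0, 1]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(reports, labels)
print(clf.predict(["Increasing size of the hepatic metastases."]))  # progression flag
```

The paper's practical takeaway is that with ample labeled data this baseline lands close to a further pre-trained BERT, so the extra compute buys little for such tasks.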
Affiliation(s)
- Haitham A Elmarakeby
  - Dana-Farber Cancer Institute, Boston, MA, USA
  - Al-Azhar University, Cairo, Egypt
  - Harvard Medical School, Boston, MA, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Irbaz Bin Riaz
  - Dana-Farber Cancer Institute, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - Mayo Clinic, Rochester, MN, USA
- Deborah Schrag
  - Memorial Sloan Kettering Cancer Center, New York, NY, USA
- Eliezer M Van Allen
  - Dana-Farber Cancer Institute, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
  - The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Kenneth L Kehl
  - Dana-Farber Cancer Institute, Boston, MA, USA
  - Harvard Medical School, Boston, MA, USA
12
Frei J, Kramer F. Annotated dataset creation through large language models for non-English medical NLP. J Biomed Inform 2023; 145:104478. PMID: 37625508. DOI: 10.1016/j.jbi.2023.104478.
Abstract
Obtaining text datasets with semantic annotations is an effortful process, yet crucial for supervised training in natural language processing (NLP). In general, developing and applying new NLP pipelines in domain-specific contexts often requires custom-designed datasets to address the target task in a supervised machine learning fashion. When operating in non-English languages for medical data processing, this exposes several interconnected problems, both minor and major, such as the lack of task-matching datasets and of task-specific pre-trained models. In our work, we suggest leveraging pre-trained large language models for training data acquisition in order to retrieve sufficiently large datasets for training smaller and more efficient models for use-case-specific tasks. To demonstrate the effectiveness of our approach, we create a custom dataset that we use to train GPTNERMED, a medical NER model for German texts; our method remains language-independent in principle. Our dataset and pre-trained models are publicly available at https://github.com/frankkramer-lab/GPTNERMED.
Affiliation(s)
- Johann Frei
  - IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany
- Frank Kramer
  - IT-Infrastructure for Translational Medical Research, University of Augsburg, Alter Postweg 101, 86159 Augsburg, Germany
13
Szekér S, Fogarassy G, Vathy-Fogarassy Á. A general text mining method to extract echocardiography measurement results from echocardiography documents. Artif Intell Med 2023; 143:102584. PMID: 37673570. DOI: 10.1016/j.artmed.2023.102584.
Abstract
BACKGROUND In everyday medical practice, the results of cardiac ultrasound examinations are generally recorded in unstructured text, from which extracting relevant information is an important and challenging task. This paper presents a generally applicable, language- and corpus-independent text mining method for extracting and structuring numerical measurement results and their descriptions from echocardiography reports. METHOD The developed method is based on generally applicable text mining preprocessing activities; it automatically identifies and standardizes the descriptions of the cardiac ultrasound measures, and it stores the extracted and standardized measurement descriptions with their measurement results in a structured form for later use. The method does not contain any regular-expression-based search and does not rely on information about the structure of the document. RESULTS The method was tested on a document set containing more than 20,000 echocardiographic reports by examining the extraction of 12 echocardiography parameters considered important by experts. The method extracted and structured the echocardiography parameters under study with good sensitivity (lowest value: 0.775, highest value: 1.0, average: 0.904) and excellent specificity (1.0 in all cases). The F1 score ranged between 0.873 and 1.0, with an average of 0.948. CONCLUSION The presented case study has shown that the proposed method can extract measurement results from echocardiography documents with high confidence without performing a direct search or having detailed information about data recording habits. Furthermore, it effectively handles spelling errors, abbreviations, and the highly varied terminology used in descriptions. As it does not rely on any information about the structure or language of the documents or the data recording habits, it can be applied to process any free-text medical documents.
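In that spirit, here is a minimal regex-free pairing sketch: fuzzy-match each token against a table of standardized measure descriptors, then take the adjacent numeric token as the value. The synonym table and the 0.8 cutoff are illustrative assumptions, not the authors' method.

```python
import difflib

# illustrative synonym table: free-text descriptor -> standardized measure name
STANDARD = {
    "ef": "ejection fraction",
    "lvef": "ejection fraction",
    "lvedd": "LV end-diastolic diameter",
}

def extract_measurements(report: str) -> dict[str, float]:
    tokens = report.lower().replace(":", " ").split()
    results = {}
    for i, tok in enumerate(tokens[:-1]):
        # close-match lookup tolerates typos and abbreviation variants
        match = difflib.get_close_matches(tok, list(STANDARD), n=1, cutoff=0.8)
        value = tokens[i + 1].rstrip("%,.")
        if match and value.replace(".", "", 1).isdigit():
            results[STANDARD[match[0]]] = float(value)
    return results

print(extract_measurements("LVEDD: 48 mm, EF 60%."))
# {'LV end-diastolic diameter': 48.0, 'ejection fraction': 60.0}
```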
Affiliation(s)
- Szabolcs Szekér
  - Department of Computer Science and Systems Technology, University of Pannonia, Veszprém, Hungary
- György Fogarassy
  - 1st Department of Cardiology, State Hospital for Cardiology, Balatonfüred, Hungary
- Ágnes Vathy-Fogarassy
  - Department of Computer Science and Systems Technology, University of Pannonia, Veszprém, Hungary
14
Dang LD, Phan UTP, Nguyen NTH. GENA: A knowledge graph for nutrition and mental health. J Biomed Inform 2023; 145:104460. PMID: 37532000. DOI: 10.1016/j.jbi.2023.104460.
Abstract
While a large number of knowledge graphs have previously been developed by automatically extracting and structuring knowledge from literature, there is currently no such knowledge graph that encodes relationships between food, biochemicals and mental illnesses, even though a large amount of knowledge about these relationships is available in the form of unstructured text in biomedical literature articles. To address this limitation, this article describes the development of GENA (Graph of mEntal-health and Nutrition Association), a knowledge graph that represents relations between nutrition and mental health, extracted from biomedical abstracts. GENA is constructed from PubMed abstracts that contain keywords relating to chemicals, food, and health. A hybrid named entity recognition (NER) model is firstly applied to these abstracts to identify various entities of interest. Subsequently, a deep syntax-based relation extraction model is used to detect binary relations between the identified entities. Finally, the resulting relations are used to populate the GENA knowledge graph, whose relationships can be accessed in an intuitive and interpretable manner using the Neo4j Database Management System. To evaluate the reliability of GENA, two annotators manually assessed a subset of the extracted relations. The evaluation results show that our methods obtain high precision for the NER task and acceptable precision and relative recall for the relation extraction task. GENA consists of 43,367 relationships that encode information about nutrition and health, of which 94.04% are new relations that are not present in existing ontologies of food and diseases. GENA is constructed based on scientific principles, and has the potential to be used within further applications to contribute towards scientific research within the domain. It is a pioneering knowledge graph in nutrition and mental health, containing a diverse range of relationship types. All of our source code and results are publicly available at https://github.com/ddlinh/gena-db.
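Once (entity, relation, entity) triples are extracted, loading them into Neo4j is straightforward. The node labels, connection URI, and credentials below are placeholders, and the triples are invented examples of the kind of relation GENA stores.

```python
from neo4j import GraphDatabase

triples = [
    ("vitamin D", "ASSOCIATED_WITH", "depression"),
    ("omega-3 fatty acids", "ASSOCIATED_WITH", "anxiety"),
]

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))
with driver.session() as session:
    for head, rel, tail in triples:
        # MERGE deduplicates when the same relation is extracted from
        # several abstracts; Cypher cannot parameterize the relationship
        # type, hence the interpolation for rel
        session.run(
            f"MERGE (a:Nutrient {{name: $head}}) "
            f"MERGE (b:Condition {{name: $tail}}) "
            f"MERGE (a)-[:{rel}]->(b)",
            head=head, tail=tail,
        )
driver.close()
```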
Affiliation(s)
- Linh D Dang
  - Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
- Uyen T P Phan
  - Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam; Vietnam National University, Ho Chi Minh City, Vietnam
- Nhung T H Nguyen
  - Department of Computer Science, The University of Manchester, Manchester, United Kingdom
15
Zhou S, Wang N, Wang L, Sun J, Blaes A, Liu H, Zhang R. A cross-institutional evaluation on breast cancer phenotyping NLP algorithms on electronic health records. Comput Struct Biotechnol J 2023; 22:32-40. PMID: 37680211. PMCID: PMC10480628. DOI: 10.1016/j.csbj.2023.08.018.
Abstract
Objective Transformer-based language models are prevailing in the clinical domain due to their excellent performance on clinical NLP tasks. The generalizability of those models is usually ignored during the model development process. This study evaluated the generalizability of CancerBERT, a Transformer-based clinical NLP model, along with classic machine learning models, i.e., conditional random fields (CRF) and bi-directional long short-term memory CRF (BiLSTM-CRF), across different clinical institutes through a breast cancer phenotype extraction task. Materials and methods Two clinical corpora of breast cancer patients were collected from the electronic health records of the University of Minnesota (UMN) and Mayo Clinic (MC) and annotated following the same guideline. We developed three types of NLP models (CRF, BiLSTM-CRF, and CancerBERT) to extract cancer phenotypes from clinical texts. We evaluated the generalizability of the models on different test sets with different learning strategies (model transfer vs. locally trained). The entity coverage score was assessed for its association with model performance. Results We manually annotated 200 and 161 clinical documents at UMN and MC, respectively. The corpora of the two institutes were found to have higher similarity between the target entities than the overall corpora. The CancerBERT models obtained the best performance on the independent test sets from the two clinical institutes and on the permutation test set. The CancerBERT model developed in one institute and further fine-tuned in another achieved reasonable performance compared to the model developed on local data (micro-F1: 0.925 vs 0.932). Conclusions The results indicate that the CancerBERT model has superior learning ability and generalizability among the three types of clinical NLP models for our named entity recognition task. It has the advantage in recognizing complex entities, e.g., entities with different labels.
Affiliation(s)
- Sicheng Zhou
  - Institute for Health Informatics, University of Minnesota, Minneapolis, MN, USA
- Nan Wang
  - School of Statistics, University of Minnesota, Minneapolis, MN, USA
- Liwei Wang
  - Department of AI and Informatics, Mayo Clinic, Rochester, MN, USA
- Ju Sun
  - Department of Computer Science & Engineering, University of Minnesota, Minneapolis, MN, USA
- Anne Blaes
  - Department of Medicine, University of Minnesota, Minneapolis, MN, USA
- Hongfang Liu
  - Department of AI and Informatics, Mayo Clinic, Rochester, MN, USA
- Rui Zhang
  - Division of Computational Health Sciences, Department of Surgery, University of Minnesota, Minneapolis, MN, USA
16
Zhu E, Sheng Q, Yang H, Liu Y, Cai T, Li J. A unified framework of medical information annotation and extraction for Chinese clinical text. Artif Intell Med 2023; 142:102573. PMID: 37316096. DOI: 10.1016/j.artmed.2023.102573.
Abstract
Medical information extraction consists of a group of natural language processing (NLP) tasks that collaboratively convert clinical text into pre-defined structured formats. This is a critical step in exploiting electronic medical records (EMRs). Given the recent thriving NLP technologies, model implementation and performance are no longer the main obstacles; the bottleneck lies in obtaining a high-quality annotated corpus and organizing the whole engineering workflow. This study presents an engineering framework consisting of three tasks: medical entity recognition, relation extraction, and attribute extraction. Within this framework, the whole workflow is demonstrated, from EMR data collection through model performance evaluation. Our annotation scheme is designed to be comprehensive and compatible across the multiple tasks. With EMRs from a general hospital in Ningbo, China, and manual annotation by experienced physicians, our corpus is of large scale and high quality. Built upon this Chinese clinical corpus, the medical information extraction system shows performance that approaches human annotation. The annotation scheme, (a subset of) the annotated corpus, and the code are all publicly released to facilitate further research.
Affiliation(s)
- Enwei Zhu
  - Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China
- Qilin Sheng
  - Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China
- Huanwan Yang
  - Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China
- Yiyang Liu
  - Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China
- Ting Cai
  - Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China
- Jinpeng Li
  - Ningbo No. 2 Hospital, Ningbo 315010, Zhejiang Province, PR China; Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo 315016, Zhejiang Province, PR China
17
Mahajan D, Liang JJ, Tsou CH, Uzuner Ö. Overview of the 2022 n2c2 shared task on contextualized medication event extraction in clinical notes. J Biomed Inform 2023; 144:104432. PMID: 37356640. PMCID: PMC10529825. DOI: 10.1016/j.jbi.2023.104432.
Abstract
BACKGROUND An accurate medication history, foundational for providing quality medical care, requires understanding of medication change events documented in clinical notes. However, extracting medication changes without the necessary clinical context is insufficient for real-world applications. METHODS To address this need, Track 1 of the 2022 National NLP Clinical Challenges focused on extracting the context for medication changes documented in clinical notes using the Contextualized Medication Event Dataset. Track 1 consisted of three subtasks: extracting medication mentions from clinical notes (NER), determining whether a medication change is being discussed (Event), and determining the action, negation, temporality, certainty, and actor for any change events (Context). Participants were allowed to take part in any one or more of the subtasks. RESULTS A total of 32 teams with participants from 19 countries submitted 211 systems across all subtasks. Most teams formulated NER as a token classification task and Event and Context as multi-class classification tasks, using transformer-based large language models. Overall, performance for NER was high across submitted systems. However, performance for Event and Context was much lower, often due to indirectly stated change events with no clear action verb, events that require more distant textual clues for understanding, and medication mentions with multiple change events. CONCLUSIONS This shared task showed that while NLP research on medication extraction is relatively mature, understanding of the contextual information surrounding medication events in clinical notes remains an open problem requiring further research to achieve the end goal of supporting real-world clinical applications.
Affiliation(s)
- Diwakar Mahajan
  - IBM T.J. Watson Research Center, Yorktown Heights, NY, United States of America
- Jennifer J Liang
  - IBM T.J. Watson Research Center, Yorktown Heights, NY, United States of America
- Ching-Huei Tsou
  - IBM T.J. Watson Research Center, Yorktown Heights, NY, United States of America
- Özlem Uzuner
  - Department of Information Sciences & Technology, George Mason University, Fairfax, VA, United States of America
18
Hosch R, Baldini G, Parmar V, Borys K, Koitka S, Engelke M, Arzideh K, Ulrich M, Nensa F. FHIR-PYrate: a data science friendly Python package to query FHIR servers. BMC Health Serv Res 2023; 23:734. PMID: 37415138. DOI: 10.1186/s12913-023-09498-1.
Abstract
BACKGROUND We present FHIR-PYrate, a Python package to handle the full clinical data collection and extraction process. The software is meant to be plugged into a modern hospital domain, where electronic patient records are used to handle the patient's entire history. Most research institutes follow the same procedures to build study cohorts, but mainly in a non-standardized and repetitive way. As a result, researchers spend time writing boilerplate code that could be used for more challenging tasks. METHODS The package can improve and simplify existing processes in the clinical research environment. It collects all needed functionalities into a straightforward interface that can be used to query a FHIR server, download imaging studies, and filter clinical documents. The full capacity of the search mechanism of the FHIR REST API is available to the user, leading to a uniform querying process for all resources and simplifying the customization of each use case. Additionally, valuable features like parallelization and filtering are included to improve performance. RESULTS As an exemplary practical application, the package can be used to analyze the prognostic significance of routine CT imaging and clinical data in breast cancer with tumor metastases in the lungs. In this example, the initial patient cohort is first collected using ICD-10 codes. For these patients, the survival information is gathered, additional clinical data are retrieved, and CT scans of the thorax are downloaded. Finally, a survival analysis can be computed using a deep learning model with the CT scans, the TNM staging, and the positivity of relevant markers as input. This process may vary depending on the FHIR server and the available clinical data, and can be customized to cover even more use cases. CONCLUSIONS FHIR-PYrate opens up the possibility to quickly and easily retrieve FHIR data, download image data, and search medical documents for keywords within a single Python package. With the demonstrated functionality, FHIR-PYrate offers an easy way to assemble research cohorts automatically.
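Under the hood this boils down to FHIR search over REST, which can be sketched with plain requests against a public test server; FHIR-PYrate layers pagination, parallelization, and DataFrame conversion on top of queries like the one below. The endpoint, ICD-10 code, and field paths are illustrative.

```python
import requests

BASE = "https://hapi.fhir.org/baseR4"  # public HAPI FHIR test server

# FHIR search: Condition resources carrying a given ICD-10 code
resp = requests.get(
    f"{BASE}/Condition",
    params={"code": "http://hl7.org/fhir/sid/icd-10|C50.9", "_count": 10},
    timeout=30,
)
resp.raise_for_status()
bundle = resp.json()  # a FHIR Bundle resource

for entry in bundle.get("entry", []):
    condition = entry["resource"]
    print(condition.get("subject", {}).get("reference"))  # e.g. 'Patient/123'
```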
Collapse
Affiliation(s)
- René Hosch
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
| | - Giulia Baldini
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany.
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany.
| | - Vicky Parmar
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
| | - Katarzyna Borys
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
| | - Sven Koitka
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
| | - Merlin Engelke
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
| | - Kamyar Arzideh
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
- Central IT Department, Data Integration Center, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
| | - Moritz Ulrich
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
- Central IT Department, Data Integration Center, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
| | - Felix Nensa
- Institute of Interventional and Diagnostic Radiology and Neuroradiology, University Hospital Essen, Hufelandstraße 55, Essen, 45147, Germany
- Institute for Artificial Intelligence in Medicine, University Hospital Essen, Girardetstraße 2, Essen, 45131, Germany
| |
Collapse
|
19
|
Boguslav MR, Salem NM, White EK, Sullivan KJ, Bada M, Hernandez TL, Leach SM, Hunter LE. Creating an ignorance-base: Exploring known unknowns in the scientific literature. J Biomed Inform 2023; 143:104405. [PMID: 37270143 PMCID: PMC10528083 DOI: 10.1016/j.jbi.2023.104405] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/08/2022] [Revised: 05/18/2023] [Accepted: 05/21/2023] [Indexed: 06/05/2023]
Abstract
BACKGROUND Scientific discovery progresses by exploring new and uncharted territory. More specifically, it advances by a process of transforming unknown unknowns first into known unknowns, and then into knowns. Over the last few decades, researchers have developed many knowledge bases to capture and connect the knowns, which has enabled topic exploration and contextualization of experimental results. But recognizing the unknowns is also critical for finding the most pertinent questions and their answers. Prior work on known unknowns has sought to understand them, annotate them, and automate their identification. However, no knowledge base yet exists to capture these unknowns, and little work has focused on how scientists might use them to trace a given topic or experimental result in search of open questions and new avenues for exploration. We show here that a knowledge base of unknowns can be connected to ontologically grounded biomedical knowledge to accelerate research in the field of prenatal nutrition. RESULTS We present the first ignorance-base, a knowledge base created by combining classifiers to recognize ignorance statements (statements of missing or incomplete knowledge that imply a goal for knowledge) and biomedical concepts over the prenatal nutrition literature. This knowledge base places biomedical concepts mentioned in the literature in context with the ignorance statements authors have made about them. Using our system, researchers interested in the topic of vitamin D and prenatal health were able to uncover three new avenues for exploration (immune system, respiratory system, and brain development) by searching for concepts enriched in ignorance statements; these avenues were buried among the many concepts found by standard enrichment. Additionally, we used the ignorance-base to enrich concepts connected to a gene list associated with vitamin D and spontaneous preterm birth and found an emerging topic of study (brain development) in an implied field (neuroscience). The researchers could look to the field of neuroscience for potential answers to the ignorance statements. CONCLUSION Our goal is to help students, researchers, funders, and publishers better understand the state of our collective scientific ignorance (known unknowns) in order to help accelerate research through the continued illumination of and focus on the known unknowns and their respective goals for scientific knowledge.
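The phrase "concepts enriched in ignorance statements" suggests a standard over-representation test; the sketch below uses a hypergeometric test with invented counts purely for illustration, and the paper's actual enrichment statistic may differ.

```python
# Hedged sketch of concept enrichment: is a concept over-represented among
# ignorance statements relative to the whole corpus? The hypergeometric test
# and all counts are illustrative assumptions, not the paper's exact method.
from scipy.stats import hypergeom

N = 10000   # concept mentions in the whole prenatal-nutrition corpus
K = 800     # mentions occurring inside ignorance statements
n = 120     # mentions of one concept, e.g. "brain development"
k = 35      # of those, how many fall inside ignorance statements

# P(X >= k): survival function of the hypergeometric distribution.
p_value = hypergeom.sf(k - 1, N, K, n)
print(f"enrichment p-value: {p_value:.3g}")
```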
Collapse
Affiliation(s)
- Mayla R Boguslav
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA.
| | - Nourah M Salem
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Elizabeth K White
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
| | - Katherine J Sullivan
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Michael Bada
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Teri L Hernandez
- College of Nursing, Department of Medicine/Division of Endocrinology, Metabolism, & Diabetes, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| | - Sonia M Leach
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA; Center for Genes, Environment and Health, National Jewish Health, Jackson Street, Denver, 80206, CO, USA
| | - Lawrence E Hunter
- Computational Bioscience Program, University of Colorado, Anschutz Medical Campus, E 17th Avenue, Aurora, 80045, CO, USA
| |
Collapse
|
20
|
Yang L, Huang X, Wang J, Yang X, Ding L, Li Z, Li J. Identifying stroke-related quantified evidence from electronic health records in real-world studies. Artif Intell Med 2023; 140:102552. [PMID: 37210153 DOI: 10.1016/j.artmed.2023.102552] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/18/2021] [Revised: 02/28/2023] [Accepted: 04/11/2023] [Indexed: 05/22/2023]
Abstract
BACKGROUND Stroke is one of the leading causes of death and disability worldwide. The National Institutes of Health Stroke Scale (NIHSS) scores in electronic health records (EHRs), which quantitatively describe patients' neurological deficits in evidence-based treatment, are crucial in stroke-related clinical investigations. However, their free-text format and lack of standardization inhibit their effective use. Automatically extracting the scale scores from clinical free text, so that their potential value in real-world studies can be realized, has become an important goal. OBJECTIVE This study aims to develop an automated method to extract scale scores from the free text of EHRs. METHODS We propose a two-step pipeline method to identify NIHSS items and numerical scores and validate its feasibility using a freely accessible critical care database: MIMIC-III (Medical Information Mart for Intensive Care III). First, we utilize MIMIC-III to create an annotated corpus. Then, we investigate possible machine learning methods for two subtasks: NIHSS item and score recognition, and item-score relation extraction. In the evaluation, we conduct both task-specific and end-to-end evaluations and compare our method with a rule-based method, using precision, recall, and F1 scores as evaluation metrics. RESULTS We use all available discharge summaries of stroke cases in MIMIC-III. The annotated NIHSS corpus contains 312 cases, 2929 scale items, 2774 scores and 2733 relations. The results show that the best F1-score of our method was 0.9006, attained by combining BERT-BiLSTM-CRF with a Random Forest, outperforming the rule-based method (F1-score = 0.8098). In the end-to-end task, our method could successfully recognize the item "1b level of consciousness questions", the score "1" and their relation "('1b level of consciousness questions', '1', 'has value')" from the sentence "1b level of consciousness questions: said name = 1", while the rule-based method could not. CONCLUSIONS The two-step pipeline method we propose is an effective approach to identifying NIHSS items, scores and their relations. With its help, clinical investigators can easily retrieve and access structured scale data, thereby supporting stroke-related real-world studies.
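As a rough illustration of the second pipeline step, the sketch below classifies candidate item-score pairs with a Random Forest; the features, training pairs, and token offsets are invented, and the paper's feature set is considerably richer.

```python
# Hedged sketch of the relation step: classify candidate (item, score) pairs
# as "has value" or not with a Random Forest. Features and training pairs are
# invented for illustration; the paper's feature set is richer.
from sklearn.ensemble import RandomForestClassifier

def pair_features(item_end: int, score_start: int, same_sentence: bool):
    """Simple features for one candidate item-score pair."""
    return [score_start - item_end,       # token distance between the entities
            int(same_sentence),           # whether both lie in one sentence
            int(score_start > item_end)]  # whether the score follows the item

# Toy training pairs: (features, label) with label 1 = "has value".
X = [pair_features(5, 7, True), pair_features(5, 40, False),
     pair_features(12, 13, True), pair_features(12, 2, True)]
y = [1, 0, 1, 0]

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# "1b level of consciousness questions: said name = 1" -> the item ends at
# token 5, the score "1" starts at token 9, both in the same sentence.
print(clf.predict([pair_features(5, 9, True)]))  # e.g. [1] -> 'has value'
```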
Collapse
Affiliation(s)
- Lin Yang
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China; Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences, Beijing 100020, China
| | - Xiaoshuo Huang
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China; School of Health Care Technology, Dalian Neusoft University of Information, Dalian 116023, China
| | - Jiayang Wang
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China
| | - Xin Yang
- China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China; National Center for Healthcare Quality Management in Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
| | - Lingling Ding
- China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China; Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
| | - Zixiao Li
- China National Clinical Research Center for Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China; National Center for Healthcare Quality Management in Neurological Diseases, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China; Department of Neurology, Beijing Tiantan Hospital, Capital Medical University, Beijing 100070, China
| | - Jiao Li
- Institute of Medical Information and Library, Chinese Academy of Medical Sciences/Peking Union Medical College, Beijing 100020, China; Key Laboratory of Medical Information Intelligent Technology, Chinese Academy of Medical Sciences, Beijing 100020, China.
| |
Collapse
|
21
|
Abeynayake HIMM, Goonetilleke RS, Wijeweera A, Reischl U. Efficacy of information extraction from bar, line, circular, bubble and radar graphs. Appl Ergon 2023; 109:103996. [PMID: 36805850 DOI: 10.1016/j.apergo.2023.103996] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 09/19/2022] [Revised: 12/02/2022] [Accepted: 02/06/2023] [Indexed: 06/18/2023]
Abstract
With the emergence of enormous amounts of data, numerous ways to visualize such data have been used. Bar, circular, line, radar, and bubble graphs, all of which are ubiquitous, were investigated for their effectiveness. Fourteen participants performed four types of evaluations: between categories (cities), within categories (transport modes within a city), across all categories, and a direct reading within a category from a graph. The representations were presented in random order, and participants were asked to respond to sixteen questions to the best of their ability after visually scanning the related graph. There were two trials on two separate days for each participant. Eye movements were recorded using an eye tracker. Bar and line graphs showed superiority over circular and radar graphs in effectiveness, efficiency, and perceived ease of use, primarily due to eye saccades. The radar graph had the worst performance. The "vibration-type" fill pattern could be improved by adding colors and symbolic fills. Design guidelines are proposed for representing data so that the presentation and communication of information are effective.
Collapse
Affiliation(s)
| | | | - Albert Wijeweera
- Department of Humanities and Social Sciences, Khalifa University, United Arab Emirates
| | - Uwe Reischl
- Department of Public Health and Population Science, Boise State University, USA
| |
Collapse
|
22
|
Eisenstein EL, Zozus MN, Garza MY, Lanham HJ, Adagarla B, Walden A, Benjamin DK, Zimmerman KO, Kumar KR; Best Pharmaceuticals for Children Act: Pediatric Trials Network Steering Committee. Assessing clinical site readiness for EHR-to-EDC automated data collection. Contemp Clin Trials 2023; 128:107144. [PMID: 36898625 DOI: 10.1016/j.cct.2023.107144] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/19/2022] [Revised: 02/13/2023] [Accepted: 03/05/2023] [Indexed: 03/11/2023]
Abstract
BACKGROUND eSource software is used to automatically copy a patient's electronic health record data into a clinical study's electronic case report form. However, there is little evidence to assist sponsors in identifying the best sites for multi-center eSource studies. METHODS We developed an eSource site readiness survey. The survey was administered to principal investigators, clinical research coordinators, and chief research information officers at Pediatric Trial Network sites. RESULTS A total of 61 respondents were included in this study (clinical research coordinators, 22; principal investigators, 20; and chief research information officers, 19). Clinical research coordinators and principal investigators ranked medication administration, medication orders, laboratory, medical history, and vital signs data as having the highest priority for automation. While most organizations used some electronic health record research functions (clinical research coordinators, 77%; principal investigators, 75%; and chief research information officers, 89%), only 21% of sites were using Fast Healthcare Interoperability Resources standards to exchange patient data with other institutions. Respondents generally gave lower readiness-for-change ratings to organizations that did not have a separate research information technology group and where researchers practiced in hospitals not operated by their medical schools. CONCLUSIONS Site readiness to participate in eSource studies is not merely a technical problem. While technical capabilities are important, organizational priorities, structure, and the site's support of clinical research functions are equally important considerations.
Collapse
|
23
|
Ter Horst H, Brazda N, Schira-Heinen J, Krebbers J, Müller HW, Cimiano P. Automatic knowledge graph population with model-complete text comprehension for pre-clinical outcomes in the field of spinal cord injury. Artif Intell Med 2023; 137:102491. [PMID: 36868686 DOI: 10.1016/j.artmed.2023.102491] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/15/2021] [Revised: 07/16/2022] [Accepted: 01/11/2023] [Indexed: 01/19/2023]
Abstract
The paradigm of evidence-based medicine requires that medical decisions be made on the basis of the best available knowledge published in the literature. Existing evidence is often summarized in the form of systematic reviews and/or meta-reviews and is rarely available in a structured form. Manual compilation and aggregation are costly, and conducting a systematic review requires considerable effort. The need to aggregate evidence arises not only in the context of clinical trials, but is also important for pre-clinical animal studies, where evidence extraction supports the translation of the most promising pre-clinical therapies into clinical trials and the optimization of clinical trial design. Aiming to develop methods that facilitate the aggregation of evidence published in pre-clinical studies, this paper presents a new system that automatically extracts structured knowledge from such publications and stores it in a so-called domain knowledge graph. The approach follows the paradigm of model-complete text comprehension, relying on guidance from a domain ontology to create a deep relational data structure that reflects the main concepts, protocol, and key findings of studies. In the domain of spinal cord injuries, a single outcome of a pre-clinical study is described by up to 103 outcome parameters. Since the problem of extracting all these variables together is intractable, we propose a hierarchical architecture that incrementally predicts semantic sub-structures according to a given data model in a bottom-up fashion. At the heart of our approach is a statistical inference method that relies on conditional random fields to infer the most likely instance of the domain model given the text of a scientific publication as input. This approach allows modeling dependencies between the different variables describing a study in a semi-joint fashion. We present a comprehensive evaluation to understand the extent to which our system can capture a study in the depth required to enable the generation of new knowledge. We conclude the article with a brief description of some applications of the populated knowledge graph and show the potential implications of our work for supporting evidence-based medicine.
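As a much-simplified illustration of CRF-based extraction, the sketch below tags outcome-value tokens with a linear-chain CRF via sklearn-crfsuite; the labels, features, and sentences are invented, and the actual system infers whole ontology-model instances rather than flat tag sequences.

```python
# Much-simplified sketch of CRF slot tagging with sklearn-crfsuite; the real
# system infers whole ontology-model instances, not just flat tags, and the
# labels/features below are illustrative assumptions.
import sklearn_crfsuite

def token_features(sent, i):
    word = sent[i]
    return {
        "lower": word.lower(),
        "is_digit": word.isdigit(),
        "prev": sent[i - 1].lower() if i > 0 else "<BOS>",
        "next": sent[i + 1].lower() if i < len(sent) - 1 else "<EOS>",
    }

train_sents = [["BBB", "score", "improved", "to", "12", "after", "treatment"]]
train_tags = [["O", "O", "O", "O", "B-OutcomeValue", "O", "O"]]

X = [[token_features(s, i) for i in range(len(s))] for s in train_sents]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
crf.fit(X, train_tags)

test = ["locomotor", "score", "reached", "15"]
print(crf.predict([[token_features(test, i) for i in range(len(test))]]))
```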
Collapse
Affiliation(s)
- Hendrik Ter Horst
- CITEC, Bielefeld University, Inspiration 1, 33619 Bielefeld, Germany.
| | - Nicole Brazda
- Neurologische Klinik, Universitätsklinikum der Heinrich-Heine-Universität Düsseldorf, Moorenstr. 5 and Center for Neuronal Regeneration, Life Science Center Düsseldorf, Merowingerplatz 1a, 40225 Düsseldorf, Germany
| | - Jessica Schira-Heinen
- Neurologische Klinik, Universitätsklinikum der Heinrich-Heine-Universität Düsseldorf, Moorenstr. 5 and Center for Neuronal Regeneration, Life Science Center Düsseldorf, Merowingerplatz 1a, 40225 Düsseldorf, Germany
| | - Julia Krebbers
- Neurologische Klinik, Universitätsklinikum der Heinrich-Heine-Universität Düsseldorf, Moorenstr. 5 and Center for Neuronal Regeneration, Life Science Center Düsseldorf, Merowingerplatz 1a, 40225 Düsseldorf, Germany
| | - Hans-Werner Müller
- Neurologische Klinik, Universitätsklinikum der Heinrich-Heine-Universität Düsseldorf, Moorenstr. 5 and Center for Neuronal Regeneration, Life Science Center Düsseldorf, Merowingerplatz 1a, 40225 Düsseldorf, Germany
| | - Philipp Cimiano
- CITEC, Bielefeld University, Inspiration 1, 33619 Bielefeld, Germany
| |
Collapse
|
24
|
Ramachandran GK, Lybarger K, Liu Y, Mahajan D, Liang JJ, Tsou CH, Yetisgen M, Uzuner Ö. Extracting medication changes in clinical narratives using pre-trained language models. J Biomed Inform 2023; 139:104302. [PMID: 36754129 DOI: 10.1016/j.jbi.2023.104302] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2022] [Revised: 01/10/2023] [Accepted: 01/29/2023] [Indexed: 02/08/2023]
Abstract
An accurate and detailed account of patient medications, including medication changes within the patient timeline, is essential for healthcare providers to provide appropriate patient care. Healthcare providers or patients themselves may initiate changes to a patient's medication. Medication changes take many forms, including changes to the prescribed medication and modifications of the associated dosage. These changes provide information about the overall health of the patient and the rationale that led to the current care. Future care can then build on the resulting state of the patient. This work explores the automatic extraction of medication change information from free-text clinical notes. The Contextual Medication Event Dataset (CMED) is a corpus of clinical notes with annotations that characterize medication changes through multiple change-related attributes, including the type of change (start, stop, increase, etc.), initiator of the change, temporality, change likelihood, and negation. Using CMED, we identify medication mentions in clinical text and propose three novel high-performing BERT-based systems that resolve the annotated medication change characteristics. We demonstrate that our proposed systems improve medication change classification performance over the initial work exploring CMED.
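As a hedged illustration of one change-related attribute, the sketch below frames the change action as multi-class sequence classification over a sentence with the medication mention marked; the checkpoint, entity markers, and label set are assumptions rather than the paper's configuration.

```python
# Hedged sketch of classifying the change action as multi-class sequence
# classification; the checkpoint, entity markers, and label set are
# illustrative assumptions, not the paper's exact configuration.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

actions = ["Start", "Stop", "Increase", "Decrease", "NoChange"]  # assumed labels
model_name = "bert-base-uncased"  # placeholder for a clinical BERT variant

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(
    model_name, num_labels=len(actions)
)

# Mark the medication mention so the classifier knows which event to resolve.
text = "Patient was asked to [MED] stop taking lisinopril [/MED] today."
inputs = tokenizer(text, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

print(actions[int(logits.argmax())])  # arbitrary until the head is fine-tuned
```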
Collapse
Affiliation(s)
| | - Kevin Lybarger
- Department of Information Sciences & Technology, George Mason University, Fairfax, VA, United States of America
| | - Yaya Liu
- Department of Information Sciences & Technology, George Mason University, Fairfax, VA, United States of America
| | - Diwakar Mahajan
- IBM T.J. Watson Research Center, Yorktown Heights, NY, United States of America
| | - Jennifer J Liang
- IBM T.J. Watson Research Center, Yorktown Heights, NY, United States of America
| | - Ching-Huei Tsou
- IBM T.J. Watson Research Center, Yorktown Heights, NY, United States of America
| | - Meliha Yetisgen
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, WA, United States of America
| | - Özlem Uzuner
- Department of Information Sciences & Technology, George Mason University, Fairfax, VA, United States of America
| |
Collapse
|
25
|
Seong D, Choi YH, Shin SY, Yi BK. Deep learning approach to detection of colonoscopic information from unstructured reports. BMC Med Inform Decis Mak 2023; 23:28. [PMID: 36750932 PMCID: PMC9903463 DOI: 10.1186/s12911-023-02121-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/23/2022] [Accepted: 01/23/2023] [Indexed: 02/09/2023] Open
Abstract
BACKGROUND Colorectal cancer is a leading cause of cancer deaths. Several screening tests, such as colonoscopy, can be used to find polyps or colorectal cancer. Colonoscopy reports are often written in unstructured narrative text. The information embedded in the reports can be used for various purposes, including colorectal cancer risk prediction, follow-up recommendation, and quality measurement. However, the availability and accessibility of unstructured text data are still insufficient despite the large amounts of accumulated data. We aimed to develop and apply deep learning-based natural language processing (NLP) methods to detect colonoscopic information. METHODS This study applied several deep learning-based NLP models to colonoscopy reports. A total of 280,668 colonoscopy reports were extracted from the clinical data warehouse of Samsung Medical Center. For 5,000 reports, procedural information and colonoscopic findings were manually annotated with 17 labels. We compared long short-term memory (LSTM) and BioBERT models to select the one with the best performance on colonoscopy reports, which was the bidirectional LSTM with conditional random fields (BiLSTM-CRF). Then, we applied word embeddings pre-trained on the large unlabeled dataset (280,668 reports) to the selected model. RESULTS The NLP model with pre-trained word embeddings performed better for most labels than the model with one-hot encoding. The F1 scores for colonoscopic findings were: 0.9564 for lesions, 0.9722 for locations, 0.9809 for shapes, 0.9720 for colors, 0.9862 for sizes, and 0.9717 for numbers. CONCLUSIONS This study applied deep learning-based clinical NLP models to extract meaningful information from colonoscopy reports. The method achieved promising results, demonstrating that it can be applied to various practical purposes.
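The abstract does not name the embedding algorithm, so the sketch below assumes word2vec via gensim for pre-training on unlabeled reports; the sentences are toy stand-ins.

```python
# Hedged sketch of pre-training word embeddings on unlabeled reports; the
# abstract does not name the algorithm, so word2vec here is an assumption.
from gensim.models import Word2Vec

# Tokenized, de-identified report sentences (toy stand-ins for 280,668 reports).
sentences = [
    ["polyp", "noted", "in", "ascending", "colon", ",", "5", "mm"],
    ["sessile", "polyp", "removed", "with", "cold", "snare"],
]

w2v = Word2Vec(sentences, vector_size=100, window=5, min_count=1, workers=4)

# The resulting matrix can initialize a BiLSTM-CRF embedding layer instead of
# one-hot encoding, which is the comparison reported in the paper.
vector = w2v.wv["polyp"]
print(vector.shape)  # (100,)
```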
Collapse
Affiliation(s)
- Donghyeong Seong
- Samsung Advanced Institute for Health Sciences and Technology (SAIHST), Sungkyunkwan University, Seoul 06355, Republic of Korea
| | - Yoon Ho Choi
- Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul 06355, Republic of Korea
| | - Soo-Yong Shin
- Department of Digital Health, SAIHST, Sungkyunkwan University, Seoul 06355, Republic of Korea; Research Institute for Future Medicine, Samsung Medical Center, Seoul 06351, Republic of Korea
| | - Byoung-Kee Yi
- Department of Artificial Intelligence Convergence, Kangwon National University, 1 Kangwondaehak-Gil, Chuncheon-si, Gangwon-do, 24341, Republic of Korea.
| |
Collapse
|
26
|
Lau W, Lybarger K, Gunn ML, Yetisgen M. Event-Based Clinical Finding Extraction from Radiology Reports with Pre-trained Language Model. J Digit Imaging 2023; 36:91-104. [PMID: 36253581 DOI: 10.1007/s10278-022-00717-5] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/26/2021] [Revised: 08/31/2022] [Accepted: 09/30/2022] [Indexed: 11/16/2022] Open
Abstract
Radiology reports contain a diverse and rich set of clinical abnormalities documented by radiologists during their interpretation of the images. Comprehensive semantic representations of radiological findings would enable a wide range of secondary use applications to support diagnosis, triage, outcomes prediction, and clinical research. In this paper, we present a new corpus of radiology reports annotated with clinical findings. Our annotation schema captures detailed representations of pathologic findings that are observable on imaging ("lesions") and other types of clinical problems ("medical problems"). The schema used an event-based representation to capture fine-grained details, including assertion, anatomy, characteristics, size, and count. Our gold standard corpus contained a total of 500 annotated computed tomography (CT) reports. We extracted triggers and argument entities using two state-of-the-art deep learning architectures, including BERT. We then predicted the linkages between trigger and argument entities (referred to as argument roles) using a BERT-based relation extraction model. We achieved the best extraction performance using a BERT model pre-trained on 3 million radiology reports from our institution: 90.9-93.4% F1 for finding triggers and 72.0-85.6% F1 for argument roles. To assess model generalizability, we used an external validation set randomly sampled from the MIMIC Chest X-ray (MIMIC-CXR) database. The extraction performance on this validation set was 95.6% for finding triggers and 79.1-89.7% for argument roles, demonstrating that the model generalized well to the cross-institutional data with a different imaging modality. We extracted the finding events from all the radiology reports in the MIMIC-CXR database and provided the extractions to the research community.
Collapse
|
27
|
Jaradeh MY, Singh K, Stocker M, Both A, Auer S. Information extraction pipelines for knowledge graphs. Knowl Inf Syst 2023; 65:1989-2016. [PMID: 36643405 PMCID: PMC9823264 DOI: 10.1007/s10115-022-01826-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/29/2021] [Revised: 12/16/2022] [Accepted: 12/25/2022] [Indexed: 01/09/2023]
Abstract
In the last decade, a large number of knowledge graph (KG) completion approaches have been proposed. Albeit effective, these efforts are disjoint, and their collective strengths and weaknesses in effective KG completion have not been studied in the literature. We extend Plumber, a framework that brings together the research community's disjoint efforts on KG completion. We add further components to the architecture of Plumber, for a total of 40 reusable components for various KG completion subtasks, such as coreference resolution, entity linking, and relation extraction. Using these components, Plumber dynamically generates suitable knowledge extraction pipelines and offers 432 distinct pipelines overall. We study the optimization problem of choosing optimal pipelines based on input sentences. To do so, we train a transformer-based classification model that extracts contextual embeddings from the input and finds an appropriate pipeline. We study the efficacy of Plumber for extracting KG triples using standard datasets over three KGs: DBpedia, Wikidata, and the Open Research Knowledge Graph. Our results demonstrate the effectiveness of Plumber in dynamically generating KG completion pipelines, outperforming all baselines agnostic of the underlying KG. Furthermore, we provide an analysis of collective failure cases, study the similarities and synergies among the integrated components, and discuss their limitations.
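The component-pipeline pattern can be illustrated with a short sketch; all component and pipeline names below are invented, and Plumber's real components wrap published KG-completion tools.

```python
# Hedged sketch of Plumber's pattern: interchangeable components composed into
# pipelines, with a classifier choosing a pipeline per input sentence. All
# component and pipeline names are invented for illustration.
from typing import Callable, List

Component = Callable[[dict], dict]  # each stage enriches a shared state dict

def coref_stub(state: dict) -> dict:
    state["resolved_text"] = state["text"]  # real components resolve "it" etc.
    return state

def linker_stub(state: dict) -> dict:
    state["entities"] = ["dbr:Example_Entity"]
    return state

def relation_stub(state: dict) -> dict:
    state["triples"] = [("dbr:Example_Entity", "dbo:relatedTo", "dbr:Other")]
    return state

def run_pipeline(components: List[Component], text: str) -> list:
    state = {"text": text}
    for component in components:
        state = component(state)
    return state["triples"]

pipelines = {"p1": [coref_stub, linker_stub, relation_stub]}
chosen = "p1"  # in Plumber, a transformer classifier picks this per sentence
print(run_pipeline(pipelines[chosen], "Ada Lovelace wrote the first program."))
```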
Collapse
Affiliation(s)
| | | | - Markus Stocker
- TIB Leibniz Information Centre for Science and Technology, Hanover, Germany
| | - Andreas Both
- Anhalt University of Applied Sciences, Bernburg, Germany
| | - Sören Auer
- TIB Leibniz Information Centre for Science and Technology, Hanover, Germany
| |
Collapse
|
28
|
Nomoto T. Keyword Extraction: A Modern Perspective. SN Comput Sci 2023; 4:92. [PMID: 36536753 DOI: 10.1007/s42979-022-01481-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 08/30/2021] [Accepted: 10/27/2022] [Indexed: 12/23/2022]
Abstract
The goal of keyword extraction is to extract from a text words or phrases indicative of what it is talking about. In this work, we look at keyword extraction from a number of different perspectives: Statistics, Automatic Term Indexing, Information Retrieval (IR), Natural Language Processing (NLP), and the emerging Neural paradigm. The 1990s saw some early attempts to tackle the issue, primarily based on text statistics [13, 17]. Meanwhile, in IR, efforts were largely led by DARPA's Topic Detection and Tracking (TDT) project [2]. In this contribution, we discuss how past innovations paved the way for more recent developments, such as LDA, PageRank, and Neural Networks. We walk through the history of keyword extraction over the last 50 years, noting differences and similarities among methods that emerged during that time. We conduct a large meta-analysis of the past literature using datasets ranging from news media, science, and medicine to business and bureaucracy, to draw a general picture of what a successful approach would look like.
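As a concrete instance of the statistical lineage the survey traces, here is a small TF-IDF keyword-scoring example with scikit-learn over a toy corpus.

```python
# A concrete instance of the statistical lineage the survey traces: TF-IDF
# keyword scoring with scikit-learn over a toy corpus.
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "stroke patients show neurological deficits measured by scale scores",
    "keyword extraction finds indicative words and phrases in a text",
    "neural networks now dominate keyword extraction research",
]

vectorizer = TfidfVectorizer(stop_words="english")
tfidf = vectorizer.fit_transform(corpus)

# Top-3 keywords of the last document by TF-IDF weight.
row = tfidf[2].toarray().ravel()
terms = vectorizer.get_feature_names_out()
top = sorted(zip(row, terms), reverse=True)[:3]
print([t for _, t in top])
```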
Collapse
|
29
|
Pereira A, Almeida JR, Lopes RP, Oliveira JL. Querying semantic catalogues of biomedical databases. J Biomed Inform 2023; 137:104272. [PMID: 36563828 DOI: 10.1016/j.jbi.2022.104272] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2022] [Revised: 11/03/2022] [Accepted: 12/12/2022] [Indexed: 12/24/2022]
Abstract
BACKGROUND Secondary use of health data is a valuable source of knowledge that boosts observational studies, leading to important discoveries in the medical and biomedical sciences. The fundamental guiding principle for performing a successful observational study is to define the research question and the approach in advance of executing the study. However, in multi-centre studies, finding suitable datasets to support the study is challenging, time-consuming, and sometimes impossible without a deep understanding of each dataset. METHODS We propose a strategy for retrieving semantically annotated biomedical datasets of interest, using an interface built by applying a methodology that transforms natural language questions into formal-language queries. The advantages of creating biomedical semantic data are enhanced by using natural language interfaces to issue complex queries without manipulating a logical query language. RESULTS Our methodology was validated using Alzheimer's disease datasets published in a European platform for sharing and reusing biomedical data. We converted the data into a semantic format using biomedical ontologies in everyday use in the biomedical community and published it as a FAIR endpoint. We have considered natural language questions of three types: single-concept questions, questions with exclusion criteria, and multi-concept questions. Finally, we analysed the performance of the question-answering module we used and its limitations. The source code is publicly available at https://bioinformatics-ua.github.io/BioKBQA/. CONCLUSION We propose a strategy for using information extracted from biomedical data and transformed into a semantic format using open biomedical ontologies. Our method uses natural language to formulate questions to be answered by this semantic data without the direct use of formal query languages.
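For a sense of the formal queries the natural-language interface shields users from, the sketch below issues a SPARQL query with SPARQLWrapper; the endpoint URL is a placeholder and the ontology IRIs are illustrative, not the platform's actual schema.

```python
# Hedged sketch of the formal query the NL interface shields users from;
# the endpoint URL and the schema implied by the IRIs are placeholders.
from SPARQLWrapper import SPARQLWrapper, JSON

sparql = SPARQLWrapper("https://example.org/fair-endpoint/sparql")
sparql.setReturnFormat(JSON)

# "Which datasets contain patients diagnosed with Alzheimer's disease?"
sparql.setQuery("""
PREFIX obo: <http://purl.obolibrary.org/obo/>
SELECT DISTINCT ?dataset WHERE {
  ?patient a obo:NCBITaxon_9606 ;          # human subject
           obo:RO_0016002 obo:DOID_10652 ; # has disease: Alzheimer's
           obo:BFO_0000050 ?dataset .      # part of dataset
}
""")

for row in sparql.query().convert()["results"]["bindings"]:
    print(row["dataset"]["value"])
```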
Collapse
Affiliation(s)
| | - João Rafael Almeida
- DETI/IEETA, LASI, University of Aveiro, Aveiro, Portugal; Department of Computation, University of A Coruña, A Coruña, Spain.
| | - Rui Pedro Lopes
- CeDRI, Polytechnic Institute of Bragança, Bragança, Portugal.
| | | |
Collapse
|
30
|
Landolsi MY, Hlaoua L, Ben Romdhane L. Information extraction from electronic medical documents: state of the art and future research directions. Knowl Inf Syst 2023; 65:463-516. [PMID: 36405956 PMCID: PMC9640816 DOI: 10.1007/s10115-022-01779-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/04/2022] [Revised: 05/04/2022] [Accepted: 10/17/2022] [Indexed: 11/10/2022]
Abstract
In the medical field, doctors must build comprehensive knowledge by reading and writing narrative documents, and they are responsible for every decision they take for their patients. Unfortunately, reading all the necessary information about drugs, diseases, and patients is exhausting given the large and ever-growing volume of documents, and as a consequence serious medical errors can occur. Information extraction is a field well placed to address this problem. It comprises several important tasks for extracting desired information from unstructured text written in natural language. The principal tasks are named entity recognition and relation extraction, since they can structure the text by extracting the relevant information. Natural language processing techniques are needed to process the narrative text and extract useful information and features. In this paper, we introduce and discuss the techniques and solutions used in these tasks. Furthermore, we outline the challenges in information extraction from medical documents. To our knowledge, this is the most comprehensive survey in the literature with an experimental analysis and suggestions for some uncovered directions.
Collapse
Affiliation(s)
- Mohamed Yassine Landolsi
- MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
| | - Lobna Hlaoua
- MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
| | - Lotfi Ben Romdhane
- MARS Research Laboratory, SDM Research Group, ISITCom, University of Sousse, Hammam Sousse, Tunisia
| |
Collapse
|
31
|
Bombieri M, Rospocher M, Ponzetto SP, Fiorini P. Machine understanding surgical actions from intervention procedure textbooks. Comput Biol Med 2023; 152:106415. [PMID: 36527782 DOI: 10.1016/j.compbiomed.2022.106415] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/24/2022] [Revised: 11/23/2022] [Accepted: 12/04/2022] [Indexed: 12/12/2022]
Abstract
The automatic extraction of procedural surgical knowledge from surgery manuals, academic papers, and other high-quality textual resources is of the utmost importance for developing knowledge-based clinical decision support systems, automatically executing some procedure steps, or summarizing the procedural information, spread throughout the texts, in a structured form usable as a study resource by medical students. In this work, we propose a first benchmark on extracting detailed surgical actions from available intervention procedure textbooks and papers. We frame the problem as a Semantic Role Labeling task. Exploiting a manually annotated dataset, we apply different Transformer-based information extraction methods. Starting from RoBERTa and BioMedRoBERTa pre-trained language models, we first investigate a zero-shot scenario and compare the obtained results with a full fine-tuning setting. We then introduce a new ad-hoc surgical language model, named SurgicBERTa, pre-trained on a large collection of surgical materials, and we compare it with the previous ones. In the assessment, we explore different dataset splits (one in-domain and two out-of-domain) and we also investigate the effectiveness of the approach in a few-shot learning scenario. Performance is evaluated on three correlated sub-tasks: predicate disambiguation, semantic argument disambiguation, and predicate-argument disambiguation. Results show that fine-tuning a pre-trained domain-specific language model achieves the highest performance on all splits and all sub-tasks. All models are publicly released.
Collapse
Affiliation(s)
- Marco Bombieri
- Department of Computer Science, University of Verona, Verona, Italy.
| | - Marco Rospocher
- Department of Foreign Languages and Literatures, University of Verona, Verona, Italy
| | | | - Paolo Fiorini
- Department of Computer Science, University of Verona, Verona, Italy
| |
Collapse
|
32
|
Wang Q, Liao J, Lapata M, Macleod M. PICO entity extraction for preclinical animal literature. Syst Rev 2022; 11:209. [PMID: 36180888 PMCID: PMC9524079 DOI: 10.1186/s13643-022-02074-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Submit a Manuscript] [Subscribe] [Scholar Register] [Received: 10/22/2021] [Accepted: 09/12/2022] [Indexed: 12/09/2022] Open
Abstract
BACKGROUND Natural language processing could assist multiple tasks in systematic reviews to reduce workload, including the extraction of PICO elements such as study populations, interventions, comparators and outcomes. The PICO framework provides a basis for the retrieval and selection for inclusion of evidence relevant to a specific systematic review question, and automatic approaches to PICO extraction have been developed, particularly for reviews of clinical trial findings. Considering the differences between preclinical animal studies and clinical trials, developing separate approaches is necessary. Facilitating preclinical systematic reviews will inform the translation from preclinical to clinical research. METHODS We randomly selected 400 abstracts from the PubMed Central Open Access database which described in vivo animal research and manually annotated these with PICO phrases for Species, Strain, methods of Induction of disease model, Intervention, Comparator and Outcome. We developed a two-stage workflow for preclinical PICO extraction. First, we fine-tuned BERT with different pre-trained modules for PICO sentence classification. Then, after removing the text irrelevant to PICO features, we explored LSTM-, CRF- and BERT-based models for PICO entity recognition. We also explored a self-training approach because of the small training corpus. RESULTS For PICO sentence classification, BERT models using all pre-trained modules achieved an F1 score of over 80%, and models pre-trained on PubMed abstracts achieved the highest F1 of 85%. For PICO entity recognition, fine-tuning BERT pre-trained on PubMed abstracts achieved an overall F1 of 71% and satisfactory F1 for Species (98%), Strain (70%), Intervention (70%) and Outcome (67%). The scores for Induction and Comparator were less satisfactory, but the F1 for Comparator improved to 50% with self-training. CONCLUSIONS Our study indicates that, of the approaches tested, BERT pre-trained on PubMed abstracts is the best for both PICO sentence classification and PICO entity recognition in preclinical abstracts. Self-training yields better performance for identifying comparators and strains.
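The self-training step can be illustrated with a compact loop: train, pseudo-label unlabeled sentences, keep only confident predictions, retrain. The classifier, threshold, and sentences below are illustrative; the paper self-trains BERT rather than the simple model shown.

```python
# Hedged sketch of self-training for a low-resource class: pseudo-label
# unlabeled sentences, keep confident ones, retrain. The classifier and
# confidence threshold are illustrative; the paper self-trains BERT.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

labeled = ["rats received vehicle injection", "mice were given saline control",
           "animals were fed a high-fat diet", "lesion volume was measured"]
y = [1, 1, 0, 0]  # 1 = sentence mentions a comparator
unlabeled = ["control animals received vehicle only", "body weight was recorded"]

clf = make_pipeline(TfidfVectorizer(), LogisticRegression())
for _ in range(3):  # a few self-training rounds
    clf.fit(labeled, y)
    proba = clf.predict_proba(unlabeled)
    confident = np.max(proba, axis=1) > 0.8
    # Move confidently pseudo-labeled sentences into the training set.
    labeled += [s for s, c in zip(unlabeled, confident) if c]
    y += [int(p[1] > 0.5) for p, c in zip(proba, confident) if c]
    unlabeled = [s for s, c in zip(unlabeled, confident) if not c]

print(clf.predict(["saline-treated controls were included"]))
```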
Collapse
Affiliation(s)
- Qianying Wang
- CCBS, Edinburgh Medical School, University of Edinburgh, Edinburgh, UK
| | - Jing Liao
- CCBS, Edinburgh Medical School, University of Edinburgh, Edinburgh, UK
| | - Mirella Lapata
- ILCC, School of Informatics, University of Edinburgh, Edinburgh, UK
| | - Malcolm Macleod
- CCBS, Edinburgh Medical School, University of Edinburgh, Edinburgh, UK.
| |
Collapse
|
33
|
Paraskevopoulos S, Smeets P, Tian X, Medema G. Using Artificial Intelligence to extract information on pathogen characteristics from scientific publications. Int J Hyg Environ Health 2022; 245:114018. [PMID: 35985219 DOI: 10.1016/j.ijheh.2022.114018] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2022] [Revised: 07/29/2022] [Accepted: 07/30/2022] [Indexed: 10/15/2022]
Abstract
Health risk assessment of environmental exposure to pathogens requires complete and up-to-date knowledge. With the rapid growth of scientific publications and the protocolization of literature reviews, an automated approach based on Artificial Intelligence (AI) techniques could help extract meaningful information from the literature and make literature reviews more efficient. The objective of this research was to determine whether it is feasible to extract both qualitative and quantitative information about the waterborne pathogen Legionella from scientific publications on PubMed, using Deep Learning and Natural Language Processing techniques. The model effectively extracted the qualitative and quantitative characteristics with a precision, recall, and F-score of 0.91, 0.80, and 0.85, respectively. The AI extraction yielded results that were comparable to manual information extraction. Overall, AI could reliably extract both qualitative and quantitative information about Legionella from the scientific literature. Our study paves the way for a better understanding of information extraction processes and is a first step towards harnessing AI to collect meaningful information on pathogen characteristics from environmental microbiology publications.
Collapse
Affiliation(s)
- Sotirios Paraskevopoulos
- KWR Water Research Institute, Groningenhaven 7, P.O. Box 1072, 3430 BB, Nieuwegein, the Netherlands; Department of Water Management, Delft University of Technology, Stevinweg 1, 2628, CN Delft, the Netherlands.
| | - Patrick Smeets
- KWR Water Research Institute, Groningenhaven 7, P.O. Box 1072, 3430 BB, Nieuwegein, the Netherlands
| | - Xin Tian
- KWR Water Research Institute, Groningenhaven 7, P.O. Box 1072, 3430 BB, Nieuwegein, the Netherlands
| | - Gertjan Medema
- KWR Water Research Institute, Groningenhaven 7, P.O. Box 1072, 3430 BB, Nieuwegein, the Netherlands; Department of Water Management, Delft University of Technology, Stevinweg 1, 2628, CN Delft, the Netherlands
| |
Collapse
|
34
|
Sunkle S, Saxena K, Patil A, Kulkarni V. AI-driven streamlined modeling: experiences and lessons learned from multiple domains. Softw Syst Model 2022; 21:1-23. [PMID: 35221860 PMCID: PMC8857636 DOI: 10.1007/s10270-022-00982-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Subscribe] [Scholar Register] [Received: 03/17/2021] [Revised: 12/05/2021] [Accepted: 01/24/2022] [Indexed: 06/14/2023]
Abstract
Model-driven technologies (MD*), considered beneficial through abstraction and automation, have not enjoyed widespread adoption in the industry. In keeping with the recent trends, using AI techniques might help the benefits of MD* outweigh their costs. Although the modeling community has started using AI techniques, it is, in our opinion, quite limited and requires a change in perspective. We provide such a perspective through five industrial case studies where we use AI techniques in different modeling activities. We discuss our experiences and lessons learned, in some cases evolving purely modeling solutions with AI techniques, and in others considering the AI aids from the beginning. We believe that these case studies can help the researchers and practitioners make sense of various artifacts and data available to them and use applicable AI techniques to enhance suitable modeling activities.
Collapse
Affiliation(s)
- Sagar Sunkle
- Tata Consultancy Services Research, Pune, 411013 India
| | - Krati Saxena
- Tata Consultancy Services Research, Pune, 411013 India
| | - Ashwini Patil
- Tata Consultancy Services Research, Pune, 411013 India
| | | |
Collapse
|
35
|
Kim S, Choi Y, Won JH, Mi Oh J, Lee H. An annotated corpus from biomedical articles to construct a drug-food interaction database. J Biomed Inform 2022; 126:103985. [PMID: 35007753 DOI: 10.1016/j.jbi.2022.103985] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/09/2021] [Revised: 12/08/2021] [Accepted: 01/03/2022] [Indexed: 11/27/2022]
Abstract
MOTIVATION While drug-food interactions (DFIs) may undermine the efficacy and safety of drugs, DFI detection has been difficult because a well-organized DFI database did not exist. To construct a DFI database and build a natural language processing system that extracts DFIs from biomedical articles, we formulated the DFI extraction tasks and manually annotated texts that could contain DFI information. In this article, we introduce a new annotated corpus for extracting DFIs, the DFI corpus. RESULTS The DFI corpus contains 2270 abstracts of biomedical articles accessible through PubMed and 2498 sentences that contain DFI and/or drug-drug interaction (DDI) information, along with a substantial amount of information about drug/food entities, evidence levels of abstracts, and relations between named entities. BERT models pre-trained on the biomedical domain achieved an F1 score of 55.0% in extracting DFI key sentences. To the best of our knowledge, the DFI corpus is the largest public corpus for drug-food interactions. AVAILABILITY AND IMPLEMENTATION Our corpus is available at https://github.com/ccadd-snu/corpus-for-DFI-extraction.
Collapse
Affiliation(s)
- Siun Kim
- Department of Applied Biomedical Engineering, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea; Department of Clinical Pharmacology and Therapeutics, Seoul National University College of Medicine and Hospital, Seoul, Korea
| | - Yoona Choi
- Department of Applied Biomedical Engineering, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea; Department of Clinical Pharmacology and Therapeutics, Seoul National University College of Medicine and Hospital, Seoul, Korea
| | - Jung-Hyun Won
- Department of Clinical Pharmacology and Therapeutics, Seoul National University College of Medicine and Hospital, Seoul, Korea; Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea
| | - Jung Mi Oh
- College of Pharmacy and Research Institute of Pharmaceutical Sciences, Seoul National University, Seoul, Korea.
| | - Howard Lee
- Department of Applied Biomedical Engineering, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea; Department of Clinical Pharmacology and Therapeutics, Seoul National University College of Medicine and Hospital, Seoul, Korea; Department of Molecular Medicine and Biopharmaceutical Sciences, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea; Center for Convergence Approaches in Drug Development, Graduate School of Convergence Science and Technology, Seoul National University, Seoul, Korea; Advanced Institute of Convergence Technology, Suwon, Korea.
| |
Collapse
|
36
|
Raja K. Biomedical Literature Mining and Its Components. Methods Mol Biol 2022; 2496:1-16. [PMID: 35713856 DOI: 10.1007/978-1-0716-2305-3_1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Published biomedical articles are the best source of knowledge for understanding the importance of biomedical entities such as diseases and drugs, and their roles in different patient population groups. The volume of biomedical literature available and being published is increasing at an exponential rate with the use of large-scale experimental techniques. Manual extraction of such information is becoming extremely difficult because of the sheer volume of available literature. Alternatively, text mining approaches have received much interest within biomedicine by providing automatic extraction of such information from unstructured biomedical text in a more structured format. Here, a text mining protocol is presented to extract patient population information and to identify disease and drug mentions in PubMed titles and abstracts, together with a simple information retrieval approach to retrieve a list of relevant documents for a user query. The text mining protocol presented in this chapter is useful for retrieving information on drugs for patients with a specific disease. The protocol covers three major text mining tasks, namely information retrieval, information extraction, and knowledge discovery.
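The information-retrieval step of such protocols commonly goes through NCBI E-utilities; the minimal sketch below retrieves PMIDs for a disease-drug query (the query string is an illustrative example, not the protocol's exact one).

```python
# Minimal sketch of the information-retrieval step via NCBI E-utilities;
# the query string is an illustrative example, not the protocol's exact one.
import requests

EUTILS = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

params = {
    "db": "pubmed",
    "term": "atrial fibrillation[Title/Abstract] AND warfarin[Title/Abstract]",
    "retmax": 20,
    "retmode": "json",
}
resp = requests.get(f"{EUTILS}/esearch.fcgi", params=params).json()
pmids = resp["esearchresult"]["idlist"]
print(len(pmids), "PMIDs:", pmids[:5])

# Titles/abstracts for these PMIDs can then be fetched with efetch.fcgi
# (db=pubmed, rettype=abstract) and passed to the extraction steps.
```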
Collapse
Affiliation(s)
- Kalpana Raja
- Regenerative Biology, Morgridge Institute for Research, Madison, WI, USA.
| |
Collapse
|
37
|
Anand D, Manoharan S, Iyyappan OR, Anand S, Raja K. Extracting Significant Comorbid Diseases from MeSH Index of PubMed. Methods Mol Biol 2022; 2496:283-299. [PMID: 35713870 DOI: 10.1007/978-1-0716-2305-3_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
Text mining is an important research area to explore for understanding disease associations and gaining insight into disease comorbidities. The reason for comorbid occurrence in any patient may be genetic, or molecular interference from other processes. Comorbidity and multimorbidity may be technically different, yet they are still inseparable in studies; they have overlapping associations and hence can be integrated for a more rational approach. Association rules, generally used to determine comorbidity, may also be helpful in predicting novel knowledge, and may even serve as an important assessment tool in surgical cases. Another approach of interest is to utilize biological vocabulary resources such as UMLS/MeSH across patient health information and analyze the interrelationships between different health conditions. The protocol presented here can be utilized to understand disease associations and analyze them at an extensive level.
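The association-rule idea can be made concrete with support and confidence computed over per-article MeSH disease terms; the transactions below are toy stand-ins for the MeSH index terms of PubMed records.

```python
# Hedged sketch of the association-rule idea over per-article MeSH disease
# terms: support and confidence for "hypertension -> diabetes". Transactions
# are toy stand-ins for MeSH index terms of PubMed records.
from itertools import combinations
from collections import Counter

transactions = [
    {"hypertension", "diabetes"},
    {"hypertension", "diabetes", "obesity"},
    {"hypertension", "stroke"},
    {"diabetes", "obesity"},
]

pair_counts = Counter()
item_counts = Counter()
for t in transactions:
    item_counts.update(t)
    pair_counts.update(frozenset(p) for p in combinations(sorted(t), 2))

n = len(transactions)
pair = frozenset({"hypertension", "diabetes"})
support = pair_counts[pair] / n
confidence = pair_counts[pair] / item_counts["hypertension"]
print(f"support={support:.2f}, confidence={confidence:.2f}")
# support=0.50, confidence=0.67 on this toy corpus
```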
Collapse
Affiliation(s)
- Dheepa Anand
- Department of Pharmacology, Cheran College of Pharmacy, Coimbatore, Tamilnadu, India
| | - Sharanya Manoharan
- Department of Bioinformatics, Stella Maris College (Autonomous), Chennai, Tamilnadu, India
| | - Oviya Ramalakshmi Iyyappan
- Department of Sciences, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Chennai, Tamilnadu, India
| | - Sadhanha Anand
- Department of Biomedical Engineering, PSG College of Technology, Coimbatore, Tamilnadu, India
| | - Kalpana Raja
- Regenerative Biology, The Morgridge Institute for Research, Madison, WI, USA.
| |
Collapse
|
38
|
Manoharan S, Iyyappan OR. A Hybrid Protocol for Finding Novel Gene Targets for Various Diseases Using Microarray Expression Data Analysis and Text Mining. Methods Mol Biol 2022; 2496:41-70. [PMID: 35713858 DOI: 10.1007/978-1-0716-2305-3_3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/15/2023]
Abstract
The advancement of technology for scientific experiments has made the amount of raw data produced enormous, giving rise to various subsets of biologists working with the genome, proteome, transcriptome, expression data, pathways, and so on. This has led to exponential growth in the scientific literature, which is moving beyond the means of manual curation and annotation for extracting important information. Microarray data are expression data whose analysis results in lists of up- and downregulated genes that are functionally annotated to ascertain their biological meaning. These genes, represented as controlled vocabularies and/or Gene Ontology terms and combined with pathway enrichment analysis, require relational and conceptual understanding with respect to a disease. This chapter describes a hybrid approach we designed for identifying novel drug-disease targets. Microarray data for muscular dystrophy are explored here as an example, and text mining approaches are utilized with the aim of identifying promising novel drug targets. Our main objective is to give a basic overview from a biologist's perspective, for whom the text mining approaches of data mining and information retrieval are fairly new concepts. The chapter aims to bridge the gap between biologists and computational text miners and bring them together for more informative research in a fast and time-efficient manner.
Collapse
Affiliation(s)
- Sharanya Manoharan
- Department of Bioinformatics, Stella Maris College (Autonomous), Chennai, Tamilnadu, India.
| | - Oviya Ramalakshmi Iyyappan
- Department of Sciences, Amrita School of Engineering, Amrita Vishwa Vidyapeetham, Chennai, Tamilnadu, India
| |
Collapse
|
39
|
Abstract
Drug-drug interactions (DDIs) and adverse drug reactions (ADRs) occur during the pharmacotherapy of multiple comorbidities and in susceptible individuals. They limit therapeutic outcomes and have a significant impact on patients' lives and health care costs. Knowledge of DDIs and ADRs is therefore required to provide better clinical outcomes for patients. The scientific community has developed various approaches to document and report occurrences of DDIs and ADRs through scientific publications. Because of the enormous growth in the number of publications and the need for up-to-date information on DDIs and ADRs, manual retrieval of these data is time-consuming and laborious, and various automated techniques have been developed to obtain them. One such technique is text mining of DDIs and ADRs from the biomedical literature published in PubMed. Here, we present a recently developed text mining protocol for predicting DDIs and ADRs from PubMed abstracts.
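A minimal sketch of the kind of sentence-level screening such a protocol might begin with: flagging sentences that co-mention two drugs together with an interaction cue word. The drug lexicon, cue patterns, and example abstract are invented for illustration and do not reproduce the published protocol.

```python
# Sketch: flag candidate DDI sentences by drug co-mention plus a trigger cue.
import re

DRUGS = {"warfarin", "aspirin", "fluconazole"}     # toy lexicon for the sketch
CUES = re.compile(r"\b(interact\w*|inhibit\w*|potentiat\w*)\b", re.I)

abstract = ("Fluconazole inhibits the metabolism of warfarin. "
            "Aspirin was well tolerated in this cohort.")

for sentence in re.split(r"(?<=[.!?])\s+", abstract):
    mentioned = {d for d in DRUGS if re.search(rf"\b{d}\b", sentence, re.I)}
    if len(mentioned) >= 2 and CUES.search(sentence):
        print("Candidate DDI:", sorted(mentioned), "|", sentence)
```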
Collapse
Affiliation(s)
| | - Kalpana Raja
- Regenerative Biology, Morgridge Institute for Research, Madison, WI, USA.
- Department of Dermatology, University of Michigan Medical School, Ann Arbor, MI, USA.
| | - Mohamad Taufik Hidayat Baharuldin
- Department of Human Anatomy, Faculty of Medicine and Health Sciences, University Putra Malaysia (UPM), Serdang, Selangor, Malaysia
- Unit of Physiology, Department of Preclinical, Faculty of Medicine and Defence Health, National Defence University of Malaysia, Kuala Lumpur, Malaysia
| |
Collapse
|
40
|
Datta S, Roberts K. Fine-grained spatial information extraction in radiology as two-turn question answering. Int J Med Inform 2021; 158:104628. [PMID: 34839119 PMCID: PMC9072592 DOI: 10.1016/j.ijmedinf.2021.104628] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/26/2021] [Revised: 10/25/2021] [Accepted: 10/25/2021] [Indexed: 11/29/2022]
Abstract
OBJECTIVES Radiology reports contain important clinical information that can be used to automatically construct fine-grained labels for applications requiring deep phenotyping. We propose a two-turn question answering (QA) method based on a transformer language model, BERT, for extracting detailed spatial information from radiology reports. We aim to demonstrate the advantage that a multi-turn QA framework provides over sequence-based methods for extracting fine-grained information. METHODS Our proposed method identifies spatial and descriptor information by answering queries given a radiology report text. We frame the extraction problem such that all the main radiology entities (e.g., finding, device, anatomy) and the spatial trigger terms (denoting the presence of a spatial relation between finding/device and anatomical location) are identified in the first turn. In the subsequent turn, various other contextual information that acts as important spatial roles with respect to a spatial trigger term are extracted along with identifying the spatial and other descriptor terms qualifying a radiological entity. The queries are constructed using separate templates for the two turns and we employ two query variations in the second turn. RESULTS When compared to the best-reported work on this task using a traditional sequence tagging method, the two-turn QA model exceeds its performance on every component. This includes promising improvements of 12, 13, and 12 points in the average F1 scores for identifying the spatial triggers, Figure, and Ground frame elements, respectively. DISCUSSION Our experiments suggest that incorporating domain knowledge in the query (a general description about a frame element) helps in obtaining better results for some of the spatial and descriptive frame elements, especially in the case of the clinical pre-trained BERT model. We further highlight that the two-turn QA approach fits well for extracting information for complex schema where the objective is to identify all the frame elements linked to each spatial trigger and finding/device/anatomy entity, thereby enabling the extraction of more comprehensive information in the radiology domain. CONCLUSION Extracting fine-grained spatial information from text in the form of answering natural language queries holds potential in achieving better results when compared to more standard sequence labeling-based approaches.
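A minimal sketch of framing extraction as multi-turn question answering with an off-the-shelf pretrained reader. This is not the authors' model or their query templates; the model name is a generic public SQuAD-tuned checkpoint and the questions are invented stand-ins for the paper's templated queries.

```python
# Sketch: two QA turns, where the answer from turn 1 parameterizes turn 2.
from transformers import pipeline

qa = pipeline("question-answering", model="deepset/roberta-base-squad2")
report = "There is a small opacity in the left lower lobe."

# Turn 1: identify the main radiology entity; Turn 2: ask for its spatial anchor.
finding = qa(question="What abnormal finding is described?", context=report)["answer"]
location = qa(question=f"Where is the {finding} located?", context=report)["answer"]
print(finding, "->", location)
```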
Collapse
Affiliation(s)
- Surabhi Datta
- School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, United States.
| | - Kirk Roberts
- School of Biomedical Informatics, The University of Texas Health Science Center, Houston, TX, United States.
| |
Collapse
|
41
|
Chapman AB, Jones A, Kelley AT, Jones B, Gawron L, Montgomery AE, Byrne T, Suo Y, Cook J, Pettey W, Peterson K, Jones M, Nelson R. ReHouSED: A novel measurement of Veteran housing stability using natural language processing. J Biomed Inform 2021; 122:103903. [PMID: 34474188 PMCID: PMC8608249 DOI: 10.1016/j.jbi.2021.103903] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/18/2021] [Revised: 08/07/2021] [Accepted: 08/27/2021] [Indexed: 10/20/2022]
Abstract
Housing stability is an important determinant of health. The US Department of Veterans Affairs (VA) administers several programs to assist Veterans experiencing unstable housing. Measuring long-term housing stability of Veterans who receive assistance from VA is difficult due to a lack of standardized structured documentation in the Electronic Health Record (EHR). However, the text of clinical notes often contains detailed information about Veterans' housing situations that may be extracted using natural language processing (NLP). We present a novel NLP-based measurement of Veteran housing stability: Relative Housing Stability in Electronic Documentation (ReHouSED). We first develop and evaluate a system for classifying documents containing information about Veterans' housing situations. Next, we aggregate information from multiple documents to derive a patient-level measurement of housing stability. Finally, we demonstrate this method's ability to differentiate between Veterans who are stably and unstably housed. Thus, ReHouSED provides an important methodological framework for the study of long-term housing stability among Veterans receiving housing assistance.
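A minimal sketch of the aggregation idea: given per-document labels (however they were classified), derive a patient-level stability measure over a time window. The label names, window, and ratio definition here are illustrative assumptions, not the published ReHouSED definition.

```python
# Sketch: aggregate document-level housing labels into a patient-level ratio.
from datetime import date

notes = [  # (note date, document-level label)
    (date(2021, 1, 5), "STABLY_HOUSED"),
    (date(2021, 2, 9), "UNSTABLY_HOUSED"),
    (date(2021, 3, 2), "STABLY_HOUSED"),
]

window = [lab for d, lab in notes if date(2021, 1, 1) <= d <= date(2021, 3, 31)]
relevant = [lab for lab in window if lab != "IRRELEVANT"]
ratio = sum(lab == "STABLY_HOUSED" for lab in relevant) / len(relevant)
print(f"Stability ratio: {ratio:.2f}")   # 0.67 -> leaning stably housed
```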
Collapse
Affiliation(s)
- Alec B Chapman
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States.
| | - Audrey Jones
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - A Taylor Kelley
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of General Internal Medicine, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - Barbara Jones
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Pulmonary and Critical Care Medicine, University of Utah and VA Healthcare System, Salt Lake City, UT, United States
| | - Lori Gawron
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Department of Obstetrics and Gynecology, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - Ann Elizabeth Montgomery
- Birmingham Veterans Administration Medical Center, Birmingham, AL, United States; School of Public Health, University of Alabama at Birmingham, Birmingham, AL, United States; U.S. Department of Veterans Affairs, National Center on Homelessness among Veterans, Tampa, FL, United States
| | - Thomas Byrne
- U.S. Department of Veterans Affairs, National Center on Homelessness among Veterans, Tampa, FL, United States; U.S. Department of Veterans Affairs, Center for Healthcare Outcomes and Implementation Research, Edith Nourse Rodgers VA Medical Center, Bedford, MA, United States; Boston University School of Social Work, Boston, MA, United States
| | - Ying Suo
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - James Cook
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - Warren Pettey
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - Kelly Peterson
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States; Veterans Health Administration Office of Analytics and Performance Integration, United States
| | - Makoto Jones
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States
| | - Richard Nelson
- Informatics, Decision-Enhancement and Analytic Sciences (IDEAS) Center, Veterans Affairs (VA) Salt Lake City Health Care System, Salt Lake City, UT, United States; Division of Epidemiology, University of Utah School of Medicine, Salt Lake City, UT, United States; U.S. Department of Veterans Affairs, National Center on Homelessness among Veterans, Tampa, FL, United States
| |
Collapse
|
42
|
Mayer T, Marro S, Cabrio E, Villata S. Enhancing evidence-based medicine with natural language argumentative analysis of clinical trials. Artif Intell Med 2021; 118:102098. [PMID: 34412851 DOI: 10.1016/j.artmed.2021.102098] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/23/2020] [Revised: 02/11/2021] [Accepted: 05/05/2021] [Indexed: 11/24/2022]
Abstract
In recent years, the healthcare domain has seen increasing interest in intelligent systems that support clinicians in their everyday tasks and activities. Evidence-Based Medicine is among the fields affected by this shift, with the aim of combining the reasoning frameworks proposed thus far in the field with mining algorithms that extract structured information from clinical trials, clinical guidelines, and Electronic Health Records. In this paper, we go beyond the state of the art by proposing a new end-to-end pipeline for argumentative outcome analysis on clinical trials. More precisely, our pipeline is composed of (i) an Argument Mining module to extract and classify argumentative components (i.e., evidence and claims of the trial) and their relations (i.e., support, attack), and (ii) an outcome analysis module to identify and classify the effects (i.e., improved, increased, decreased, no difference, no occurrence) of an intervention on the outcome of the trial, based on PICO elements. We annotated a dataset composed of more than 500 abstracts of Randomized Controlled Trials (RCTs) from the MEDLINE database, yielding a labeled dataset with 4198 argument components, 2601 argument relations, and 3351 outcomes across five diseases (neoplasm, glaucoma, hepatitis, diabetes, hypertension). We experiment with deep bidirectional transformers in combination with different neural architectures (LSTM, GRU, and CRF) and obtain a macro F1-score of 0.87 for component detection and 0.68 for relation prediction, outperforming current state-of-the-art end-to-end Argument Mining systems, and a macro F1-score of 0.80 for outcome classification.
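As a much simpler stand-in for the component-classification step, the sketch below treats claim-versus-evidence labeling as supervised sentence classification with a bag-of-words baseline. The toy sentences and labels are invented; the paper's transformer-based models are not reproduced here.

```python
# Sketch: a baseline claim/evidence classifier on toy annotated sentences.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

sentences = [
    "Mortality was significantly lower in the treatment arm.",        # evidence
    "Intraocular pressure decreased by 4 mmHg at 12 weeks.",          # evidence
    "These results suggest the drug should be first-line therapy.",   # claim
    "We conclude that the intervention is effective and safe.",       # claim
]
labels = ["evidence", "evidence", "claim", "claim"]

clf = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
clf.fit(sentences, labels)
print(clf.predict(["The authors conclude the therapy improves survival."]))
```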
Collapse
Affiliation(s)
- Tobias Mayer
- Université Côte d'Azur, CNRS, Inria I3S, France.
| | | | - Elena Cabrio
- Université Côte d'Azur, CNRS, Inria I3S, France.
| | | |
Collapse
|
43
|
Vashishth S, Newman-Griffis D, Joshi R, Dutt R, Rosé CP. Improving broad-coverage medical entity linking with semantic type prediction and large-scale datasets. J Biomed Inform 2021; 121:103880. [PMID: 34390853 PMCID: PMC8952339 DOI: 10.1016/j.jbi.2021.103880] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/30/2021] [Revised: 07/31/2021] [Accepted: 07/31/2021] [Indexed: 10/28/2022]
Abstract
OBJECTIVES Biomedical natural language processing tools are increasingly being applied for broad-coverage information extraction: extracting medical information of all types in a scientific document or a clinical note. In such broad-coverage settings, linking mentions of medical concepts to standardized vocabularies requires choosing the best candidate concepts from large inventories covering dozens of types. This study presents a novel semantic type prediction module for biomedical NLP pipelines and two automatically constructed, large-scale datasets with broad coverage of semantic types. METHODS We experiment with five off-the-shelf biomedical NLP toolkits on four benchmark datasets for medical information extraction from scientific literature and clinical notes. All toolkits adopt a staged approach of mention detection followed by two stages of medical entity linking: (1) generating a list of candidate concepts, and (2) picking the best concept among them. We introduce a semantic type prediction module to alleviate the overgeneration of candidate concepts by filtering out irrelevant candidates based on the predicted semantic type of a mention. We present MedType, a fully modular semantic type prediction model which we integrate into the existing NLP toolkits. To address the dearth of broad-coverage training data for medical information extraction, we further present WikiMed and PubMedDS, two large-scale datasets for medical entity linking. RESULTS Semantic type filtering improves medical entity linking performance across all toolkits and datasets, often by several percentage points of F1. Further, pretraining MedType on our novel datasets achieves state-of-the-art performance for semantic type prediction in biomedical text. CONCLUSIONS Semantic type prediction is a key part of building accurate NLP pipelines for broad-coverage information extraction from biomedical text. We make our source code and novel datasets publicly available to foster reproducible research.
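A minimal sketch of the filtering idea itself: drop candidate concepts whose semantic type disagrees with the type predicted for the mention. The candidate tuples and type strings below are illustrative; MedType is a neural prediction model, not this lookup.

```python
# Sketch: semantic type filtering of candidate concepts for one mention.
candidates = [  # (concept id, preferred name, semantic type) -- toy values
    ("C0011847", "Diabetes", "Disease or Syndrome"),
    ("C0011849", "Diabetes Mellitus", "Disease or Syndrome"),
    ("C1112348", "Diabetes drug screen", "Laboratory Procedure"),
]

def filter_by_type(cands, predicted_type):
    """Keep only candidates matching the mention's predicted semantic type."""
    return [(cui, name) for cui, name, styp in cands if styp == predicted_type]

# For a mention whose predicted type is "Disease or Syndrome",
# the laboratory-procedure candidate is filtered out.
print(filter_by_type(candidates, "Disease or Syndrome"))
```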
Collapse
Affiliation(s)
| | | | - Rishabh Joshi
- Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
| | - Ritam Dutt
- Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
| | - Carolyn P Rosé
- Carnegie Mellon University, 5000 Forbes Ave, Pittsburgh, PA, USA
| |
Collapse
|
44
|
Deshmukh PR, Phalnikar R. Information extraction for prognostic stage prediction from breast cancer medical records using NLP and ML. Med Biol Eng Comput 2021; 59:1751-1772. [PMID: 34297300 DOI: 10.1007/s11517-021-02399-7] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/01/2020] [Accepted: 07/01/2021] [Indexed: 11/24/2022]
Abstract
For cancer prognosis, the prognostic stage is the main factor that helps medical experts decide the optimal treatment for a patient. Specialists must study prognostic stage information in medical reports, often in unstructured form, which takes considerable review time. The main objective of this study is to propose a generic, clinical decision-unifying staging method that extracts the most reliable prognostic stage information for breast cancer from the medical records of various health institutions. Additional prognostic elements should be extracted from medical reports to identify the cancer stage, giving an exact measure of the cancer and improving quality of care. This study collected 465 pathological and clinical reports of breast cancer patients from reputed medical institutions in India; the unstructured records differed from one institution to another. Anatomic and biologic factors are extracted from the medical records using natural language processing, machine learning, and a rule-based method for prognostic stage detection. The study extracted the anatomic stage, grade, estrogen receptor (ER), progesterone receptor (PR), and human epidermal growth factor receptor 2 (HER2) status from medical reports with high accuracy and predicted the prognostic stage for both regions. The average accuracy of prognostic stage prediction was 92% in rural areas and 82% in urban areas. It was essential to combine biological and anatomical elements under a single prognostic staging method. A generic clinical decision-unifying staging method that detects the prognostic stage with high accuracy across institutions in different regions suggests that the proposed research improves the prognosis of breast cancer.
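A minimal sketch of how anatomic and biologic factors could be combined into a prognostic stage once extracted. The stage-shift rule below is a toy stand-in, not the AJCC staging tables or the authors' validated logic.

```python
# Sketch: toy rule combining anatomic stage with biomarker status and grade.
def prognostic_stage(anatomic_stage: str, grade: int, er: bool, pr: bool, her2: bool) -> str:
    stages = ["IA", "IB", "IIA", "IIB", "IIIA", "IIIB", "IIIC", "IV"]
    i = stages.index(anatomic_stage)
    if not (er or pr) and not her2 and grade == 3:   # triple-negative, high grade
        i = min(i + 1, len(stages) - 1)              # shift one stage up
    elif er and pr and grade == 1:                   # favourable biology
        i = max(i - 1, 0)                            # shift one stage down
    return stages[i]

print(prognostic_stage("IIA", grade=3, er=False, pr=False, her2=False))  # -> IIB
```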
Collapse
Affiliation(s)
- Pratiksha R Deshmukh
- School of Computer Engineering and Technology, MIT World Peace University, Pune, 411029, India; Department of Computer Engineering and Information Technology, College of Engineering, Pune, 411005, India.
| | - Rashmi Phalnikar
- School of Computer Engineering and Technology, MIT World Peace University, Pune, India, 411029
| |
Collapse
|
45
|
Almeida JR, Silva JF, Matos S, Oliveira JL. A two-stage workflow to extract and harmonize drug mentions from clinical notes into observational databases. J Biomed Inform 2021; 120:103849. [PMID: 34214696 DOI: 10.1016/j.jbi.2021.103849] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/02/2020] [Revised: 06/04/2021] [Accepted: 06/19/2021] [Indexed: 01/02/2023]
Abstract
BACKGROUND The content of the clinical notes continuously collected over patients' health histories has the potential to provide relevant information about treatments and diseases and to increase the value of the structured data available in Electronic Health Record (EHR) databases. EHR databases are currently used in observational studies that lead to important findings in the medical and biomedical sciences. However, the information present in clinical notes is not used in those studies, since the computational analysis of this unstructured data is much more complex than that of structured data. METHODS We propose a two-stage workflow that closes an existing gap in Extraction, Transformation and Loading (ETL) procedures for observational databases. The first stage of the workflow extracts prescriptions present in a patient's clinical notes, while the second stage harmonises the extracted information into its standard definition and stores the result in a common database schema used in observational studies. RESULTS We validated this methodology using two distinct data sets, in which the goal was to extract and store drug-related information in a new Observational Medical Outcomes Partnership (OMOP) Common Data Model (CDM) database. We analysed the performance of the annotator used as well as its limitations. Finally, we describe practical examples of how users can explore these datasets once migrated to OMOP CDM databases. CONCLUSION With this methodology, we demonstrate a strategy for using the information extracted from clinical notes in business intelligence tools, or for other applications such as data exploration through SQL queries. In addition, the extracted information complements data in OMOP CDM databases that were not directly available in the EHR database.
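A minimal sketch of the two-stage shape of such a workflow: (1) pull drug mentions out of a note, (2) map them to standard concept identifiers and emit rows shaped like the OMOP CDM DRUG_EXPOSURE table. The lexicon, concept IDs, and matching logic are illustrative assumptions, far simpler than the annotator used in the paper.

```python
# Sketch: extract drug mentions, then harmonize into OMOP-style rows.
import re

CONCEPTS = {"paracetamol": 1125315, "amoxicillin": 1713332}  # name -> concept_id (toy map)

note = "Patient started on paracetamol 1g and amoxicillin 500mg."

def extract_and_harmonize(text, person_id, start_date):
    rows = []
    for name, concept_id in CONCEPTS.items():
        if re.search(rf"\b{name}\b", text, re.I):                 # stage 1: extraction
            rows.append({                                         # stage 2: harmonization
                "person_id": person_id,
                "drug_concept_id": concept_id,
                "drug_exposure_start_date": start_date,
            })
    return rows

print(extract_and_harmonize(note, person_id=42, start_date="2021-06-01"))
```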
Collapse
Affiliation(s)
- João Rafael Almeida
- DETI/IEETA, University of Aveiro, Aveiro, Portugal; Department of Computation, University of A Coruña, A Coruña, Spain.
| | | | - Sérgio Matos
- DETI/IEETA, University of Aveiro, Aveiro, Portugal.
| | | |
Collapse
|
46
|
Miettinen J, Tanskanen T, Degerlund H, Nevala A, Malila N, Pitkäniemi J. Accurate pattern-based extraction of complex Gleason score expressions from pathology reports. J Biomed Inform 2021; 120:103850. [PMID: 34182148 DOI: 10.1016/j.jbi.2021.103850] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/02/2020] [Revised: 04/25/2021] [Accepted: 06/19/2021] [Indexed: 11/20/2022]
Abstract
PURPOSE The Gleason score is an important grading factor for prostate cancer. Gleason scores can be extracted from pathology report texts using regular expressions, but previously developed programmes have targeted only relatively simple Gleason score expressions. We developed a programme capable of extracting complex expressions as well; it is relatively easy to adapt to other languages and datasets. METHODS We developed and evaluated our regular-expression-based programme using manually processed pathology reports of prostate cancer cases diagnosed in Finland in 2016-2017. Both simple and complex Gleason score expressions were targeted. We measured the performance of our programme using recall, precision, and the F1 score. The proportion of complex Gleason score expressions was estimated as the complement of the recall when only addition expressions (e.g. "Gleason 3 + 4") were targeted. RESULTS The detection of values (scores and score components) is based on mandatory keywords before or after the value. The programme favours precision over recall by primarily allowing for lists of optional expressions between keyword-value pairs and only secondarily allowing for arbitrary expressions. The programme is straightforward to adapt to new datasets by modifying the lists of mandatory and optional expressions. The full and addition-only programmes had 92% (95% CI: [90%, 95%]) and 65% ([61%, 70%]) recall, respectively, and both had high precision (98% [97%, 99%] and 100% [99%, 100%]). The estimated proportion of complex Gleason score expressions was therefore 100% - 65% = 35%. CONCLUSIONS Even complex Gleason score expressions can be extracted with high recall and precision using regular expressions. We recommend implementing automated Gleason score extraction where possible by adapting our validated programme.
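A minimal sketch of what pattern-based Gleason extraction looks like: a regex for simple addition expressions plus a keyword-anchored fallback for bare scores. The published programme handles far more linguistic variation than these two toy patterns.

```python
# Sketch: regex extraction of Gleason scores from report text.
import re

ADDITION = re.compile(r"gleason\D{0,20}?(\d)\s*\+\s*(\d)", re.I)
BARE = re.compile(r"gleason\s*(?:score)?\D{0,10}?(\d{1,2})\b", re.I)

def extract_gleason(text):
    m = ADDITION.search(text)             # try "Gleason ... 3 + 4" first
    if m:
        a, b = int(m.group(1)), int(m.group(2))
        return {"primary": a, "secondary": b, "score": a + b}
    m = BARE.search(text)                 # fall back to a bare total score
    return {"score": int(m.group(1))} if m else None

print(extract_gleason("Adenocarcinoma, Gleason score 3 + 4 = 7."))
```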
Collapse
|
47
|
Akkasi A, Moens MF. Causal relationship extraction from biomedical text using deep neural models: A comprehensive survey. J Biomed Inform 2021; 119:103820. [PMID: 34044157 DOI: 10.1016/j.jbi.2021.103820] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/10/2020] [Revised: 05/08/2021] [Accepted: 05/15/2021] [Indexed: 01/10/2023]
Abstract
The identification of causal relationships between events or entities in biomedical texts is of great importance for creating scientific knowledge bases and is also a fundamental natural language processing (NLP) task. A causal (cause-effect) relation is defined as an association between two events in which the first must occur before the second. Although this task is an open problem in artificial intelligence, and despite its important role in information extraction from the biomedical literature, very few works have considered it. With the advent of new techniques in machine learning, especially deep neural networks, research has increasingly addressed this problem. This paper summarizes state-of-the-art research, its applications, existing datasets, and remaining challenges. For this survey we implemented and evaluated various techniques, including a Multiview CNN (MVC), attention-based BiLSTM models, and state-of-the-art word embedding models such as those obtained with bidirectional language models (ELMo) and transformer architectures (BioBERT). In addition, we evaluated a graph LSTM as well as a baseline rule-based system. We investigated the class imbalance problem as an innate property of annotated data in this type of task. The results show that the performance of state-of-the-art systems can be improved considerably when a simple random oversampling technique is used for data augmentation to reduce class imbalance.
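A minimal sketch of the data-augmentation step the survey evaluates: random oversampling of the minority (causal) class before training. The toy sentences and labels are invented; the survey's neural models are not shown.

```python
# Sketch: random oversampling to balance causal vs non-causal examples.
import random

random.seed(0)
examples = [("A causes B", 1), ("A and B were measured", 0),
            ("B is unrelated to A", 0), ("C was observed", 0)]

causal = [e for e in examples if e[1] == 1]
noncausal = [e for e in examples if e[1] == 0]
# Duplicate random minority-class examples until the classes are balanced.
oversampled = causal + [random.choice(causal) for _ in range(len(noncausal) - len(causal))]
balanced = oversampled + noncausal
print(sum(1 for _, y in balanced if y == 1), "causal vs", len(noncausal), "non-causal")
```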
Collapse
|
48
|
Lybarger K, Ostendorf M, Thompson M, Yetisgen M. Extracting COVID-19 diagnoses and symptoms from clinical text: A new annotated corpus and neural event extraction framework. J Biomed Inform 2021; 117:103761. [PMID: 33781918 PMCID: PMC7997694 DOI: 10.1016/j.jbi.2021.103761] [Citation(s) in RCA: 11] [Impact Index Per Article: 3.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/14/2020] [Revised: 03/02/2021] [Accepted: 03/20/2021] [Indexed: 12/29/2022]
Abstract
Coronavirus disease 2019 (COVID-19) is a global pandemic. Although much has been learned about the novel coronavirus since its emergence, there are many open questions related to tracking its spread, describing symptomology, predicting the severity of infection, and forecasting healthcare utilization. Free-text clinical notes contain critical information for resolving these questions. Data-driven, automatic information extraction models are needed to use this text-encoded information in large-scale studies. This work presents a new clinical corpus, referred to as the COVID-19 Annotated Clinical Text (CACT) Corpus, which comprises 1,472 notes with detailed annotations characterizing COVID-19 diagnoses, testing, and clinical presentation. We introduce a span-based event extraction model that jointly extracts all annotated phenomena, achieving high performance in identifying COVID-19 and symptom events with associated assertion values (0.83-0.97 F1 for events and 0.73-0.79 F1 for assertions). Our span-based event extraction model outperforms an extractor built on MetaMapLite for the identification of symptoms with assertion values. In a secondary use application, we predicted COVID-19 test results using structured patient data (e.g. vital signs and laboratory results) and automatically extracted symptom information, to explore the clinical presentation of COVID-19. Automatically extracted symptoms improve COVID-19 prediction performance, beyond structured data alone.
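A minimal sketch of the secondary-use idea: concatenating structured data with automatically extracted symptom flags as features for test-result prediction. The feature names, values, and labels below are invented for illustration, not the study's data or model.

```python
# Sketch: structured vitals plus extracted symptom flags as predictor features.
from sklearn.linear_model import LogisticRegression

# Feature order: [temperature, heart rate, cough_extracted, anosmia_extracted]
X = [[38.5, 95, 1, 1], [36.8, 70, 0, 0], [37.9, 88, 1, 0], [36.6, 64, 0, 0]]
y = [1, 0, 1, 0]   # COVID-19 test result (toy labels)

model = LogisticRegression(max_iter=1000).fit(X, y)
print(model.predict([[38.2, 90, 1, 1]]))   # predicted result for a new patient
```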
Collapse
Affiliation(s)
- Kevin Lybarger
- Biomedical & Health Informatics, University of Washington, Box 358047, Seattle, WA 98109, USA.
| | - Mari Ostendorf
- Department of Electrical & Computer Engineering, University of Washington, Campus Box 352500 185, Seattle, WA 98195-2500, USA
| | - Matthew Thompson
- Department of Family Medicine, University of Washington, Box 354696, Seattle, WA 98195-2500, USA
| | - Meliha Yetisgen
- Biomedical & Health Informatics, University of Washington, Box 358047, Seattle, WA 98109, USA
| |
Collapse
|
49
|
Hegazi MO, Al-Dossari Y, Al-Yahy A, Al-Sumari A, Hilal A. Preprocessing Arabic text on social media. Heliyon 2021; 7:e06191. [PMID: 33644469 PMCID: PMC7895730 DOI: 10.1016/j.heliyon.2021.e06191] [Citation(s) in RCA: 7] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/07/2019] [Revised: 05/19/2020] [Accepted: 02/01/2021] [Indexed: 11/04/2022] Open
Abstract
Currently, social media plays an important role in daily life, and millions of people use it for different purposes. Large amounts of data flow through online networks every second, and these data contain valuable information that can be extracted if they are properly processed and analyzed. However, most processing results are affected by preprocessing difficulties. This paper presents an approach for extracting information from social media Arabic text. It provides an integrated solution to the challenges of preprocessing Arabic text on social media in four stages: data collection, cleaning, enrichment, and availability. The preprocessed Arabic text is stored in structured database tables to provide a useful corpus to which information extraction and data analysis algorithms can be applied. The experiment in this study shows that the proposed approach yields a useful, full-featured dataset and valuable information. The resulting dataset presents the Arabic text at three structured levels with more than 20 features. Additionally, the experiment provides valuable processed results such as topic classification and sentiment analysis.
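A minimal sketch of common Arabic normalization steps that a cleaning stage like this typically includes (diacritic removal and letter unification); the paper's full pipeline also covers collection, enrichment, and storage, which are not shown here.

```python
# Sketch: standard Arabic text normalization for the cleaning stage.
import re

DIACRITICS = re.compile(r"[\u064B-\u0652\u0640]")   # tashkeel marks + tatweel

def normalize_arabic(text: str) -> str:
    text = DIACRITICS.sub("", text)
    text = re.sub("[\u0622\u0623\u0625]", "\u0627", text)  # alef variants -> bare alef
    text = text.replace("\u0629", "\u0647")                # taa marbuta -> haa
    text = text.replace("\u0649", "\u064A")                # alef maqsura -> yaa
    return re.sub(r"\s+", " ", text).strip()

print(normalize_arabic("أَهْلاً   بالعالمِ"))   # -> "اهلا بالعالم"
```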
Collapse
Affiliation(s)
- Mohamed Osman Hegazi
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
| | - Yasser Al-Dossari
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
| | - Abdullah Al-Yahy
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
| | - Abdulaziz Al-Sumari
- Department of Computer Science, College of Computer Engineering and Sciences, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
| | - Anwer Hilal
- Department of Computer and Self Development, Preparatory Year Deanship, Prince Sattam Bin Abdulaziz University, Al-Kharj 11942, Saudi Arabia
| |
Collapse
|
50
|
Abstract
Meta-analysis is recognized as the best means of objectively evaluating and synthesizing the evidence on a particular issue. To give researchers a better understanding of the meta-analysis process, we present an overall introduction covering comprehensive assessment of the literature, goals, advantages, main steps, and article structure.
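As a worked numeric illustration of one core step in the meta-analysis process, the sketch below performs inverse-variance (fixed-effect) pooling of study effect sizes. The three studies are hypothetical.

```python
# Sketch: fixed-effect inverse-variance pooling of study effects.
import math

studies = [  # (effect size, standard error) -- hypothetical values
    (0.30, 0.10),
    (0.10, 0.15),
    (0.25, 0.08),
]

weights = [1 / se**2 for _, se in studies]                 # w_i = 1 / SE_i^2
pooled = sum(w * e for (e, _), w in zip(studies, weights)) / sum(weights)
pooled_se = math.sqrt(1 / sum(weights))
print(f"Pooled effect: {pooled:.3f} (95% CI {pooled - 1.96*pooled_se:.3f} "
      f"to {pooled + 1.96*pooled_se:.3f})")
```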
Collapse
Affiliation(s)
- Xu Fang
- Department of Anesthesiology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, China
| | - Nan Zhao
- Department of Anesthesiology, Affiliated Stomatology Hospital of Zunyi Medical University, Zunyi, Guizhou, China
| | - Zhao‐Qiong Zhu
- Department of Anesthesiology, Affiliated Hospital of Zunyi Medical University, Zunyi, Guizhou, China
| |
Collapse
|