1
Nedoshivina L, Halimi A, Bettencourt-Silva J, Braghin S. Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering. AMIA Joint Summits on Translational Science Proceedings 2024; 2024:85-94. [PMID: 38827069 PMCID: PMC11141830] [Indexed: 06/04/2024]
Abstract
The volume of information, and in particular personal information, generated each day is increasing at a staggering rate. The ability to leverage such information depends greatly on being able to satisfy the many compliance and privacy regulations that are appearing all over the world. We present READI (Relation Extraction-based Approach for De-Identification), a utility-preserving framework for de-identifying unstructured documents. READI leverages Named Entity Recognition and Relation Extraction technology to improve the quality of entity detection, thus improving the overall quality of the data de-identification process. In this proof-of-concept study, we evaluate the proposed approach on two different datasets and compare it with existing state-of-the-art approaches. We show that READI notably reduces the number of false positives and improves the utility of the de-identified text.
2
Heider PM, Meystre SM. An Extensible Evaluation Framework Applied to Clinical Text Deidentification Natural Language Processing Tools: Multisystem and Multicorpus Study. J Med Internet Res 2024; 26:e55676. [PMID: 38805692 PMCID: PMC11167315 DOI: 10.2196/55676] [Received: 12/20/2023] [Revised: 04/11/2024] [Accepted: 04/13/2024] [Indexed: 05/30/2024] Open
Abstract
BACKGROUND Clinical natural language processing (NLP) researchers need access to directly comparable evaluation results for applications such as text deidentification across a range of corpus types and the means to easily test new systems or corpora within the same framework. Current systems, reported metrics, and the personally identifiable information (PII) categories evaluated are not easily comparable. OBJECTIVE This study presents an open-source and extensible end-to-end framework for comparing clinical NLP system performance across corpora even when the annotation categories do not align. METHODS As a use case for this framework, we use 6 off-the-shelf text deidentification systems (ie, CliniDeID, deid from PhysioNet, MITRE Identity Scrubber Toolkit [MIST], NeuroNER, National Library of Medicine [NLM] Scrubber, and Philter) across 3 standard clinical text corpora for the task (2 of which are publicly available) and 1 private corpus (all in English), with annotation categories that are not directly analogous. The framework is built on shell scripts that can be extended to include new systems, corpora, and performance metrics. We present this open tool, multiple means for aligning PII categories during evaluation, and our initial timing and performance metric findings. Code for running this framework, with all settings needed to run all system-corpus pairs, is available via Codeberg and GitHub. RESULTS From this case study, we found large differences in processing speed between systems. The fastest system (ie, MIST) processed an average of 24.57 (SD 26.23) notes per second, while the slowest (ie, CliniDeID) processed an average of 1.00 notes per second. No system uniformly outperformed the others at identifying PII across corpora and categories. Instead, a rich tapestry of performance trade-offs emerged for PII categories.
CliniDeID and Philter prioritize recall over precision (with an average recall 6.9 and 11.2 points higher, respectively, for partially matching spans of text matching any PII category), while the other 4 systems consistently have higher precision (with MIST's precision scoring 20.2 points higher, NLM Scrubber scoring 4.4 points higher, NeuroNER scoring 7.2 points higher, and deid scoring 17.1 points higher). The macroaverage recall across corpora for identifying names, one of the more sensitive PII categories, included deid (48.8%) and MIST (66.9%) at the low end and NeuroNER (84.1%), NLM Scrubber (88.1%), and CliniDeID (95.9%) at the high end. A variety of metrics across categories and corpora are reported with a wider variety (eg, F2-score) available via the tool. CONCLUSIONS NLP systems in general and deidentification systems and corpora in our use case tend to be evaluated in stand-alone research articles that only include a limited set of comparators. We hold that a single evaluation pipeline across multiple systems and corpora allows for more nuanced comparisons. Our open pipeline should reduce barriers to evaluation and system advancement.
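The "partially matching spans" scoring mentioned in this abstract can be illustrated with a minimal sketch. It assumes spans are character offsets with an exclusive end and that any overlap counts as a match; the function names and example offsets are hypothetical and are not taken from the authors' framework.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive


def overlaps(a: Span, b: Span) -> bool:
    """Return True if the two half-open spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def partial_match_scores(gold: List[Span], predicted: List[Span]) -> Tuple[float, float]:
    """Precision and recall under partial (any-overlap) span matching."""
    # A predicted span is correct if it overlaps at least one gold span.
    correct_pred = sum(any(overlaps(p, g) for g in gold) for p in predicted)
    # A gold span is found if at least one predicted span overlaps it.
    found_gold = sum(any(overlaps(g, p) for p in predicted) for g in gold)
    precision = correct_pred / len(predicted) if predicted else 0.0
    recall = found_gold / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical offsets: three annotated PII spans, four system predictions.
gold_spans = [(0, 10), (25, 32), (40, 52)]
predicted_spans = [(2, 8), (24, 33), (60, 65), (70, 75)]
precision, recall = partial_match_scores(gold_spans, predicted_spans)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A system that favors recall over precision, as the abstract describes for CliniDeID and Philter, would tend to produce more predicted spans, raising the second count at the cost of the first.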
Affiliation(s)
- Paul M Heider: Biomedical Informatics Center, Medical University of South Carolina, Charleston, SC, United States
- Stéphane M Meystre: Institute of Digital Technologies for Personalised Healthcare (MeDiTech), University of Applied Sciences and Arts of Southern Switzerland, Lugano, Switzerland
3
Kovačević A, Bašaragin B, Milošević N, Nenadić G. De-identification of clinical free text using natural language processing: A systematic review of current approaches. Artif Intell Med 2024; 151:102845. [PMID: 38555848 DOI: 10.1016/j.artmed.2024.102845] [Received: 06/20/2023] [Revised: 03/13/2024] [Accepted: 03/18/2024] [Indexed: 04/02/2024]
Abstract
BACKGROUND Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process. OBJECTIVES Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field. METHODS A systematic search in PubMed, Web of Science, and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance. RESULTS A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora. CONCLUSION Earlier de-identification approaches aimed at English were mainly rule and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks.
Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98%) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be more thoroughly assessed with different measures to judge their reliability to safely de-identify data in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.
Affiliation(s)
- Aleksandar Kovačević: The University of Novi Sad, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21002 Novi Sad, Serbia
- Bojana Bašaragin: The Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia
- Nikola Milošević: The Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia; Bayer A.G., Research and Development, Mullerstrasse 173, Berlin 13342, Germany
- Goran Nenadić: The University of Manchester, Department of Computer Science, Manchester, United Kingdom
4
Barman H, Venkateswaran S, Santo AD, Yoo U, Silvert E, Rao K, Raghunathan B, Kottschade LA, Block MS, Chandler GS, Zalis J, Wagner TE, Mohindra R. Identification and Characterization of Immune Checkpoint Inhibitor-Induced Toxicities From Electronic Health Records Using Natural Language Processing. JCO Clin Cancer Inform 2024; 8:e2300151. [PMID: 38687915 PMCID: PMC11161244 DOI: 10.1200/cci.23.00151] [Received: 08/15/2023] [Revised: 01/09/2024] [Accepted: 03/01/2024] [Indexed: 05/02/2024] Open
Abstract
PURPOSE Immune checkpoint inhibitors (ICIs) have revolutionized cancer treatment, yet their use is associated with immune-related adverse events (irAEs). Estimating the prevalence and patient impact of these irAEs in the real-world data setting is critical for characterizing the benefit/risk profile of ICI therapies beyond the clinical trial population. Diagnosis codes, such as International Classification of Diseases codes, do not comprehensively illustrate a patient's care journey and offer no insight into drug-irAE causality. This study aims to capture the relationship between ICIs and irAEs more accurately by using augmented curation (AC), a natural language processing-based innovation, on unstructured data in electronic health records. METHODS In a cohort of 9,290 patients treated with ICIs at Mayo Clinic from 2005 to 2021, we compared the prevalence of irAEs using diagnosis codes and AC models, which classify drug-irAE pairs in clinical notes with implied textual causality. Four illustrative irAEs with high patient impact (myocarditis, encephalitis, pneumonitis, and severe cutaneous adverse reactions, abbreviated as MEPS) were analyzed using corticosteroid administration and ICI discontinuation as proxies of severity. RESULTS For MEPS, only 70% (n = 118) of patients found by AC were also identified by diagnosis codes. Using AC models, patients with MEPS received corticosteroids for their respective irAE 82% of the time and permanently discontinued the ICI because of the irAE 35.9% (n = 115) of the time. CONCLUSION Overall, AC models enabled more accurate identification and assessment of patient impact of ICI-induced irAEs not found using diagnosis codes, demonstrating a novel and more efficient strategy to assess real-world clinical outcomes in patients treated with ICIs.
5
Liu J, Gupta S, Chen A, Wang CK, Mishra P, Dai HJ, Wong ZSY, Jonnagaddala J. OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study. J Med Internet Res 2023; 25:e48145. [PMID: 38055317 PMCID: PMC10733816 DOI: 10.2196/48145] [Received: 04/13/2023] [Revised: 07/26/2023] [Accepted: 11/22/2023] [Indexed: 12/07/2023] Open
Abstract
BACKGROUND Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must be removed in several cases to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in deidentification. However, very few studies investigated the combination of transformer-based language models and rules. OBJECTIVE The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embedding and transformer-based language models. METHODS In this study, we present a hybrid deidentification pipeline called OpenDeID, which is developed using an Australian multicenter EHR-based corpus called OpenDeID Corpus. The OpenDeID corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models. RESULTS The OpenDeID achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The OpenDeID pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time. CONCLUSIONS The OpenDeID pipeline is a hybrid deidentification pipeline to deidentify SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.
Affiliation(s)
- Jiaxing Liu: School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Aipeng Chen: School of Computer Science and Engineering, UNSW, Sydney, Australia
- Chen-Kai Wang: Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Hong-Jie Dai: School of Post-Baccalaureate Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Zoie Shui-Yee Wong: Graduate School of Public Health, St. Luke's International University, Tokyo, Japan; The Kirby Institute, University of New South Wales, Sydney, Australia
- Jitendra Jonnagaddala: School of Population Health, UNSW Sydney, Kensington, Australia; NMC Royal Hospital, Khalifa City, Abu Dhabi, United Arab Emirates
6
Awasthi S, Sachdeva N, Gupta Y, Anto AG, Asfahan S, Abbou R, Bade S, Sood S, Hegstrom L, Vellanki N, Alger HM, Babu M, Medina-Inojosa JR, McCully RB, Lerman A, Stampehl M, Barve R, Attia ZI, Friedman PA, Soundararajan V, Lopez-Jimenez F. Identification and risk stratification of coronary disease by artificial intelligence-enabled ECG. EClinicalMedicine 2023; 65:102259. [PMID: 38106563 PMCID: PMC10725070 DOI: 10.1016/j.eclinm.2023.102259] [Received: 05/12/2023] [Revised: 09/20/2023] [Accepted: 09/22/2023] [Indexed: 12/19/2023] Open
Abstract
Background Atherosclerotic cardiovascular disease (ASCVD) is the leading cause of death worldwide, driven primarily by coronary artery disease (CAD). ASCVD risk estimators such as the pooled cohort equations (PCE) facilitate risk stratification and primary prevention of ASCVD but their accuracy is still suboptimal. Methods Using deep electronic health record data from 7,116,209 patients seen at 70+ hospitals and clinics across 5 states in the USA, we developed an artificial intelligence-based electrocardiogram analysis tool (ECG-AI) to detect CAD and assessed the additive value of ECG-AI-based ASCVD risk stratification to the PCE. We created independent ECG-AI models using separate neural networks including subjects without known history of ASCVD, to identify coronary artery calcium (CAC) score ≥300 Agatston units by computed tomography, obstructive CAD by angiography or procedural intervention, and regional left ventricular akinesis in ≥1 segment by echocardiogram, as a reflection of possible prior myocardial infarction (MI). These were used to assess the utility of ECG-AI-based ASCVD risk stratification in a retrospective observational study consisting of patients with PCE scores and no prior ASCVD. The study period covered all available digitized EHR data, with the first available ECG in 1987 and the last in February 2023. Findings ECG-AI for identifying CAC ≥300, obstructive CAD, and regional akinesis achieved area under the receiver operating characteristic (AUROC) values of 0.88, 0.85, and 0.94, respectively. An ensembled ECG-AI identified 3, 5, and 10-year risk for acute coronary events and mortality independently and additively to PCE. Hazard ratios for acute coronary events over 3-years in patients without ASCVD that tested positive on 1, 2, or 3 versus 0 disease-specific ECG-AI models at cohort entry were 2.41 (2.14-2.71), 4.23 (3.74-4.78), and 11.75 (10.2-13.52), respectively. Similar stratification was observed in cohorts stratified by PCE or age. 
Interpretation ECG-AI has potential to address unmet need for accessible risk stratification in patients in whom PCE under, over, or insufficiently estimates ASCVD risk, and in whom risk assessment over time periods shorter than 10 years is desired. Funding Anumana.
Affiliation(s)
- Samir Awasthi: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Nikhil Sachdeva: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Yash Gupta: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Ausath G. Anto: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Shahir Asfahan: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Ruben Abbou: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Sairam Bade: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Sanyam Sood: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Lars Hegstrom: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Nirupama Vellanki: nference, Inc, One Main Street, Cambridge, MA, USA; Beth Israel Deaconess Medical Center, Boston, MA, USA
- Heather M. Alger: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Melwin Babu: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Mark Stampehl: Novartis Pharmaceuticals Corporation, East Hanover, NJ, USA
- Rakesh Barve: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
- Venky Soundararajan: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
7
Barman H, Sikirica V, Carlson K, Silvert E, Carlson KB, Boyer S, Glaser R, Morava E, Wagner T, Lanpher B. Retrospective study of propionic acidemia using natural language processing in Mayo Clinic electronic health record data. Mol Genet Metab 2023; 140:107695. [PMID: 37708666 DOI: 10.1016/j.ymgme.2023.107695] [Received: 06/29/2023] [Revised: 08/31/2023] [Accepted: 08/31/2023] [Indexed: 09/16/2023]
Abstract
BACKGROUND Propionic acidemia (PA) is a rare autosomal recessive organic acidemia that classically presents within the first days of life with a metabolic crisis or via newborn screening and is confirmed with laboratory tests. Limited data exist on the natural history of patients with PA describing presentation, treatments, and clinical outcomes. OBJECTIVE To retrospectively describe the natural history of patients with PA in a clinical setting from a real-world database using both structured and unstructured electronic health record (EHR) data using novel data extraction techniques in a unique care setting. DESIGN/METHODS This retrospective study used EHR data to identify patients with PA seen at the Mayo Clinic. Unstructured clinical text (medical notes, pathology reports) were analyzed using augmented curation natural language processing models to enhance analysis of data extracted by structured data fields (International Classification of Diseases 9th or 10th revision [ICD-9/-10] codes, Current Procedural Terminology [CPT] codes, and medication orders). De-identified health records were also manually reviewed by clinical scientists to ensure data accuracy and completeness. The index date was defined as the patient's date of PA diagnosis at the Mayo Clinic. Results were reported as aggregate descriptive statistics relative to patients' index dates. Complications, therapeutic interventions, laboratory tests, procedures, and hospitalization encounters related to PA were described at and within 6 months of the patient's index date, and from medical history available before the index date. RESULTS In total, 13 patients with PA were identified, with visits occurring from 1998 to 2022. Age at diagnosis ranged from birth to 3 years; age at initial evaluation at the Mayo Clinic ranged from 3 days to 28 years. The mean number of Mayo Clinic outpatient visits was 31 (median duration of care, 2 years). 
PA-related complications were documented in 85% of patients and included nutritional difficulties (46%), metabolic decompensation events (MDEs; 38%), neurologic abnormalities (38%), and cardiomyopathy (7%). One pair of affected siblings had mild symptoms and no complications or MDEs. All 5 patients with a history of MDEs presented with developmental delays. Among patients with MDEs, the mean frequency of outpatient clinical care visits was 10 per year, and 3 patients required inpatient hospitalization (mean duration, 16 days). The incidence of severe complications was higher among patients with MDEs than those without MDEs. Of the patients with MDEs, 2 experienced crises while receiving treatment at the Mayo Clinic, with 9 total MDEs occurring between the 2 patients. Symptoms at presentation included hyperammonemia (78%), fever and/or decreased nutritional intake (67%), hyperglycemia/hypoglycemia (56%), intercurrent upper respiratory infection and/or lethargy (44%), constipation (33%), altered mental status (33%), and cough (33%). CONCLUSIONS This study highlights the range and frequency of clinical outcomes experienced by patients with PA and demonstrates the clinical burden of MDEs.
Affiliation(s)
- Hannah Barman: nference, One Main Street, Suite 400, East Arcade, 4th Floor, Cambridge, MA 02142, USA
- Vanja Sikirica: Moderna, Inc., 200 Technology Sq, Cambridge, MA 02139, USA
- Katherine Carlson: nference, One Main Street, Suite 400, East Arcade, 4th Floor, Cambridge, MA 02142, USA
- Eli Silvert: nference, One Main Street, Suite 400, East Arcade, 4th Floor, Cambridge, MA 02142, USA
- Suzanne Boyer: Division of Clinical Genomics, Mayo Clinic, 19th Floor, 200 First St. SW, Rochester, MN 55905, USA
- Ruchira Glaser: Moderna, Inc., 200 Technology Sq, Cambridge, MA 02139, USA
- Eva Morava: Division of Clinical Genomics, Mayo Clinic, 19th Floor, 200 First St. SW, Rochester, MN 55905, USA
- Tyler Wagner: nference, One Main Street, Suite 400, East Arcade, 4th Floor, Cambridge, MA 02142, USA
- Brendan Lanpher: Division of Clinical Genomics, Mayo Clinic, 19th Floor, 200 First St. SW, Rochester, MN 55905, USA
8
Durango MC, Torres-Silva EA, Orozco-Duque A. Named Entity Recognition in Electronic Health Records: A Methodological Review. Healthc Inform Res 2023; 29:286-300. [PMID: 37964451 PMCID: PMC10651400 DOI: 10.4258/hir.2023.29.4.286] [Received: 04/21/2023] [Revised: 07/29/2023] [Accepted: 09/03/2023] [Indexed: 11/16/2023] Open
Abstract
OBJECTIVES A substantial portion of the data contained in Electronic Health Records (EHR) is unstructured, often appearing as free text. This format restricts its potential utility in clinical decision-making. Named entity recognition (NER) methods address the challenge of extracting pertinent information from unstructured text. The aim of this study was to outline the current NER methods and trace their evolution from 2011 to 2022. METHODS We conducted a methodological literature review of NER methods, with a focus on distinguishing the classification models, the types of tagging systems, and the languages employed in various corpora. RESULTS Several methods have been documented for automatically extracting relevant information from EHRs using natural language processing techniques such as NER and relation extraction (RE). These methods can automatically extract concepts, events, attributes, and other data, as well as the relationships between them. Most NER studies conducted thus far have utilized corpora in English or Chinese. Additionally, the bidirectional encoder representation from transformers using the BIO tagging system architecture is the most frequently reported classification scheme. We discovered a limited number of papers on the implementation of NER or RE tasks in EHRs within a specific clinical domain. CONCLUSIONS EHRs play a pivotal role in gathering clinical information and could serve as the primary source for automated clinical decision support systems. However, the creation of new corpora from EHRs in specific clinical domains is essential to facilitate the swift development of NER and RE models applied to EHRs for use in clinical practice.
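The BIO tagging scheme named in this abstract marks each token as the Beginning of an entity, Inside one, or Outside any entity. The sketch below shows one way such tags can be decoded back into entity strings; the function name, entity labels, and example sentence are hypothetical and do not come from the reviewed studies.

```python
from typing import List, Optional, Tuple


def decode_bio(tokens: List[str], tags: List[str]) -> List[Tuple[str, str]]:
    """Group BIO-tagged tokens into (entity_type, entity_text) pairs."""
    entities: List[Tuple[str, str]] = []
    current_type: Optional[str] = None
    current_tokens: List[str] = []

    def flush() -> None:
        """Close the entity currently being built, if any."""
        nonlocal current_type, current_tokens
        if current_type is not None:
            entities.append((current_type, " ".join(current_tokens)))
        current_type, current_tokens = None, []

    for token, tag in zip(tokens, tags):
        if tag.startswith("B-"):
            flush()  # a B- tag always starts a new entity
            current_type, current_tokens = tag[2:], [token]
        elif tag.startswith("I-") and current_type == tag[2:]:
            current_tokens.append(token)  # continue the open entity
        else:
            # "O" tag, or an I- tag that does not continue the open entity
            # (this sketch simply drops such stray I- tokens).
            flush()
    flush()
    return entities


# Hypothetical clinical sentence with illustrative labels.
tokens = ["Patient", "denies", "chest", "pain", "after", "taking", "aspirin", "."]
tags = ["O", "O", "B-PROBLEM", "I-PROBLEM", "O", "O", "B-DRUG", "O"]
entities = decode_bio(tokens, tags)
print(entities)
```

How to treat an `I-` tag with no preceding `B-` tag is a design choice; some evaluation scripts promote it to the start of a new entity instead of discarding it.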
Affiliation(s)
- María C. Durango: Grupo de Investigación e Innovación Biomédica, Instituto Tecnológico Metropolitano, Antioquia, Colombia
- Ever A. Torres-Silva: Grupo de Investigación e Innovación Biomédica, Instituto Tecnológico Metropolitano, Antioquia, Colombia
- Andrés Orozco-Duque: Grupo de Investigación e Innovación Biomédica, Instituto Tecnológico Metropolitano, Antioquia, Colombia; Facultad de Ingenierías, Universidad de Medellín, Antioquia, Colombia
9
El-Hayek C, Barzegar S, Faux N, Doyle K, Pillai P, Mutch SJ, Vaisey A, Ward R, Sanci L, Dunn AG, Hellard ME, Hocking JS, Verspoor K, Boyle DI. An evaluation of existing text de-identification tools for use with patient progress notes from Australian general practice. Int J Med Inform 2023; 173:105021. [PMID: 36870249 DOI: 10.1016/j.ijmedinf.2023.105021] [Received: 09/01/2022] [Revised: 02/07/2023] [Accepted: 02/10/2023] [Indexed: 02/13/2023]
Abstract
INTRODUCTION Digitized patient progress notes from general practice represent a significant resource for clinical and public health research but cannot feasibly and ethically be used for these purposes without automated de-identification. Internationally, several open-source natural language processing tools have been developed; however, given wide variations in clinical documentation practices, these cannot be utilized without appropriate review. We evaluated the performance of four de-identification tools and assessed their suitability for customization to Australian general practice progress notes. METHODS Four tools were selected: three rule-based (HMS Scrubber, MIT De-id, Philter) and one machine learning (MIST). 300 patient progress notes from three general practice clinics were manually annotated with personally identifying information. We conducted a pairwise comparison between the manual annotations and patient identifiers automatically detected by each tool, measuring recall (sensitivity), precision (positive predictive value), F1-score (harmonic mean of precision and recall), and F2-score (weights recall twice as heavily as precision). Error analysis was also conducted to better understand each tool's structure and performance. RESULTS Manual annotation detected 701 identifiers in seven categories. The rule-based tools detected identifiers in six categories and MIST in three. Philter achieved the highest aggregate recall (67%) and the highest recall for NAME (87%). HMS Scrubber achieved the highest recall for DATE (94%) and all tools performed poorly on LOCATION. MIST achieved the highest precision for NAME and DATE while also achieving similar recall to the rule-based tools for DATE and highest recall for LOCATION. Philter had the lowest aggregate precision (37%); however, preliminary adjustments of its rules and dictionaries showed a substantial reduction in false positives.
CONCLUSION Existing off-the-shelf solutions for automated de-identification of clinical text are not immediately suitable for our context without modification. Philter is the most promising candidate due to its high recall and flexibility; however, it will require extensive revision of its pattern-matching rules and dictionaries.
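The F1 and F2 measures described in this abstract are both instances of the general F-beta score, F_beta = (1 + beta^2) * P * R / (beta^2 * P + R). The sketch below plugs in the aggregate precision and recall quoted for Philter purely as input values; the resulting scores are illustrative arithmetic and are not figures reported by the study, and the function name is hypothetical.

```python
def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """F-beta score: beta > 1 favors recall, beta < 1 favors precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0  # avoid division by zero when nothing was matched
    b2 = beta ** 2
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Aggregate values quoted in the abstract for one tool (used only as inputs).
precision, recall = 0.37, 0.67
f1 = f_beta(precision, recall, beta=1.0)
f2 = f_beta(precision, recall, beta=2.0)
print(f"F1={f1:.3f} F2={f2:.3f}")
```

Because recall exceeds precision in this example, the F2 value comes out higher than the F1 value, which is why a recall-oriented metric can favor a tool that over-redacts.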
Collapse
Affiliation(s)
- Carol El-Hayek
  - Burnet Institute, Melbourne, Australia; Melbourne School of Population and Global Health, University of Melbourne, Australia; School of Public Health and Preventive Medicine, Monash University, Australia.
- Siamak Barzegar
  - School of Computing and Information Systems, University of Melbourne, Australia
- Noel Faux
  - Melbourne Data Analytics Platform, University of Melbourne, Australia; Florey Institute of Neuroscience and Mental Health, University of Melbourne, Australia
- Kim Doyle
  - Melbourne Data Analytics Platform, University of Melbourne, Australia
- Priyanka Pillai
  - Melbourne Data Analytics Platform, University of Melbourne, Australia; The Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Simon J Mutch
  - Melbourne Data Analytics Platform, University of Melbourne, Australia
- Alaina Vaisey
  - Melbourne School of Population and Global Health, University of Melbourne, Australia
- Roger Ward
  - Department of General Practice and Primary Care, University of Melbourne, Australia
- Lena Sanci
  - Department of General Practice and Primary Care, University of Melbourne, Australia
- Adam G Dunn
  - School of Medical Sciences, University of Sydney, Australia
- Margaret E Hellard
  - Burnet Institute, Melbourne, Australia; Melbourne School of Population and Global Health, University of Melbourne, Australia; School of Public Health and Preventive Medicine, Monash University, Australia; The Peter Doherty Institute for Infection and Immunity, Melbourne, Australia
- Jane S Hocking
  - Melbourne School of Population and Global Health, University of Melbourne, Australia
- Karin Verspoor
  - School of Computing and Information Systems, University of Melbourne, Australia; School of Computing Technologies, RMIT University, Melbourne, Australia
- Douglas Ir Boyle
  - Department of General Practice and Primary Care, University of Melbourne, Australia
|
10
|
Ghosh P, Niesen MJ, Pawlowski C, Bandi H, Yoo U, Lenehan PJ, M. PK, Nadig M, Ross J, Ardhanari S, O’Horo JC, Venkatakrishnan AJ, Rosen CJ, Telenti A, Hurt RT, Soundararajan V. Severe acute infection and chronic pulmonary disease are risk factors for developing post-COVID-19 conditions. medRxiv 2022:2022.11.30.22282831. [PMID: 36523407 PMCID: PMC9753786 DOI: 10.1101/2022.11.30.22282831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]
Abstract
Post-COVID-19 conditions, also known as "long COVID", have significantly impacted the lives of many individuals, but the risk factors for these conditions are poorly understood. In this study, we performed a retrospective EHR analysis of 89,843 individuals at a multi-state health system in the United States with PCR-confirmed COVID-19, including 1,086 patients diagnosed with long COVID and 1,086 matched controls not diagnosed with long COVID. For these two cohorts, we evaluated a wide range of clinical covariates, including laboratory tests, medication orders, phenotypes recorded in the clinical notes, and outcomes. We found that chronic pulmonary disease (CPD) was significantly more common as a pre-existing condition in the long COVID cohort than in the control cohort (odds ratio: 1.9, 95% CI: [1.5, 2.6]). Additionally, long COVID patients were more likely to have a history of migraine (odds ratio: 2.2, 95% CI: [1.6, 3.1]) and fibromyalgia (odds ratio: 2.3, 95% CI: [1.3, 3.8]). During the acute infection phase, the following lab measurements were abnormal in the long COVID cohort: high triglycerides (mean_longCOVID: 278.5 mg/dL vs. mean_control: 141.4 mg/dL), low HDL cholesterol levels (mean_longCOVID: 38.4 mg/dL vs. mean_control: 52.5 mg/dL), and high neutrophil-lymphocyte ratio (mean_longCOVID: 10.7 vs. mean_control: 7.2). The hospitalization rate during the acute infection phase was also higher in the long COVID cohort than in the control cohort (rate_longCOVID: 5% vs. rate_control: 1%). Overall, this study suggests that the severity of acute infection and a history of CPD, migraine, CFS, or fibromyalgia may be risk factors for long COVID symptoms. Our findings motivate clinical studies to evaluate whether proactively suppressing acute disease severity, especially in patients at high risk, can reduce the incidence of long COVID.
Affiliation(s)
- Hari Bandi
  - nference, inc., Cambridge, Massachusetts 02139, USA
- Unice Yoo
  - nference, inc., Cambridge, Massachusetts 02139, USA
- Mihika Nadig
  - nference, inc., Cambridge, Massachusetts 02139, USA
- Jason Ross
  - nference, inc., 18 3rd St. S.W., Rochester MN 55902, USA
- Clifford J. Rosen
  - Maine Medical Center, Portland, ME 04102, USA
  - NIH RECOVER Initiative, USA
- Venky Soundararajan
  - nference Labs, Bengaluru, India
  - nference, inc., Cambridge, Massachusetts 02139, USA
  - nference, inc., 18 3rd St. S.W., Rochester MN 55902, USA
  - nference, inc. 2424 Erwin Road, Durham, NC 27705, USA
  - Anumana, inc., Cambridge, Massachusetts 02139, USA
|
11
|
Chen JS, Lin WC, Yang S, Chiang MF, Hribar MR. Development of an Open-Source Annotated Glaucoma Medication Dataset From Clinical Notes in the Electronic Health Record. Transl Vis Sci Technol 2022; 11:20. [PMID: 36441131 PMCID: PMC9710490 DOI: 10.1167/tvst.11.11.20] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/05/2022] [Accepted: 10/21/2022] [Indexed: 11/30/2022] Open
Abstract
Purpose: To describe the processing methods and characteristics of an open dataset of clinical notes from the electronic health record (EHR) annotated for glaucoma medications. Methods: In this study, 480 clinical notes from office visits, medical record numbers (MRNs), visit identification numbers, provider names, and billing codes were extracted for 480 patients seen for glaucoma by a comprehensive or glaucoma ophthalmologist from January 1, 2019, to August 31, 2020. MRNs and all visit data were de-identified using a hash function with salt from the deidentifyr package. All progress notes were annotated for glaucoma medication name, route, frequency, dosage, and drug use using an open-source annotation tool, Doccano. Annotations were saved separately. All protected health information (PHI) in progress notes and annotated files was de-identified using the published de-identification algorithm Philter. All progress notes and annotations were manually validated by two ophthalmologists to ensure complete de-identification. Results: The final dataset contained 5520 annotated sentences, including those with and without medications, for 480 clinical notes. Manual validation revealed 10 instances of remaining PHI, which were manually corrected. Conclusions: Annotated free-text clinical notes can be de-identified for upload as an open dataset. As data availability increases with the adoption of EHRs, free-text open datasets will become increasingly valuable for "big data" research and artificial intelligence development. This dataset is published online and publicly available at https://github.com/jche253/Glaucoma_Med_Dataset. Translational Relevance: This open-access medication dataset may be a source of raw data for future big data and artificial intelligence research using free text.
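The salted-hash step mentioned in the Methods can be illustrated with a minimal Python sketch. This is not the deidentifyr package (an R package) and is not taken from the paper; the function names `make_salt` and `pseudonymize`, the choice of SHA-256, and the sample identifiers are all assumptions made for illustration.

```python
import hashlib
import secrets


def make_salt(n_bytes: int = 16) -> bytes:
    """Generate a random secret salt (hypothetical helper, not from the paper)."""
    return secrets.token_bytes(n_bytes)


def pseudonymize(identifier: str, salt: bytes) -> str:
    """Return a hex digest of salt + identifier.

    SHA-256 is an assumed choice; the abstract does not name the hash function.
    """
    return hashlib.sha256(salt + identifier.encode("utf-8")).hexdigest()


# Made-up identifiers for illustration only.
salt = make_salt()
mrns = ["MRN0001", "MRN0002", "MRN0001"]
tokens = [pseudonymize(m, salt) for m in mrns]

# The same identifier maps to the same token under one salt, so records for
# one patient can still be linked without exposing the original MRN.
same_patient_linked = tokens[0] == tokens[2]
```

Keeping the salt secret is what prevents someone from rebuilding the mapping by hashing a list of candidate MRNs; reusing one salt across the dataset is what preserves linkage between a patient's records.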
Affiliation(s)
- Jimmy S. Chen
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
  - Division of Ophthalmology Informatics and Data Science, Viterbi Family Department of Ophthalmology and Shiley Eye Institute, University of California San Diego, La Jolla, CA, USA
- Wei-Chun Lin
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
- Sen Yang
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
- Michael F. Chiang
  - National Eye Institute, National Institutes of Health, Bethesda, MD, USA
- Michelle R. Hribar
  - Department of Ophthalmology, Casey Eye Institute, Oregon Health & Science University, Portland, OR, USA
  - Department of Medical Informatics and Clinical Epidemiology, Oregon Health & Science University, Portland, OR, USA
|
12
|
Moving towards vertically integrated artificial intelligence development. NPJ Digit Med 2022; 5:143. [PMID: 36104535 PMCID: PMC9474277 DOI: 10.1038/s41746-022-00690-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/10/2022] [Accepted: 08/31/2022] [Indexed: 11/08/2022] Open
Abstract
Substantial interest and investment in clinical artificial intelligence (AI) research have not resulted in widespread translation to deployed AI solutions. Current attention has focused on bias and explainability in AI algorithm development, external validity and model generalisability, and lack of equity and representation in existing data. While of great importance, these considerations also reflect a model-centric approach seen in published clinical AI research, which focuses on optimising architecture and performance of an AI model on best available datasets. However, even robustly built models using state-of-the-art algorithms may fail once tested in realistic environments due to unpredictability of real-world conditions, out-of-dataset scenarios, characteristics of deployment infrastructure, and lack of added value to clinical workflows relative to cost and potential clinical risks. In this perspective, we define a vertically integrated approach to AI development that incorporates early, cross-disciplinary consideration of impact evaluation, data lifecycles, and AI production, and explore its implementation in two contrasting AI development pipelines: a scalable “AI factory” (Mayo Clinic, Rochester, United States), and an end-to-end cervical cancer screening platform for resource-poor settings (Paps AI, Mbarara, Uganda). We provide practical recommendations for implementers, and discuss future challenges and novel approaches (including a decentralised federated architecture being developed in the NHS (AI4VBH, London, UK)). Growth in global clinical AI research continues unabated, and introduction of vertically integrated teams and development practices can increase the translational potential of future clinical AI projects.
|
13
|
Mercorelli L, Nguyen H, Gartell N, Brookes M, Morris J, Tam CS. A framework for de-identification of free-text data in electronic medical records enabling secondary use. Aust Health Rev 2022; 46:289-293. [PMID: 35546422 DOI: 10.1071/ah21361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 03/18/2022] [Indexed: 11/23/2022]
Abstract
Clinical free-text data represent a vast, untapped source of rich information. If made more accessible for research, these data would supplement information captured in structured fields. Data need to be de-identified prior to being reused for research. However, a lack of transparency in existing de-identification software tools makes it difficult for data custodians to assess potential risks associated with the release of de-identified clinical free-text data. This case study describes the development of a framework for releasing de-identified clinical free-text data in two local health districts in NSW, Australia. A sample of clinical documents (n = 14 768 965), including progress notes, nursing and medical assessments and discharge summaries, was used for development. An algorithm was designed to identify and mask patient names without damaging data utility. For each note, the algorithm output (i) the note length before and after de-identification, (ii) the number of patient names and (iii) the number of common words. These outputs were used to iteratively refine the algorithm's performance. This was followed by manual review of a random subset of records by a health information manager. Notes that were not correctly de-identified were fixed, and performance was reassessed until resolution. All notes in this sample were suitably de-identified using this method. Developing a transparent method for de-identifying clinical free-text data enables informed decision-making by data custodians and the safe re-use of clinical free-text data for research and public benefit.
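The per-note outputs listed in the abstract (length before and after, number of patient names, number of common words) can be sketched roughly as follows. This is not the authors' algorithm, which the abstract does not specify; the function `mask_patient_names`, the whole-word regular-expression matching, the `[NAME]` mask, and the reading of "common words" as a count of tokens found in a supplied vocabulary are all assumptions made for illustration.

```python
import re


def mask_patient_names(note, names, common_words, mask="[NAME]"):
    """Mask known patient names in a note and report simple per-note metrics.

    Hypothetical sketch: names is a list of known name strings, common_words
    is a set of lower-case vocabulary words used as a rough utility proxy.
    """
    if names:
        # Longest names first so "John Smith" is masked before "Smith".
        ordered = sorted(names, key=len, reverse=True)
        pattern = re.compile(
            r"\b(?:" + "|".join(re.escape(n) for n in ordered) + r")\b",
            re.IGNORECASE,
        )
        masked, n_names = pattern.subn(mask, note)
    else:
        masked, n_names = note, 0

    # Count vocabulary words that survive masking (the mask itself is excluded).
    remaining = masked.replace(mask, " ").lower()
    words = re.findall(r"[a-z]+", remaining)
    n_common = sum(1 for w in words if w in common_words)

    return {
        "masked_note": masked,
        "length_before": len(note),
        "length_after": len(masked),
        "n_names_masked": n_names,
        "n_common_words": n_common,
    }


# Made-up note for illustration only.
report = mask_patient_names(
    "Mr John Smith reviewed today. Smith reports pain.",
    names=["Smith", "John Smith"],
    common_words={"reviewed", "today", "reports", "pain"},
)
```

Tracking these numbers per note gives a reviewer a cheap signal of over-masking (a large drop in common words) or under-masking (no names found in a note that should contain them), which is the kind of iterative refinement the abstract describes.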
Affiliation(s)
- Louis Mercorelli
  - Sydney Informatics Hub, University of Sydney, NSW, Australia; and Clinical Informatics Unit, Northern Sydney Local Health District, NSW, Australia
- Harrison Nguyen
  - Performance and Analytics, Northern Sydney Local Health District, NSW, Australia; and Faculty of Medicine and Health, University of Sydney, Office 543, Level 5, School of Computer Science (J12), NSW 2006, Australia
- Nicole Gartell
  - Health Information Services, Northern Sydney Local Health District, NSW, Australia
- Martyn Brookes
  - Performance and Analytics, Northern Sydney Local Health District, NSW, Australia
- Charmaine S Tam
  - Performance and Analytics, Northern Sydney Local Health District, NSW, Australia; and Faculty of Medicine and Health, University of Sydney, Office 543, Level 5, School of Computer Science (J12), NSW 2006, Australia
|
14
|
Abstract
With increasing digitization of healthcare, real-world data (RWD) are available in greater quantity and scope than ever before. Since the 2016 United States 21st Century Cures Act, innovations in the RWD life cycle have taken tremendous strides forward, largely driven by demand for regulatory-grade real-world evidence from the biopharmaceutical sector. However, use cases for RWD continue to grow in number, moving beyond drug development, to population health and direct clinical applications pertinent to payors, providers, and health systems. Effective RWD utilization requires disparate data sources to be turned into high-quality datasets. To harness the potential of RWD for emerging use cases, providers and organizations must accelerate life cycle improvements that support this process. We build on examples obtained from the academic literature and author experience of data curation practices across a diverse range of sectors to describe a standardized RWD life cycle containing key steps in production of useful data for analysis and insights. We delineate best practices that will add value to current data pipelines. Seven themes are highlighted that ensure sustainability and scalability for RWD life cycles: data standards adherence, tailored quality assurance, data entry incentivization, deploying natural language processing, data platform solutions, RWD governance, and ensuring equity and representation in data.
|