Reference Citation Analysis

For: Murugadoss K, Rajasekharan A, Malin B, Agarwal V, Bade S, Anderson JR, Ross JL, Faubion WA, Halamka JD, Soundararajan V, Ardhanari S. Building a best-in-class automated de-identification tool for electronic health records through ensemble learning. Patterns (N Y) 2021;2:100255. [PMID: 34179842 PMCID: PMC8212138 DOI: 10.1016/j.patter.2021.100255] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/06/2021] [Revised: 02/24/2021] [Accepted: 04/07/2021] [Indexed: 10/29/2022]

Cited by Other Article(s)

Nedoshivina L, Halimi A, Bettencourt-Silva J, Braghin S. Pragmatic De-Identification of Cross-Domain Unstructured Documents: A Utility-Preserving Approach with Relation Extraction Filtering. AMIA Jt Summits Transl Sci Proc 2024;2024:85-94. [PMID: 38827069 PMCID: PMC11141830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/04/2024]

Heider PM, Meystre SM. An Extensible Evaluation Framework Applied to Clinical Text Deidentification Natural Language Processing Tools: Multisystem and Multicorpus Study. J Med Internet Res 2024;26:e55676. [PMID: 38805692 PMCID: PMC11167315 DOI: 10.2196/55676] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/20/2023] [Revised: 04/11/2024] [Accepted: 04/13/2024] [Indexed: 05/30/2024] Open

Abstract

BACKGROUND

Clinical natural language processing (NLP) researchers need access to directly comparable evaluation results for applications such as text deidentification across a range of corpus types and the means to easily test new systems or corpora within the same framework. Current systems, reported metrics, and the personally identifiable information (PII) categories evaluated are not easily comparable.

OBJECTIVE

This study presents an open-source and extensible end-to-end framework for comparing clinical NLP system performance across corpora even when the annotation categories do not align.

METHODS

As a use case for this framework, we use 6 off-the-shelf text deidentification systems (ie, CliniDeID, deid from PhysioNet, MITRE Identity Scrubber Toolkit [MIST], NeuroNER, National Library of Medicine [NLM] Scrubber, and Philter) across 3 standard clinical text corpora for the task (2 of which are publicly available) and 1 private corpus (all in English), with annotation categories that are not directly analogous. The framework is built on shell scripts that can be extended to include new systems, corpora, and performance metrics. We present this open tool, multiple means for aligning PII categories during evaluation, and our initial timing and performance metric findings. Code for running this framework, with all settings needed to run all pairs, is available via Codeberg and GitHub.

RESULTS

From this case study, we found large differences in processing speed between systems. The fastest system (ie, MIST) processed an average of 24.57 (SD 26.23) notes per second, while the slowest (ie, CliniDeID) processed an average of 1.00 notes per second. No system uniformly outperformed the others at identifying PII across corpora and categories. Instead, a rich tapestry of performance trade-offs emerged for PII categories. CliniDeID and Philter prioritize recall over precision (with an average recall 6.9 and 11.2 points higher, respectively, for partially matching spans of text matching any PII category), while the other 4 systems consistently have higher precision (with MIST's precision scoring 20.2 points higher, NLM Scrubber scoring 4.4 points higher, NeuroNER scoring 7.2 points higher, and deid scoring 17.1 points higher). The macroaverage recall across corpora for identifying names, one of the more sensitive PII categories, included deid (48.8%) and MIST (66.9%) at the low end and NeuroNER (84.1%), NLM Scrubber (88.1%), and CliniDeID (95.9%) at the high end. A variety of metrics across categories and corpora are reported with a wider variety (eg, F2-score) available via the tool.
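The two scoring ideas mentioned in these results, recall over partially matching spans and macroaveraging across corpora, can be illustrated with a minimal sketch. The abstract does not describe the framework's actual matching rules, so the overlap test below (any shared character counts as a partial match) is an assumption, and the function names are hypothetical rather than part of the published tool.

```python
def spans_overlap(a, b):
    """True when two (start, end) character spans share at least one character.

    Spans are half-open: start is inclusive, end is exclusive.
    """
    return a[0] < b[1] and b[0] < a[1]


def partial_match_recall(gold, predicted):
    """Fraction of gold spans overlapped by at least one predicted span.

    Assumption: any overlap counts as a hit. Returns 0.0 when there are
    no gold spans, which is a convention chosen for this sketch.
    """
    if not gold:
        return 0.0
    hits = sum(1 for g in gold if any(spans_overlap(g, p) for p in predicted))
    return hits / len(gold)


def macro_average(values):
    """Unweighted mean, so each corpus counts equally regardless of size."""
    values = list(values)
    return sum(values) / len(values)


# Hypothetical spans for two tiny corpora (not data from the study).
corpus_a = partial_match_recall([(0, 5), (10, 15), (20, 25)], [(3, 8), (22, 23)])
corpus_b = partial_match_recall([(0, 4)], [(0, 4)])
print(macro_average([corpus_a, corpus_b]))
```

Macroaveraging weights each corpus equally, so a small corpus influences the reported figure as much as a large one.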

CONCLUSIONS

NLP systems in general and deidentification systems and corpora in our use case tend to be evaluated in stand-alone research articles that only include a limited set of comparators. We hold that a single evaluation pipeline across multiple systems and corpora allows for more nuanced comparisons. Our open pipeline should reduce barriers to evaluation and system advancement.

Kovačević A, Bašaragin B, Milošević N, Nenadić G. De-identification of clinical free text using natural language processing: A systematic review of current approaches. Artif Intell Med 2024;151:102845. [PMID: 38555848 DOI: 10.1016/j.artmed.2024.102845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 03/13/2024] [Accepted: 03/18/2024] [Indexed: 04/02/2024]

Abstract

BACKGROUND

Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e., the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.

OBJECTIVES

Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field.

METHODS

A systematic search in PubMed, Web of Science, and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance.

RESULTS

A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.

CONCLUSION

Earlier de-identification approaches aimed at English were mainly rule and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks. Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98%) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be more thoroughly assessed with different measures to judge their reliability to safely de-identify data in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.

Barman H, Venkateswaran S, Santo AD, Yoo U, Silvert E, Rao K, Raghunathan B, Kottschade LA, Block MS, Chandler GS, Zalis J, Wagner TE, Mohindra R. Identification and Characterization of Immune Checkpoint Inhibitor-Induced Toxicities From Electronic Health Records Using Natural Language Processing. JCO Clin Cancer Inform 2024;8:e2300151. [PMID: 38687915 PMCID: PMC11161244 DOI: 10.1200/cci.23.00151] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/15/2023] [Revised: 01/09/2024] [Accepted: 03/01/2024] [Indexed: 05/02/2024] Open

Liu J, Gupta S, Chen A, Wang CK, Mishra P, Dai HJ, Wong ZSY, Jonnagaddala J. OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study. J Med Internet Res 2023;25:e48145. [PMID: 38055317 PMCID: PMC10733816 DOI: 10.2196/48145] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/13/2023] [Revised: 07/26/2023] [Accepted: 11/22/2023] [Indexed: 12/07/2023] Open

Abstract

BACKGROUND

Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must be removed in several cases to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in deidentification. However, very few studies investigated the combination of transformer-based language models and rules.

OBJECTIVE

The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embedding and transformer-based language models.

METHODS

In this study, we present a hybrid deidentification pipeline called OpenDeID, which is developed using an Australian multicenter EHR-based corpus called OpenDeID Corpus. The OpenDeID corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models.

RESULTS

The OpenDeID achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The OpenDeID pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time.

CONCLUSIONS

The OpenDeID pipeline is a hybrid deidentification pipeline to deidentify SHI entities in unstructured EHR text notes. The pipeline has been evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate the effectiveness of the OpenDeID pipeline.

Awasthi S, Sachdeva N, Gupta Y, Anto AG, Asfahan S, Abbou R, Bade S, Sood S, Hegstrom L, Vellanki N, Alger HM, Babu M, Medina-Inojosa JR, McCully RB, Lerman A, Stampehl M, Barve R, Attia ZI, Friedman PA, Soundararajan V, Lopez-Jimenez F. Identification and risk stratification of coronary disease by artificial intelligence-enabled ECG. EClinicalMedicine 2023;65:102259. [PMID: 38106563 PMCID: PMC10725070 DOI: 10.1016/j.eclinm.2023.102259] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/12/2023] [Revised: 09/20/2023] [Accepted: 09/22/2023] [Indexed: 12/19/2023] Open

Abstract

Background

Atherosclerotic cardiovascular disease (ASCVD) is the leading cause of death worldwide, driven primarily by coronary artery disease (CAD). ASCVD risk estimators such as the pooled cohort equations (PCE) facilitate risk stratification and primary prevention of ASCVD but their accuracy is still suboptimal.

Methods

Using deep electronic health record data from 7,116,209 patients seen at 70+ hospitals and clinics across 5 states in the USA, we developed an artificial intelligence-based electrocardiogram analysis tool (ECG-AI) to detect CAD and assessed the additive value of ECG-AI-based ASCVD risk stratification to the PCE. We created independent ECG-AI models using separate neural networks including subjects without known history of ASCVD, to identify coronary artery calcium (CAC) score ≥300 Agatston units by computed tomography, obstructive CAD by angiography or procedural intervention, and regional left ventricular akinesis in ≥1 segment by echocardiogram, as a reflection of possible prior myocardial infarction (MI). These were used to assess the utility of ECG-AI-based ASCVD risk stratification in a retrospective observational study consisting of patients with PCE scores and no prior ASCVD. The study period covered all available digitized EHR data, with the first available ECG in 1987 and the last in February 2023.

Findings

ECG-AI for identifying CAC ≥300, obstructive CAD, and regional akinesis achieved area under the receiver operating characteristic curve (AUROC) values of 0.88, 0.85, and 0.94, respectively. An ensembled ECG-AI identified 3-, 5-, and 10-year risk for acute coronary events and mortality independently and additively to PCE. Hazard ratios for acute coronary events over 3 years in patients without ASCVD that tested positive on 1, 2, or 3 versus 0 disease-specific ECG-AI models at cohort entry were 2.41 (2.14-2.71), 4.23 (3.74-4.78), and 11.75 (10.2-13.52), respectively. Similar stratification was observed in cohorts stratified by PCE or age.

Interpretation

ECG-AI has potential to address unmet need for accessible risk stratification in patients in whom PCE under, over, or insufficiently estimates ASCVD risk, and in whom risk assessment over time periods shorter than 10 years is desired.

Funding

Anumana.

Affiliation(s)

Samir Awasthi: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Nikhil Sachdeva: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Yash Gupta: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Ausath G. Anto: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Shahir Asfahan: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Ruben Abbou: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Sairam Bade: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Sanyam Sood: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Lars Hegstrom: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Nirupama Vellanki: nference, Inc, One Main Street, Cambridge, MA, USA; Beth Israel Deaconess Medical Center, Boston, MA, USA
Heather M. Alger: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Melwin Babu: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Jose R. Medina-Inojosa: Mayo Clinic, Rochester, MN, USA
Robert B. McCully: Mayo Clinic, Rochester, MN, USA
Amir Lerman: Mayo Clinic, Rochester, MN, USA
Mark Stampehl: Novartis Pharmaceuticals Corporation, East Hanover, NJ, USA
Rakesh Barve: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Zachi I. Attia: Mayo Clinic, Rochester, MN, USA
Paul A. Friedman: Mayo Clinic, Rochester, MN, USA
Venky Soundararajan: Anumana, Inc, One Main Street, Cambridge, MA, USA; nference, Inc, One Main Street, Cambridge, MA, USA
Francisco Lopez-Jimenez: Mayo Clinic, Rochester, MN, USA

Barman H, Sikirica V, Carlson K, Silvert E, Carlson KB, Boyer S, Glaser R, Morava E, Wagner T, Lanpher B. Retrospective study of propionic acidemia using natural language processing in Mayo Clinic electronic health record data. Mol Genet Metab 2023;140:107695. [PMID: 37708666 DOI: 10.1016/j.ymgme.2023.107695] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/29/2023] [Revised: 08/31/2023] [Accepted: 08/31/2023] [Indexed: 09/16/2023]

Abstract

BACKGROUND

Propionic acidemia (PA) is a rare autosomal recessive organic acidemia that classically presents within the first days of life with a metabolic crisis or via newborn screening and is confirmed with laboratory tests. Limited data exist on the natural history of patients with PA describing presentation, treatments, and clinical outcomes.

OBJECTIVE

To retrospectively describe the natural history of patients with PA in a clinical setting from a real-world database using both structured and unstructured electronic health record (EHR) data using novel data extraction techniques in a unique care setting.

DESIGN/METHODS

This retrospective study used EHR data to identify patients with PA seen at the Mayo Clinic. Unstructured clinical text (medical notes, pathology reports) was analyzed using augmented curation natural language processing models to enhance analysis of data extracted from structured data fields (International Classification of Diseases 9th or 10th revision [ICD-9/-10] codes, Current Procedural Terminology [CPT] codes, and medication orders). De-identified health records were also manually reviewed by clinical scientists to ensure data accuracy and completeness. The index date was defined as the patient's date of PA diagnosis at the Mayo Clinic. Results were reported as aggregate descriptive statistics relative to patients' index dates. Complications, therapeutic interventions, laboratory tests, procedures, and hospitalization encounters related to PA were described at and within 6 months of the patient's index date, and from medical history available before the index date.

RESULTS

In total, 13 patients with PA were identified, with visits occurring from 1998 to 2022. Age at diagnosis ranged from birth to 3 years; age at initial evaluation at the Mayo Clinic ranged from 3 days to 28 years. The mean number of Mayo Clinic outpatient visits was 31 (median duration of care, 2 years). PA-related complications were documented in 85% of patients and included nutritional difficulties (46%), metabolic decompensation events (MDEs; 38%), neurologic abnormalities (38%), and cardiomyopathy (7%). One pair of affected siblings had mild symptoms and no complications or MDEs. All 5 patients with a history of MDEs presented with developmental delays. Among patients with MDEs, the mean frequency of outpatient clinical care visits was 10 per year, and 3 patients required inpatient hospitalization (mean duration, 16 days). The incidence of severe complications was higher among patients with MDEs than those without MDEs. Of the patients with MDEs, 2 experienced crises while receiving treatment at the Mayo Clinic, with 9 total MDEs occurring between the 2 patients. Symptoms at presentation included hyperammonemia (78%), fever and/or decreased nutritional intake (67%), hyperglycemia/hypoglycemia (56%), intercurrent upper respiratory infection and/or lethargy (44%), constipation (33%), altered mental status (33%), and cough (33%).

CONCLUSIONS

This study highlights the range and frequency of clinical outcomes experienced by patients with PA and demonstrates the clinical burden of MDEs.

Durango MC, Torres-Silva EA, Orozco-Duque A. Named Entity Recognition in Electronic Health Records: A Methodological Review. Healthc Inform Res 2023;29:286-300. [PMID: 37964451 PMCID: PMC10651400 DOI: 10.4258/hir.2023.29.4.286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/21/2023] [Revised: 07/29/2023] [Accepted: 09/03/2023] [Indexed: 11/16/2023] Open

El-Hayek C, Barzegar S, Faux N, Doyle K, Pillai P, Mutch SJ, Vaisey A, Ward R, Sanci L, Dunn AG, Hellard ME, Hocking JS, Verspoor K, Boyle DI. An evaluation of existing text de-identification tools for use with patient progress notes from Australian general practice. Int J Med Inform 2023;173:105021. [PMID: 36870249 DOI: 10.1016/j.ijmedinf.2023.105021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 02/07/2023] [Accepted: 02/10/2023] [Indexed: 02/13/2023]

Abstract

INTRODUCTION

Digitized patient progress notes from general practice represent a significant resource for clinical and public health research but cannot feasibly and ethically be used for these purposes without automated de-identification. Internationally, several open-source natural language processing tools have been developed; however, given wide variations in clinical documentation practices, these cannot be used without appropriate review. We evaluated the performance of four de-identification tools and assessed their suitability for customization to Australian general practice progress notes.

METHODS

Four tools were selected: three rule-based (HMS Scrubber, MIT De-id, Philter) and one machine learning (MIST). 300 patient progress notes from three general practice clinics were manually annotated with personally identifying information. We conducted a pairwise comparison between the manual annotations and patient identifiers automatically detected by each tool, measuring recall (sensitivity), precision (positive predictive value), f1-score (harmonic mean of precision and recall), and f2-score (weighs recall 2x higher than precision). Error analysis was also conducted to better understand each tool's structure and performance.
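The four metrics named above follow from true-positive, false-positive, and false-negative counts. The sketch below uses the standard F-beta formula, in which beta = 1 gives the f1-score and beta = 2 gives the f2-score that favors recall. The function name and the example counts are hypothetical and are not taken from the study.

```python
def precision_recall_fbeta(tp, fp, fn, beta=1.0):
    """Return (precision, recall, F-beta) from raw match counts.

    precision = tp / (tp + fp)   (positive predictive value)
    recall    = tp / (tp + fn)   (sensitivity)
    F-beta    = (1 + beta^2) * P * R / (beta^2 * P + R)
    """
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision == 0.0 and recall == 0.0:
        # Avoid dividing by zero when nothing was matched.
        return precision, recall, 0.0
    b2 = beta * beta
    fbeta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, fbeta


# Hypothetical counts: 60 identifiers found correctly, 40 false alarms, 20 missed.
p, r, f1 = precision_recall_fbeta(60, 40, 20, beta=1.0)
_, _, f2 = precision_recall_fbeta(60, 40, 20, beta=2.0)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f} f2={f2:.3f}")
```

With recall above precision, as in this example, the f2-score comes out higher than the f1-score, which is why it suits de-identification, where a missed identifier usually costs more than a false alarm.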

RESULTS

Manual annotation detected 701 identifiers in seven categories. The rule-based tools detected identifiers in six categories and MIST in three. Philter achieved the highest aggregate recall (67%) and the highest recall for NAME (87%). HMS Scrubber achieved the highest recall for DATE (94%) and all tools performed poorly on LOCATION. MIST achieved the highest precision for NAME and DATE while also achieving similar recall to the rule-based tools for DATE and highest recall for LOCATION. Philter had the lowest aggregate precision (37%); however, preliminary adjustments of its rules and dictionaries showed a substantial reduction in false positives.

CONCLUSION

Existing off-the-shelf solutions for automated de-identification of clinical text are not immediately suitable for our context without modification. Philter is the most promising candidate due to its high recall and flexibility; however, it will require extensive revision of its pattern-matching rules and dictionaries.

Ghosh P, Niesen MJ, Pawlowski C, Bandi H, Yoo U, Lenehan PJ, M. PK, Nadig M, Ross J, Ardhanari S, O’Horo JC, Venkatakrishnan AJ, Rosen CJ, Telenti A, Hurt RT, Soundararajan V. Severe acute infection and chronic pulmonary disease are risk factors for developing post-COVID-19 conditions. medRxiv 2022:2022.11.30.22282831. [PMID: 36523407 PMCID: PMC9753786 DOI: 10.1101/2022.11.30.22282831] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/17/2023]

Abstract

Post-COVID-19 conditions, also known as "long COVID", have significantly impacted the lives of many individuals, but the risk factors for this condition are poorly understood. In this study, we performed a retrospective EHR analysis of 89,843 individuals at a multi-state health system in the United States with PCR-confirmed COVID-19, including 1,086 patients diagnosed with long COVID and 1,086 matched controls not diagnosed with long COVID. For these two cohorts, we evaluated a wide range of clinical covariates, including laboratory tests, medication orders, phenotypes recorded in the clinical notes, and outcomes. We found that chronic pulmonary disease (CPD) was significantly more common as a pre-existing condition for the long COVID cohort than the control cohort (odds ratio: 1.9, 95% CI: [1.5, 2.6]). Additionally, long-COVID patients were more likely to have a history of migraine (odds ratio: 2.2, 95% CI: [1.6, 3.1]) and fibromyalgia (odds ratio: 2.3, 95% CI: [1.3, 3.8]). During the acute infection phase, the following lab measurements were abnormal in the long COVID cohort: high triglycerides (long COVID mean: 278.5 mg/dL vs. control mean: 141.4 mg/dL), low HDL cholesterol levels (long COVID mean: 38.4 mg/dL vs. control mean: 52.5 mg/dL), and high neutrophil-lymphocyte ratio (long COVID mean: 10.7 vs. control mean: 7.2). The hospitalization rate during the acute infection phase was also higher in the long COVID cohort compared to the control cohort (long COVID rate: 5% vs. control rate: 1%). Overall, this study suggests that the severity of acute infection and a history of CPD, migraine, CFS, or fibromyalgia may be risk factors for long COVID symptoms. Our findings motivate clinical studies to evaluate whether suppressing acute disease severity proactively, especially in patients at high risk, can reduce incidence of long COVID.

Chen JS, Lin WC, Yang S, Chiang MF, Hribar MR. Development of an Open-Source Annotated Glaucoma Medication Dataset From Clinical Notes in the Electronic Health Record. Transl Vis Sci Technol 2022;11:20. [PMID: 36441131 PMCID: PMC9710490 DOI: 10.1167/tvst.11.11.20] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 06/05/2022] [Accepted: 10/21/2022] [Indexed: 11/30/2022] Open

Abstract

Purpose

To describe the processing methods and characteristics of an open dataset of clinical notes from the electronic health record (EHR) annotated for glaucoma medications.

Methods

In this study, 480 clinical notes from office visits, medical record numbers (MRNs), visit identification numbers, provider names, and billing codes were extracted for 480 patients seen for glaucoma by a comprehensive or glaucoma ophthalmologist from January 1, 2019, to August 31, 2020. MRNs and all visit data were de-identified using a hash function with salt from the deidentifyr package. All progress notes were annotated for glaucoma medication name, route, frequency, dosage, and drug use using an open-source annotation tool, Doccano. Annotations were saved separately. All protected health information (PHI) in progress notes and annotated files were de-identified using the published de-identifying algorithm Philter. All progress notes and annotations were manually validated by two ophthalmologists to ensure complete de-identification.
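The salted-hash step for MRNs and visit data can be sketched as follows. The study used the deidentifyr package, which is an R package. The Python below is not that package's API, and the choice of SHA-256, the salt length, and the function names are assumptions made for illustration.

```python
import hashlib
import secrets


def make_salt(n_bytes=16):
    """Generate a random salt; keep it secret and reuse it within one dataset."""
    return secrets.token_bytes(n_bytes)


def pseudonymize(identifier, salt):
    """Map an identifier (e.g. an MRN string) to a hex digest.

    Assumption: SHA-256 over salt + identifier. The same identifier and salt
    always give the same digest, so records can still be linked, while the
    salt makes it harder to recover the MRN by hashing candidate values.
    """
    digest = hashlib.sha256(salt + identifier.encode("utf-8"))
    return digest.hexdigest()


# Hypothetical MRNs, not taken from the dataset.
salt = make_salt()
print(pseudonymize("MRN0001", salt) == pseudonymize("MRN0001", salt))
print(pseudonymize("MRN0001", salt) == pseudonymize("MRN0002", salt))
```

Discarding the salt after processing prevents anyone from recomputing the mapping later, at the cost of not being able to link newly extracted records to the released dataset.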

Results

The final dataset contained 5520 annotated sentences, including those with and without medications, for 480 clinical notes. Manual validation revealed 10 instances of remaining PHI which were manually corrected.

Conclusions

Annotated free-text clinical notes can be de-identified for upload as an open dataset. As data availability increases with the adoption of EHRs, free-text open datasets will become increasingly valuable for "big data" research and artificial intelligence development. This dataset is published online and publicly available at https://github.com/jche253/Glaucoma_Med_Dataset.

Translational Relevance

This open access medication dataset may be a source of raw data for future research involving big data and artificial intelligence research using free-text.

Moving towards vertically integrated artificial intelligence development. NPJ Digit Med 2022;5:143. [PMID: 36104535 PMCID: PMC9474277 DOI: 10.1038/s41746-022-00690-x] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 03/10/2022] [Accepted: 08/31/2022] [Indexed: 11/08/2022] Open

Mercorelli L, Nguyen H, Gartell N, Brookes M, Morris J, Tam CS. A framework for de-identification of free-text data in electronic medical records enabling secondary use. Aust Health Rev 2022;46:289-293. [PMID: 35546422 DOI: 10.1071/ah21361] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/23/2021] [Accepted: 03/18/2022] [Indexed: 11/23/2022]

Best practices in the real-world data life cycle. PLOS Digit Health 2022;1:e0000003. [PMID: 36812509 PMCID: PMC9931348 DOI: 10.1371/journal.pdig.0000003] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Indexed: 01/12/2023]