1
Pilgram L, Meurers T, Malin B, Schaeffner E, Eckardt KU, Prasser F. The Costs of Anonymization: Case Study Using Clinical Data. J Med Internet Res 2024; 26:e49445. [PMID: 38657232] [DOI: 10.2196/49445]
Abstract
BACKGROUND Sharing data from clinical studies can accelerate scientific progress, improve transparency, and increase the potential for innovation and collaboration. However, privacy concerns remain a barrier to data sharing. Certain concerns, such as reidentification risk, can be addressed through the application of anonymization algorithms, whereby data are altered so that they are no longer reasonably related to a person. Yet, such alterations have the potential to influence the data set's statistical properties, such that the privacy-utility trade-off must be considered. This has been studied in theory, but evidence based on real-world individual-level clinical data is rare, and anonymization has not been broadly adopted in clinical practice. OBJECTIVE The goal of this study is to contribute to a better understanding of anonymization in the real world by comprehensively evaluating the privacy-utility trade-off of differently anonymized data using data and scientific results from the German Chronic Kidney Disease (GCKD) study. METHODS The GCKD data set extracted for this study consists of 5217 records and 70 variables. A 2-step procedure was followed to determine which variables constituted reidentification risks. To capture a large portion of the risk-utility space, we decided on risk thresholds ranging from 0.02 to 1. The data were then transformed via generalization and suppression, and the anonymization process was varied using a generic and a use case-specific configuration. To assess the utility of the anonymized GCKD data, general-purpose metrics (ie, data granularity and entropy), as well as use case-specific metrics (ie, reproducibility), were applied. Reproducibility was assessed by measuring the overlap of the 95% CI lengths between anonymized and original results. RESULTS Reproducibility measured by 95% CI overlap was higher than utility obtained from general-purpose metrics.
For example, granularity varied between 68.2% and 87.6%, and entropy varied between 25.5% and 46.2%, whereas the average 95% CI overlap was above 90% for all risk thresholds applied. A nonoverlapping 95% CI was detected in 6 estimates across all analyses, but the overwhelming majority of estimates exhibited an overlap over 50%. The use case-specific configuration outperformed the generic one in terms of actual utility (ie, reproducibility) at the same level of privacy. CONCLUSIONS Our results illustrate the challenges that anonymization faces when aiming to support multiple likely and possibly competing uses, while use case-specific anonymization can provide greater utility. This aspect should be taken into account when evaluating the associated costs of anonymized data and attempting to maintain sufficiently high levels of privacy for anonymized data. TRIAL REGISTRATION German Clinical Trials Register DRKS00003971; https://drks.de/search/en/trial/DRKS00003971. INTERNATIONAL REGISTERED REPORT IDENTIFIER (IRRID) RR2-10.1093/ndt/gfr456.
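The reproducibility metric used in this study, the overlap of 95% CIs between results computed on original and anonymized data, can be sketched as follows. This is a minimal illustration with invented interval endpoints; normalizing the overlap by the original CI's length is an assumption, as the paper's exact definition may differ.

```python
def ci_overlap(original_ci, anonymized_ci):
    """Fraction of the original 95% CI's length that is covered by the
    CI estimated from the anonymized data (0 = disjoint, 1 = fully covered)."""
    lo = max(original_ci[0], anonymized_ci[0])
    hi = min(original_ci[1], anonymized_ci[1])
    return max(0.0, hi - lo) / (original_ci[1] - original_ci[0])

# Invented intervals: anonymization widened and shifted the estimate slightly.
print(ci_overlap((1.0, 2.0), (1.2, 2.4)))  # 0.8
```

Averaging this quantity over all reported estimates gives a single reproducibility score per anonymization configuration, which is how a value such as "average 95% CI overlap above 90%" can be read.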
Affiliation(s)
- Lisa Pilgram
  - Junior Digital Clinician Scientist Program, Biomedical Innovation Academy, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
  - Department of Nephrology and Medical Intensive Care, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Thierry Meurers
  - Medical Informatics Group, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
- Bradley Malin
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, United States
- Elke Schaeffner
  - Institute of Public Health, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Kai-Uwe Eckardt
  - Department of Nephrology and Medical Intensive Care, Charité-Universitätsmedizin Berlin, Berlin, Germany
  - Department of Nephrology and Hypertension, Universitätsklinikum Erlangen, Friedrich-Alexander University Erlangen-Nürnberg, Erlangen, Germany
- Fabian Prasser
  - Medical Informatics Group, Berlin Institute of Health at Charité-Universitätsmedizin Berlin, Berlin, Germany
2
Moffatt C, Leshin J. Best Practices in Evolving Privacy Frameworks for Patient Age Data: Census Data Study. JMIR Form Res 2024; 8:e47248. [PMID: 38526530] [PMCID: PMC11002729] [DOI: 10.2196/47248]
Abstract
BACKGROUND Over the previous 4 decennial censuses, the population of the United States has grown older, with the proportion of individuals aged at least 90 years in the 2010 census being more than 2 and a half times what it was in the 1980 census. This suggests that the threshold for constraining age introduced in the Safe Harbor method of the Health Insurance Portability and Accountability Act (HIPAA) in 1996 may be increased without exceeding the original levels of risk. This is desirable to maintain or even increase the utility of affected data sets without compromising privacy. OBJECTIVE In light of the upcoming release of 2020 census data, this study presents a straightforward recipe for updating age-constrained thresholds in the context of new census data and derives recommendations for new thresholds from the 2010 census. METHODS Using census data dating back to 1980, we used group size considerations to analyze the risk associated with various maximum age thresholds over time. We inferred the level of risk of the age cutoff of 90 years at the time of HIPAA's inception in 1996 and used this as a baseline from which to recommend updated cutoffs. RESULTS The maximum age threshold may be increased by at least 2 years without exceeding the levels of risk conferred in HIPAA's original recommendations. Moreover, in the presence of additional information that restricts the population in question to a known subgroup with increased longevity (for example, restricting to female patients), the threshold may be increased further. CONCLUSIONS Increasing the maximum age threshold would enable the data user to gain more utility from the data without introducing risk beyond what was originally envisioned with the enactment of HIPAA. Going forward, a recurring update of such thresholds is advised, in line with the considerations detailed in the paper.
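The group-size reasoning behind this recommendation can be sketched as follows: everyone at or above the cutoff is collapsed into one top-coded group, and reidentification risk shrinks as that group's population share grows. The age counts below are invented for illustration (with an exaggerated aging trend), not real census figures, and the `max_safe_cutoff` helper is an assumed simplification of the paper's method.

```python
def top_group_share(age_counts, cutoff):
    """Population share at or above the age cutoff, i.e. the relative size
    of the top-coded group that a Safe Harbor-style rule would create."""
    total = sum(age_counts.values())
    return sum(n for age, n in age_counts.items() if age >= cutoff) / total

def max_safe_cutoff(age_counts, baseline_share):
    """Highest cutoff whose top-coded group is still at least as large
    (as a population share) as the baseline group was."""
    safe = [c for c in sorted(age_counts)
            if top_group_share(age_counts, c) >= baseline_share]
    return max(safe) if safe else None

# Invented counts per age at the tail of the distribution (not census data).
counts_baseline = {88: 500, 89: 400, 90: 300, 91: 200, 92: 100}  # older census
counts_current = {88: 400, 89: 400, 90: 450, 91: 500, 92: 550}   # newer census

baseline = top_group_share(counts_baseline, 90)   # risk level at the old cutoff
print(max_safe_cutoff(counts_current, baseline))  # 91
```

With these invented numbers the 91+ group in the newer census is at least as large a share of the population as the 90+ group was in the baseline census, so the cutoff can rise by a year at the same risk level.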
3
Lee YQ, Chen CT, Chen CC, Lee CH, Chen P, Wu CS, Dai HJ. Unlocking the Secrets Behind Advanced Artificial Intelligence Language Models in Deidentifying Chinese-English Mixed Clinical Text: Development and Validation Study. J Med Internet Res 2024; 26:e48443. [PMID: 38271060] [PMCID: PMC10853853] [DOI: 10.2196/48443]
Abstract
BACKGROUND The widespread use of electronic health records in the clinical and biomedical fields makes the removal of protected health information (PHI) essential to maintain privacy. However, a significant portion of information is recorded in unstructured textual forms, posing a challenge for deidentification. In multilingual countries, medical records may be written in a mixture of more than one language, referred to as code mixing. Most current clinical natural language processing techniques are designed for monolingual text, and there is a need to address the deidentification of code-mixed text. OBJECTIVE The aim of this study was to investigate the effectiveness and underlying mechanism of fine-tuned pretrained language models (PLMs) in identifying PHI in the code-mixed context. Additionally, we aimed to evaluate the potential of prompting large language models (LLMs) to recognize PHI in a zero-shot manner. METHODS We compiled the first clinical code-mixed deidentification data set, consisting of text written in Chinese and English. We explored the effectiveness of fine-tuned PLMs for recognizing PHI in code-mixed content, with a focus on whether PLMs exploit naming regularity and mention coverage to achieve superior performance, by probing the developed models' outputs to examine their decision-making process. Furthermore, we investigated the potential of prompt-based in-context learning of LLMs for recognizing PHI in code-mixed text. RESULTS The developed methods were evaluated on a code-mixed deidentification corpus of 1700 discharge summaries. We observed that different PHI types tended to occur in different types of language-mixed sentences, and that PLMs could effectively recognize PHI by exploiting the learned naming regularity. However, the models may exhibit suboptimal results when this regularity is weak or when mentions contain unknown words that the representations cannot generate well.
We also found that the availability of code-mixed training instances is essential for the model's performance. Furthermore, the LLM-based deidentification method is a feasible and appealing approach that can be controlled and enhanced through natural language prompts. CONCLUSIONS The study contributes to understanding the underlying mechanism of PLMs in addressing the deidentification process in the code-mixed context and highlights the significance of incorporating code-mixed training instances into the model training phase. To support the advancement of research, we created a manipulated subset of the resynthesized data set, which is available for research purposes. Based on the compiled data set, we found that the LLM-based deidentification method is a feasible approach, but carefully crafted prompts are essential to avoid unwanted output. However, the use of such methods in the hospital setting requires careful consideration of data security and privacy concerns. Further research could explore the augmentation of PLMs and LLMs with external knowledge to improve their strength in recognizing rare PHI.
Affiliation(s)
- You-Qian Lee
- Dialogue System Technical Department, Intelligent Robot, Asustek Computer Inc, Taipei, Taiwan
- Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
| | - Ching-Tai Chen
- Department of Bioinformatics and Medical Engineering, Asia University, Taichung, Taiwan
- Center for Precision Health Research, Asia University, Taichung, Taiwan
| | - Chien-Chang Chen
- Electromagnetic Sensing Control and AI Computing System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
| | - Chung-Hong Lee
- Knowledge Discovery and Data Mining Lab, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
| | - Peitsz Chen
- Department of Chemical Engineering, Feng Chia University, Taichung, Taiwan
| | - Chi-Shin Wu
- National Center for Geriatrics and Welfare Research, National Health Research Institutes, Zhunan, Taiwan
| | - Hong-Jie Dai
- Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
- National Institute of Cancer Research, National Health Research Institutes, Tainan, Taiwan
- School of Post-Baccalaureate Medicine, College of Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Center for Big Data Research, Kaohsiung Medical University, Kaohsiung, Taiwan
| |
4
Liu J, Gupta S, Chen A, Wang CK, Mishra P, Dai HJ, Wong ZSY, Jonnagaddala J. OpenDeID Pipeline for Unstructured Electronic Health Record Text Notes Based on Rules and Transformers: Deidentification Algorithm Development and Validation Study. J Med Internet Res 2023; 25:e48145. [PMID: 38055317] [PMCID: PMC10733816] [DOI: 10.2196/48145]
Abstract
BACKGROUND Electronic health records (EHRs) in unstructured formats are valuable sources of information for research in both the clinical and biomedical domains. However, before such records can be used for research purposes, sensitive health information (SHI) must in many cases be removed to protect patient privacy. Rule-based and machine learning-based methods have been shown to be effective in deidentification. However, very few studies have investigated the combination of transformer-based language models and rules. OBJECTIVE The objective of this study is to develop a hybrid deidentification pipeline for Australian EHR text notes using rules and transformers. The study also aims to investigate the impact of pretrained word embedding and transformer-based language models. METHODS In this study, we present a hybrid deidentification pipeline called OpenDeID, which is developed using an Australian multicenter EHR-based corpus called the OpenDeID Corpus. The OpenDeID corpus consists of 2100 pathology reports with 38,414 SHI entities from 1833 patients. The OpenDeID pipeline incorporates a hybrid approach of associative rules, supervised deep learning, and pretrained language models. RESULTS OpenDeID achieved a best F1-score of 0.9659 by fine-tuning the Discharge Summary BioBERT model and incorporating various preprocessing and postprocessing rules. The OpenDeID pipeline has been deployed at a large tertiary teaching hospital and has processed over 8000 unstructured EHR text notes in real time. CONCLUSIONS The OpenDeID pipeline is a hybrid pipeline for deidentifying SHI entities in unstructured EHR text notes, evaluated on a large multicenter corpus. External validation will be undertaken as part of our future work to evaluate its effectiveness.
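Strict entity-level evaluation, the metric family reported for pipelines such as OpenDeID, counts a predicted entity as correct only when both its span and type exactly match a gold annotation. A minimal sketch with invented annotations (entities as (start, end, type) tuples):

```python
def strict_entity_scores(gold, pred):
    """Strict entity-level precision/recall/F1: a predicted entity counts
    only on an exact (start, end, type) match with a gold annotation."""
    gold, pred = set(gold), set(pred)
    tp = len(gold & pred)  # exact-match true positives
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# Invented annotations: (start offset, end offset, SHI/PHI type).
gold = [(0, 10, "NAME"), (25, 35, "DATE"), (50, 58, "ID")]
pred = [(0, 10, "NAME"), (25, 35, "DATE"), (60, 70, "ID")]  # one span off
precision, recall, f1 = strict_entity_scores(gold, pred)
print(round(f1, 4))  # 0.6667
```

Under this strict regime, a prediction with the right type but a boundary off by one character scores zero, which is why strict F1 is a conservative measure of deidentification quality.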
Affiliation(s)
- Jiaxing Liu
  - School of Statistics and Mathematics, Zhongnan University of Economics and Law, Wuhan, China
- Aipeng Chen
  - School of Computer Science and Engineering, UNSW, Sydney, Australia
- Chen-Kai Wang
  - Department of Computer Science, National Yang Ming Chiao Tung University, Hsinchu, Taiwan
- Hong-Jie Dai
  - School of Post-Baccalaureate Medicine, Kaohsiung Medical University, Kaohsiung, Taiwan
- Zoie Shui-Yee Wong
  - Graduate School of Public Health, St. Luke's International University, Tokyo, Japan
  - The Kirby Institute, University of New South Wales, Sydney, Australia
- Jitendra Jonnagaddala
  - School of Population Health, UNSW Sydney, Kensington, Australia
  - NMC Royal Hospital, Khalifa City, Abu Dhabi, United Arab Emirates
5
Patel R, Provenzano D, Loew M. Anonymization and validation of three-dimensional volumetric renderings of computed tomography data using commercially available T1-weighted magnetic resonance imaging-based algorithms. J Med Imaging (Bellingham) 2023; 10:066501. [PMID: 38074629] [PMCID: PMC10704182] [DOI: 10.1117/1.jmi.10.6.066501]
Abstract
Purpose Previous studies have demonstrated that three-dimensional (3D) volumetric renderings of magnetic resonance imaging (MRI) brain data can be used to identify patients using facial recognition. We have shown that facial features can be identified on simulation-computed tomography (CT) images for radiation oncology and mapped to face images from a database. We aim to determine whether CT images can be anonymized using anonymization software that was designed for T1-weighted MRI data. Approach Our study examines (1) the ability of off-the-shelf anonymization algorithms to anonymize CT data and (2) the ability of facial recognition algorithms to identify whether faces could be detected from a database of facial images. Our study generated 3D renderings from 57 head CT scans from The Cancer Imaging Archive database. Data were anonymized using AFNI (deface, reface, and 3Dskullstrip) and FSL's BET. Anonymized data were compared to the original renderings and passed through facial recognition algorithms (VGG-Face, FaceNet, DLib, and SFace) using a facial database (labeled faces in the wild) to determine what matches could be found. Results Our study found that all modules were able to process CT data and that AFNI's 3Dskullstrip and FSL's BET data consistently showed lower reidentification rates compared to the original. Conclusions The results from this study highlight the potential usage of anonymization algorithms as a clinical standard for deidentifying brain CT data. Our study demonstrates the importance of continued vigilance for patient privacy in publicly shared datasets and the importance of continued evaluation of anonymization methods for CT data.
Affiliation(s)
- Rahil Patel
  - George Washington University School of Engineering and Applied Science, Department of Biomedical Engineering, Washington, District of Columbia, United States
- Destie Provenzano
  - George Washington University School of Engineering and Applied Science, Department of Biomedical Engineering, Washington, District of Columbia, United States
- Murray Loew
  - George Washington University School of Engineering and Applied Science, Department of Biomedical Engineering, Washington, District of Columbia, United States
6
Liu L, Perez-Concha O, Nguyen A, Bennett V, Blake V, Gallego B, Jorm L. Web-Based Application Based on Human-in-the-Loop Deep Learning for Deidentifying Free-Text Data in Electronic Medical Records: Development and Usability Study. Interact J Med Res 2023; 12:e46322. [PMID: 37624624] [PMCID: PMC10492176] [DOI: 10.2196/46322]
Abstract
BACKGROUND The narrative free-text data in electronic medical records (EMRs) contain valuable clinical information for analysis and research to inform better patient care. However, the release of free text for secondary use is hindered by concerns surrounding personally identifiable information (PII), as protecting individuals' privacy is paramount. Therefore, it is necessary to deidentify free text to remove PII. Manual deidentification is a time-consuming and labor-intensive process. Numerous automated deidentification approaches and systems have been attempted to overcome this challenge over the past decade. OBJECTIVE We sought to develop an accurate, web-based system for deidentifying free text (DEFT), which can be readily and easily adopted in real-world settings for deidentification of free text in EMRs. The system has several key features, including a simple and task-focused web user interface, customized PII types, use of a state-of-the-art deep learning model for tagging PII in free text, preannotation by an interactive learning loop, rapid manual annotation with autosave, support for project management and team collaboration, user access control, and central data storage. METHODS DEFT comprises frontend and backend modules and communicates with central data storage through filesystem path access. The frontend web user interface provides end users with a user-friendly workspace for managing and annotating free text. The backend module processes the requests from the frontend and performs the relevant persistence operations. DEFT manages the deidentification workflow as a project, which can contain one or more data sets. Customized PII types and user access control can also be configured. The deep learning model is based on a Bidirectional Long Short-Term Memory-Conditional Random Field (BiLSTM-CRF) with RoBERTa as the word embedding layer.
The interactive learning loop is further integrated into DEFT to speed up the deidentification process and increase its performance over time. RESULTS DEFT has many advantages over existing deidentification systems in terms of its support for project management, user access control, data management, and an interactive learning process. On the 2014 i2b2 data set, DEFT obtained the highest performance compared to 5 benchmark models, with a microaverage strict entity-level recall of 0.9563 and an F1-score of 0.9627. In a real-world use case of deidentifying clinical notes extracted from 1 referral hospital in Sydney, New South Wales, Australia, DEFT achieved a high microaverage strict entity-level F1-score of 0.9507 on a corpus of 600 annotated clinical notes. Moreover, the manual annotation process with preannotation demonstrated a 43% increase in work efficiency compared to the process without preannotation. CONCLUSIONS DEFT is designed for health domain researchers and data custodians to easily deidentify free text in EMRs. DEFT supports an interactive learning loop, and end users with minimal technical knowledge can perform the deidentification work with only a shallow learning curve.
Affiliation(s)
- Leibo Liu
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- Oscar Perez-Concha
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- Anthony Nguyen
  - Australian e-Health Research Centre (AEHRC), Commonwealth Scientific and Industrial Research Organisation (CSIRO), Brisbane, Australia
- Vicki Bennett
  - Metadata, Information Management and Classifications Unit (MIMCU), Australian Institute of Health and Welfare, Canberra, Australia
- Victoria Blake
  - Eastern Heart Clinic, Prince of Wales Hospital, Randwick, Australia
- Blanca Gallego
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
- Louisa Jorm
  - Centre for Big Data Research in Health, University of New South Wales, Sydney, Australia
7
Chambon PJ, Wu C, Steinkamp JM, Adleberg J, Cook TS, Langlotz CP. Automated deidentification of radiology reports combining transformer and "hide in plain sight" rule-based methods. J Am Med Inform Assoc 2023; 30:318-328. [PMID: 36416419] [PMCID: PMC9846681] [DOI: 10.1093/jamia/ocac219]
Abstract
OBJECTIVE To develop an automated deidentification pipeline for radiology reports that detects protected health information (PHI) entities and replaces them with realistic surrogates "hiding in plain sight." MATERIALS AND METHODS In this retrospective study, 999 chest X-ray and CT reports collected between November 2019 and November 2020 were annotated for PHI at the token level and combined with 3001 X-rays and 2193 medical notes previously labeled, forming a large multi-institutional and cross-domain dataset of 6193 documents. Two radiology test sets, from a known and a new institution, as well as the i2b2 2006 and 2014 test sets, served as an evaluation set to estimate model performance and to compare it with previously released deidentification tools. Several PHI detection models were developed based on different training datasets, fine-tuning approaches, data augmentation techniques, and a synthetic PHI generation algorithm. These models were compared using metrics such as precision, recall, and F1 score, as well as paired samples Wilcoxon tests. RESULTS Our best PHI detection model achieves an F1 score of 97.9 on radiology reports from a known institution, 99.6 from a new institution, 99.5 on i2b2 2006, and 98.9 on i2b2 2014. On reports from a known institution, it achieves 99.1 recall in detecting the core of each PHI span. DISCUSSION Our model outperforms all deidentifiers it was compared to on all test sets, as well as human labelers on i2b2 2014 data. It enables accurate and automatic deidentification of radiology reports. CONCLUSIONS A transformer-based deidentification pipeline can achieve state-of-the-art performance for deidentifying radiology reports and other medical documents.
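The "hide in plain sight" idea can be sketched as follows: each detected PHI span is replaced with a realistic surrogate, so that any PHI the detector misses blends in with the substitutions rather than standing out as the only real identifier. The surrogate pools, report text, and span offsets below are invented for illustration; real systems typically also keep surrogates consistent within a document and shift dates by a per-patient offset.

```python
import random

# Invented surrogate pools; a real system would draw from large name/date lists.
SURROGATES = {
    "NAME": ["John Carter", "Maria Lopez"],
    "DATE": ["03/14/2021", "11/02/2019"],
}

def hide_in_plain_sight(text, spans, seed=0):
    """Replace each detected PHI span (start, end, type) with a realistic
    surrogate, working right-to-left so earlier offsets stay valid."""
    rng = random.Random(seed)
    for start, end, phi_type in sorted(spans, reverse=True):
        text = text[:start] + rng.choice(SURROGATES[phi_type]) + text[end:]
    return text

report = "Patient Jane Doe was scanned on 01/05/2020."
spans = [(8, 16, "NAME"), (32, 42, "DATE")]  # invented detector output
print(hide_in_plain_sight(report, spans))
```

Because the output still reads like an ordinary report, a leaked span that the detector missed is much harder for an attacker to distinguish from a surrogate.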
Affiliation(s)
- Pierre J Chambon
  - Department of Radiology, Stanford University, Stanford, California, USA
  - Department of Applied Mathematics and Engineering, Paris-Saclay University, Ecole Centrale Paris, Paris, France
- Christopher Wu
  - Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jackson M Steinkamp
  - Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Jason Adleberg
  - Department of Radiology, Mount Sinai Health System, New York, New York, USA
- Tessa S Cook
  - Department of Radiology, University of Pennsylvania, Philadelphia, Pennsylvania, USA
- Curtis P Langlotz
  - Department of Radiology, Stanford University, Stanford, California, USA
8
Bruña R, Vaghari D, Greve A, Cooper E, Mada MO, Henson RN. Modified MRI Anonymization (De-Facing) for Improved MEG Coregistration. Bioengineering (Basel) 2022; 9:591. [PMID: 36290559] [PMCID: PMC9598466] [DOI: 10.3390/bioengineering9100591]
Abstract
Localising the sources of MEG/EEG signals often requires a structural MRI to create a head model, while ensuring reproducible scientific results requires sharing data and code. However, sharing structural MRI data often requires the face to be hidden to help protect the identity of the individuals concerned. While automated de-facing methods exist, they tend to remove the whole face, which can impair methods for coregistering the MRI data with the EEG/MEG data. We show that a new, automated de-facing method that retains the nose maintains good MRI-MEG/EEG coregistration. Importantly, behavioural data show that this "face-trimming" method does not increase levels of identification relative to a standard de-facing approach, and has less effect on the automated segmentation and surface extraction sometimes used to create head models for MEG/EEG localisation. We suggest that this trimming approach could be employed for future sharing of structural MRI data, at least for those to be used in forward modelling (source reconstruction) of EEG/MEG data.
Affiliation(s)
- Ricardo Bruña
  - Center for Cognitive and Computational Neuroscience, Universidad Complutense de Madrid, 28040 Madrid, Spain
  - Department of Radiology, Rehabilitation and Physical Therapy, Universidad Complutense de Madrid, IdISSC, 28040 Madrid, Spain
- Delshad Vaghari
  - Department of Electrical & Computer Engineering, Tarbiat Modares University, Tehran P.O. Box 14115-111, Iran
- Andrea Greve
  - Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge CB2 7EF, UK
- Elisa Cooper
  - Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge CB2 7EF, UK
- Marius O. Mada
  - Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge CB2 7EF, UK
- Richard N. Henson
  - Medical Research Council Cognition and Brain Sciences Unit, University of Cambridge, Cambridge CB2 7EF, UK
  - Department of Psychiatry, University of Cambridge, Cambridge CB2 OSZ, UK
9
Nanda JK, Marchetti MA. Consent and Deidentification of Patient Images in Dermatology Journals: Observational Study. JMIR Dermatol 2022; 5:e37398. [PMID: 36777646] [PMCID: PMC9910807] [DOI: 10.2196/37398]
Affiliation(s)
- Japbani K Nanda
  - Dermatology Service, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, United States
- Michael Armando Marchetti
  - Dermatology Service, Department of Medicine, Memorial Sloan Kettering Cancer Center, New York, NY, United States
10
Parobek CM, Thorsen MM, Has P, Lorenzi P, Clark MA, Russo ML, Lewkowitz AK. Video education about genetic privacy and patient perspectives about sharing prenatal genetic data: a randomized trial. Am J Obstet Gynecol 2022; 227:87.e1-87.e13. [PMID: 35351406] [DOI: 10.1016/j.ajog.2022.03.047]
Abstract
BACKGROUND Laboratories offering cell-free DNA screening often reserve the right to share prenatal genetic data for research or even commercial purposes, and obtain this permission on the patient consent form. Although it is known that nonpregnant patients are often reluctant to share their genetic data for research, pregnant patients' knowledge of, and opinions about, genetic data privacy are unknown. OBJECTIVE We investigated whether pregnant patients who had already undergone cell-free DNA screening were aware that genetic data derived from cell-free DNA may be shared for research. Furthermore, we examined whether pregnant patients exposed to video education about the Genetic Information Nondiscrimination Act (a federal law that mandates workplace and health insurance protections against genetic discrimination) were more willing to share cell-free DNA-related genetic data for research than pregnant patients who were unexposed. STUDY DESIGN In this randomized controlled trial (ClinicalTrials.gov Identifier: NCT04420858), English-speaking patients with singleton pregnancies who underwent cell-free DNA screening and subsequently presented at 17 0/7 to 23 6/7 weeks of gestation for a detailed anatomy scan were randomized 1:1 to a control or intervention group. Both groups viewed an infographic about cell-free DNA. In addition, the intervention group viewed an educational video about the Genetic Information Nondiscrimination Act. The primary outcomes were knowledge about, and willingness to share, prenatal genetic data from cell-free DNA by commercial laboratories for nonclinical purposes, such as research. The secondary outcomes included knowledge about existing genetic privacy laws, knowledge about the potential for reidentification of anonymized genetic data, and acceptability of various use and sharing scenarios for prenatal genetic data. Eighty-one participants per group were required for 80% power to detect an increase in willingness to share data from 60% to 80% (α=0.05).
RESULTS A total of 747 pregnant patients were screened, and 213 were deemed eligible and approached for potential study participation. Of these, 163 (76.5%) consented and were randomized; one participant discontinued the intervention, and two were excluded from analysis after the intervention when it was discovered that they did not fulfill all eligibility criteria. Overall, 160 (75.1%) of those approached were included in the final analysis. Most patients in the control (72 [90.0%]) and intervention (76 [97.4%]) groups were either unsure about or incorrectly thought that cell-free DNA companies could not share prenatal genetic data for research. Participants in the intervention group were more likely than those in the control group to incorrectly believe that their prenatal genetic data would not be shared for nonclinical purposes (46.2% vs 28.8%; P=.03). However, video education did not increase participants' willingness to share genetic data in multiple scenarios. Non-White participants were less willing than White participants to allow sharing of genetic data specifically for academic research (P<.001). CONCLUSION Most participants were unaware that their prenatal genetic data may be used for nonclinical purposes. Pregnant patients who were educated about the Genetic Information Nondiscrimination Act were not more willing to share genetic data than those who did not receive this education. Surprisingly, video education about the Genetic Information Nondiscrimination Act led patients to falsely believe that their data would not be shared for research, and participants who identified as racial minorities were less willing to share genetic data. New strategies are needed to improve pregnant patients' understanding of genetic privacy.
|
11
|
Lee K, Dobbins NJ, McInnes B, Yetisgen M, Uzuner Ö. Transferability of neural network clinical deidentification systems. J Am Med Inform Assoc 2021; 28:2661-2669. [PMID: 34586386 DOI: 10.1093/jamia/ocab207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/05/2021] [Revised: 07/19/2021] [Accepted: 09/10/2021] [Indexed: 11/14/2022] Open
Abstract
OBJECTIVE Neural network deidentification studies have focused on individual datasets. These studies assume the availability of a sufficient amount of human-annotated data to train models that can generalize to corresponding test data. In real-world situations, however, researchers often have limited or no in-house training data. Existing systems and external data can help jump-start deidentification on in-house data; however, the most efficient way of utilizing existing systems and external data is unclear. This article investigates the transferability of a state-of-the-art neural clinical deidentification system, NeuroNER, across a variety of datasets, when it is modified architecturally for domain generalization and when it is trained strategically for domain transfer. MATERIALS AND METHODS We conducted a comparative study of the transferability of NeuroNER using 4 clinical note corpora with multiple note types from 2 institutions. We modified NeuroNER architecturally to integrate 2 types of domain generalization approaches. We evaluated each architecture using 3 training strategies. We measured transferability from external sources; transferability across note types; the contribution of external source data when in-domain training data are available; and transferability across institutions. RESULTS AND CONCLUSIONS Transferability from a single external source gave inconsistent results. Using additional external sources consistently yielded an F1-score of approximately 80%. Fine-tuning emerged as a dominant transfer strategy, with or without domain generalization. We also found that external sources were useful even in cases where in-domain training data were available. Transferability across institutions differed by note type and annotation label but resulted in improved performance.
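The transferability results above are reported as F1-scores over predicted PII spans. As a minimal, self-contained illustration of that metric family (exact-span matching is assumed here; the study's actual scoring criteria may differ):

```python
# Span-level precision, recall, and F1 for deidentification output.
# Each span is a (start, end, label) tuple; exact match is required.
def span_f1(gold: set, pred: set) -> tuple[float, float, float]:
    tp = len(gold & pred)  # true positives: spans predicted exactly right
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Two of three gold PII spans are recovered; one prediction is spurious.
gold = {(0, 4, "NAME"), (10, 20, "DATE"), (25, 30, "ID")}
pred = {(0, 4, "NAME"), (10, 20, "DATE"), (40, 45, "ID")}
print(span_f1(gold, pred))
```

With two true positives out of three predictions and three gold spans, precision, recall, and F1 all come out to 2/3, mirroring how an "approximately 80%" F1 in the study summarizes both missed and spurious identifiers.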
Affiliation(s)
- Kahyun Lee
- Department of Information Science and Technology, George Mason University, Fairfax, Virginia, USA
- Nicholas J Dobbins
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, USA
- Bridget McInnes
- Department of Computer Science, Virginia Commonwealth University, Richmond, Virginia, USA
- Meliha Yetisgen
- Department of Biomedical Informatics and Medical Education, University of Washington, Seattle, Washington, USA
- Özlem Uzuner
- Department of Information Science and Technology, George Mason University, Fairfax, Virginia, USA
|
12
|
Gupta A, Lai A, Mozersky J, Ma X, Walsh H, DuBois JM. Enabling qualitative research data sharing using a natural language processing pipeline for deidentification: moving beyond HIPAA Safe Harbor identifiers. JAMIA Open 2021; 4:ooab069. [PMID: 34435175 PMCID: PMC8382275 DOI: 10.1093/jamiaopen/ooab069] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/10/2021] [Revised: 07/25/2021] [Accepted: 08/10/2021] [Indexed: 11/20/2022] Open
Abstract
Objective Sharing health research data is essential for accelerating the translation of research into actionable knowledge that can impact health care services and outcomes. Qualitative health research data are rarely shared due to the challenge of deidentifying text and the potential risks of participant reidentification. Here, we establish and evaluate a framework for deidentifying qualitative research data using automated computational techniques including removal of identifiers that are not considered HIPAA Safe Harbor (HSH) identifiers but are likely to be found in unstructured qualitative data. Materials and Methods We developed and validated a pipeline for deidentifying qualitative research data using automated computational techniques. An in-depth analysis and qualitative review of different types of qualitative health research data were conducted to inform and evaluate the development of a natural language processing (NLP) pipeline using named-entity recognition, pattern matching, dictionary, and regular expression methods to deidentify qualitative texts. Results We collected 2 datasets with 1.2 million words derived from over 400 qualitative research data documents. We created a gold-standard dataset with 280K words (70 files) to evaluate our deidentification pipeline. The majority of identifiers in qualitative data are non-HSH and not captured by existing systems. Our NLP deidentification pipeline had a consistent F1-score of ∼0.90 for both datasets. Conclusion The results of this study demonstrate that NLP methods can be used to identify both HSH identifiers and non-HSH identifiers. Automated tools to assist researchers with the deidentification of qualitative data will be increasingly important given the new National Institutes of Health (NIH) data-sharing mandate.
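A toy sketch of the dictionary and regular-expression components of such a pipeline follows. The patterns and name list here are invented for illustration, and the authors' actual system also uses trained named-entity recognition, which is omitted:

```python
import re

# Hypothetical site-specific name dictionary and identifier patterns.
NAME_DICTIONARY = {"alice", "bob"}
PATTERNS = {
    "PHONE": re.compile(r"\b\d{3}[-.]\d{3}[-.]\d{4}\b"),
    "DATE": re.compile(r"\b\d{1,2}/\d{1,2}/\d{4}\b"),
}

def deidentify(text: str) -> str:
    # Regular-expression pass: replace structured identifiers with tags.
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    # Dictionary pass: replace known names, matching case-insensitively
    # after stripping trailing punctuation.
    words = []
    for token in text.split(" "):
        stripped = token.strip(".,;:").lower()
        words.append("[NAME]" if stripped in NAME_DICTIONARY else token)
    return " ".join(words)

print(deidentify("Alice called 555-123-4567 on 3/14/2021."))
# prints: [NAME] called [PHONE] on [DATE].
```

Real qualitative transcripts also contain the non-Safe-Harbor identifiers the study highlights (employers, place names, organizational affiliations), which is why the full pipeline layers named-entity recognition on top of these simpler passes.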
Affiliation(s)
- Aditi Gupta
- Institute for Informatics, Washington University, St. Louis, Missouri, USA
- Albert Lai
- Institute for Informatics, Washington University, St. Louis, Missouri, USA
- Jessica Mozersky
- Bioethics Research Center, Division of General Medical Sciences, Washington University, St. Louis, Missouri, USA
- Xiaoteng Ma
- Institute for Informatics, Washington University, St. Louis, Missouri, USA
- Heidi Walsh
- Bioethics Research Center, Division of General Medical Sciences, Washington University, St. Louis, Missouri, USA
- James M DuBois
- Bioethics Research Center, Division of General Medical Sciences, Washington University, St. Louis, Missouri, USA
|
13
|
Zhao Z, Yang M, Tang B, Zhao T. Re-examination of Rule-Based Methods in Deidentification of Electronic Health Records: Algorithm Development and Validation. JMIR Med Inform 2020; 8:e17622. [PMID: 32352384 PMCID: PMC7226054 DOI: 10.2196/17622] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/30/2019] [Revised: 02/28/2020] [Accepted: 03/11/2020] [Indexed: 11/28/2022] Open
Abstract
Background Deidentification of clinical records is a critical step before their publication. It is usually treated as a sequence labeling task, and ensemble learning is among the best performing solutions. Within a multi-learner ensemble framework, the value of including a rule-based learner remains an open issue. Objective The aim of this study was to investigate whether a rule-based learner is useful in a hybrid deidentification system and to offer suggestions on how to build and integrate such a learner. Methods We chose a data-driven rule learner named transformation-based error-driven learning (TBED) and integrated it into the best performing hybrid system for this task. Results On the popular Informatics for Integrating Biology and the Bedside (i2b2) deidentification data set, experiments showed that TBED offers high performance with its generated rules, and integrating the rule-based model into the ensemble framework achieved an F1-score of 96.76%, the best performance reported in the community. Conclusions We demonstrated that the rule-based method contributes effectively to the current ensemble learning approach for deidentification of clinical records. Such a rule system can be learned automatically by TBED, avoiding the high cost and low reliability of manual rule composition. In particular, boosting the ensemble model with rules produced the best reported performance on deidentification of clinical records.
Affiliation(s)
- Zhenyu Zhao
- Harbin Institute of Technology, Harbin, China
- Muyun Yang
- Harbin Institute of Technology, Harbin, China
- Buzhou Tang
- Harbin Institute of Technology, Shenzhen, China
- Tiejun Zhao
- Harbin Institute of Technology, Harbin, China
|
14
|
Johnson AEW, Bulgarelli L, Pollard TJ. Deidentification of free-text medical records using pre-trained bidirectional transformers. Proc ACM Conf Health Inference Learn (2020) 2020; 2020:214-221. [PMID: 34350426 PMCID: PMC8330601 DOI: 10.1145/3368555.3384455] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/30/2022]
Abstract
The ability of caregivers and investigators to share patient data is fundamental to many areas of clinical practice and biomedical research. Prior to sharing, it is often necessary to remove identifiers such as names, contact details, and dates to protect patient privacy. Deidentification, the process of removing identifiers, is challenging, however. High-quality annotated data for developing models are scarce; many target identifiers are highly heterogeneous (for example, there are countless variations of patient names); and in practice anything less than perfect sensitivity may be considered a failure. As a result, patient data are often withheld when sharing would be beneficial, and identifiable patient data are often divulged when a deidentified version would suffice. In recent years, advances in machine learning have led to rapid performance improvements in natural language processing tasks, in particular with the advent of large-scale pretrained language models. In this paper we develop and evaluate an approach for deidentification of clinical notes based on a bidirectional transformer model. We propose human-interpretable evaluation measures and demonstrate state-of-the-art performance against modern baseline models. Finally, we highlight current challenges in deidentification, including the absence of clear annotation guidelines, the lack of portability of models, and the paucity of training data. Code to develop our model is open source, allowing for broad reuse.
Affiliation(s)
- Tom J Pollard
- Massachusetts Institute of Technology, Cambridge, MA, USA
|
15
|
Carrell DS, Cronkite DJ, Li M(R), Nyemba S, Malin BA, Aberdeen JS, Hirschman L. The machine giveth and the machine taketh away: a parrot attack on clinical text deidentified with hiding in plain sight. J Am Med Inform Assoc 2019; 26:1536-1544. [PMID: 31390016 PMCID: PMC6857511 DOI: 10.1093/jamia/ocz114] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/07/2019] [Revised: 05/08/2019] [Accepted: 06/13/2019] [Indexed: 12/12/2022] Open
Abstract
OBJECTIVE Clinical corpora can be deidentified using a combination of machine-learned automated taggers and hiding in plain sight (HIPS) resynthesis. The latter replaces detected personally identifiable information (PII) with random surrogates, allowing leaked PII to blend in or "hide in plain sight." We evaluated the extent to which a malicious attacker could expose leaked PII in such a corpus. MATERIALS AND METHODS We modeled a scenario where an institution (the defender) externally shared an 800-note corpus of actual outpatient clinical encounter notes from a large, integrated health care delivery system in Washington State. These notes were deidentified by a machine-learned PII tagger and HIPS resynthesis. A malicious attacker obtained and performed a parrot attack intending to expose leaked PII in this corpus. Specifically, the attacker mimicked the defender's process by manually annotating all PII-like content in half of the released corpus, training a PII tagger on these data, and using the trained model to tag the remaining encounter notes. The attacker hypothesized that untagged identifiers would be leaked PII, discoverable by manual review. We evaluated the attacker's success using measures of leak-detection rate and accuracy. RESULTS The attacker correctly hypothesized that 211 (68%) of 310 actual PII leaks in the corpus were leaks, and wrongly hypothesized that 191 resynthesized PII instances were also leaks. One-third of actual leaks remained undetected. DISCUSSION AND CONCLUSION A malicious parrot attack to reveal leaked PII in clinical text deidentified by machine-learned HIPS resynthesis can attenuate but not eliminate the protective effect of HIPS deidentification.
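The HIPS resynthesis step that the parrot attack targets can be sketched as follows. The surrogate pools and the (start, end, label) span format are illustrative assumptions, not the defender's actual system; the point is that missed (leaked) PII becomes hard to distinguish from the random surrogates:

```python
import random

# Hypothetical surrogate pools for each PII category.
SURROGATES = {
    "NAME": ["John Carter", "Maria Lopez", "Wei Zhang"],
    "CITY": ["Spokane", "Tacoma", "Everett"],
}

def resynthesize(text: str, spans: list, seed: int = 0) -> str:
    """Replace each tagged (start, end, label) span with a random surrogate.

    Any PII the tagger missed is left verbatim and 'hides in plain sight'
    among the realistic-looking surrogates.
    """
    rng = random.Random(seed)
    out, cursor = [], 0
    for start, end, label in sorted(spans):
        out.append(text[cursor:start])        # untouched text before span
        out.append(rng.choice(SURROGATES[label]))  # surrogate replacement
        cursor = end
    out.append(text[cursor:])                 # untouched tail
    return "".join(out)

note = "Patient Jane Doe lives in Seattle."
tagged = [(8, 16, "NAME"), (26, 33, "CITY")]  # spans found by a PII tagger
print(resynthesize(note, tagged))
```

A parrot attacker trains their own tagger on the released corpus; tokens that look like PII but were not drawn from surrogate pools are candidate leaks, which is exactly the hypothesis evaluated above.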
Affiliation(s)
- David S Carrell
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
- David J Cronkite
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
- Steve Nyemba
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, Tennessee, USA
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, Tennessee, USA
|
16
|
Chevrier R, Foufi V, Gaudet-Blavignac C, Robert A, Lovis C. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. J Med Internet Res 2019; 21:e13484. [PMID: 31152528 PMCID: PMC6658290 DOI: 10.2196/13484] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/24/2019] [Revised: 03/29/2019] [Accepted: 04/26/2019] [Indexed: 01/19/2023] Open
Abstract
Background The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients’ privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects’ privacy on one side, and the benefit of scientific advances on the other. Objective This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers. Methods Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently. 
Results After searching 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32%, each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data. Conclusions Interest is growing for privacy-enhancing techniques in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine. The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several legislations, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. 
Using the definitions they provide could help address the variable use of these two concepts in the research community.
Affiliation(s)
- Raphaël Chevrier
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Vasiliki Foufi
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christophe Gaudet-Blavignac
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Arnaud Robert
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christian Lovis
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
|
17
|
Ismail M, Philbin J. Fast processing of digital imaging and communications in medicine (DICOM) metadata using multiseries DICOM format. J Med Imaging (Bellingham) 2015; 2:026501. [PMID: 26158117 DOI: 10.1117/1.jmi.2.2.026501] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/18/2014] [Accepted: 05/08/2015] [Indexed: 11/14/2022] Open
Abstract
The digital imaging and communications in medicine (DICOM) information model combines pixel data and its metadata in a single object. Some user scenarios need only metadata manipulation, such as deidentification and study migration. Most picture archiving and communication systems use a database to store and update the metadata rather than updating the raw DICOM files themselves. The multiseries DICOM (MSD) format separates metadata from pixel data and eliminates duplicate attributes. This work promotes storing DICOM studies in MSD format to reduce metadata processing time. A set of experiments was performed that updated the metadata of a set of DICOM studies for deidentification and migration. The studies were stored in both the traditional single frame DICOM (SFD) format and the MSD format. The results show that it is faster to update studies' metadata in MSD format than in SFD format because the bulk data are separated in MSD and are not retrieved from the storage system. In addition, it is space efficient to store deidentified studies in MSD format, as they share the same bulk data object with the original study. In summary, separation of metadata from pixel data using the MSD format provides fast metadata access and speeds up applications that process only the metadata.
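The space savings reported above come from sharing the bulk data object between the original and deidentified studies. A toy model of that idea (the class and attribute names are illustrative, not DICOM or the MSD format itself):

```python
import copy

class Study:
    """Toy study with metadata kept separate from a reference to bulk data."""

    def __init__(self, metadata: dict, pixel_ref: str):
        self.metadata = metadata      # small, frequently edited
        self.pixel_ref = pixel_ref    # reference to a shared bulk data object

    def deidentified(self) -> "Study":
        # Metadata-only operation: copy and scrub the metadata, but point
        # at the SAME bulk data object so no pixel data is read or copied.
        meta = copy.deepcopy(self.metadata)
        for key in ("PatientName", "PatientID"):
            meta.pop(key, None)
        return Study(meta, self.pixel_ref)

s = Study({"PatientName": "DOE^JANE", "PatientID": "123", "Modality": "CT"},
          pixel_ref="blob://bulk/0001")
d = s.deidentified()
print(d.metadata, d.pixel_ref)
```

In the single-frame layout, by contrast, the equivalent update would have to read and rewrite every file containing the (large) pixel payload, which is the cost the experiments measure.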
Affiliation(s)
- Mahmoud Ismail
- Johns Hopkins University, Department of Computer Science, 3400 N. Charles Street, Baltimore, Maryland 21218, United States
- James Philbin
- Johns Hopkins University, Department of Radiology, 5801 Smith Avenue, McCauley Building, Suite 100, Baltimore, Maryland 21209, United States
|
18
|
Clunie DA, Gebow D. Block selective redaction for minimizing loss during de-identification of burned in text in irreversibly compressed JPEG medical images. J Med Imaging (Bellingham) 2015; 2:016501. [PMID: 26158090 DOI: 10.1117/1.jmi.2.1.016501] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [What about the content of this article? (0)] [Affiliation(s)] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/17/2014] [Accepted: 03/03/2015] [Indexed: 11/14/2022] Open
Abstract
Deidentification of medical images requires attention both to header information and to the pixel data itself, in which burned-in text may be present. If the pixel data to be deidentified are stored in compressed form, traditionally the data are decompressed, identifying text is redacted, and, if necessary, the pixel data are recompressed. Decompression without recompression may result in images of excessive or intractable size. Recompression with an irreversible scheme is undesirable because it may cause additional loss in the diagnostically relevant regions of the images. The irreversible (lossy) JPEG compression scheme works on small blocks of the image independently; hence, redaction can be selectively confined to only those blocks containing identifying text, leaving all other blocks unchanged. An open source implementation of selective redaction and a demonstration of its applicability to multiframe color ultrasound images are described. The process can be applied either to standalone JPEG images or to JPEG bit streams encapsulated in other formats, which, in the case of medical images, usually means DICOM.
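The core arithmetic behind block-selective redaction can be sketched as follows: map a pixel-space text rectangle to the set of 8x8 JPEG coding blocks it touches, so that only those blocks need to be decoded, blanked, and re-encoded. This is an illustration of the block-alignment idea only; a real implementation, like the one described above, must also respect MCU boundaries and chroma subsampling:

```python
BLOCK = 8  # JPEG DCT coding operates on 8x8 sample blocks

def blocks_to_redact(x0: int, y0: int, x1: int, y1: int) -> set:
    """Return (block_col, block_row) indices covered by rect [x0,x1) x [y0,y1)."""
    cols = range(x0 // BLOCK, (x1 - 1) // BLOCK + 1)
    rows = range(y0 // BLOCK, (y1 - 1) // BLOCK + 1)
    return {(c, r) for c in cols for r in rows}

# A 20x10 pixel burned-in text region starting at (12, 4) touches 6 blocks;
# only those 6 are altered, so loss is confined to the redacted area.
print(len(blocks_to_redact(12, 4, 32, 14)))
# prints: 6
```

Because every other block's compressed bit stream is copied through untouched, the diagnostically relevant regions incur no additional generation loss.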
Affiliation(s)
- David A Clunie
- PixelMed, 943 Heiden Road, Bangor, Pennsylvania 18013, United States
- Dan Gebow
- MDDX Research and Informatics, 580 California Street, Fl 16, San Francisco, California 94104, United States
|
19
|
Abstract
OBJECTIVE As the use of medical images in applications other than direct patient care increases, the need for deidentified images grows. Federal regulations govern the requirements for deidentification, and software developers offer several methods for deidentification. CONCLUSION However, there are numerous ways for protected health information to be included in images other than in DICOM headers. Either such information must be obscured or the images containing the information must be deleted to comply with deidentification requirements.
|