1. A Method for Efficient De-identification of DICOM Metadata and Burned-in Pixel Text. J Imaging Inform Med 2024. PMID: 38587767; DOI: 10.1007/s10278-024-01098-7.
Abstract
De-identification of DICOM images is an essential component of medical image research. While many established methods exist for the safe removal of protected health information (PHI) in DICOM metadata, approaches for the removal of PHI "burned-in" to image pixel data are typically manual, and automated high-throughput approaches are not well validated. Emerging optical character recognition (OCR) models can potentially detect and remove PHI-bearing text from medical images but are very time-consuming to run on the high volume of images found in typical research studies. We present a data processing method that performs metadata de-identification for all images combined with a targeted approach to only apply OCR to images with a high likelihood of burned-in text. The method was validated on a dataset of 415,182 images across ten modalities representative of the de-identification requests submitted at our institution over a 20-year span. Of the 12,578 images in this dataset with burned-in text of any kind, only 10 passed undetected with the method. OCR was only required for 6050 images (1.5% of the dataset).
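The triage idea in this abstract — run the expensive OCR model only on images whose metadata suggests burned-in text — can be sketched as follows. The modality list and decision criteria here are illustrative assumptions, not the authors' published method; only the `BurnedInAnnotation` tag and the secondary-capture SOP class are standard DICOM concepts.

```python
# Hypothetical OCR triage: route an image to the slow OCR model only when
# metadata suggests a high likelihood of burned-in text.

# Modalities assumed (for illustration) to frequently carry burned-in text.
HIGH_RISK_MODALITIES = {"US", "OT", "XA", "SC"}

def needs_ocr(metadata: dict) -> bool:
    """Return True if the image should be sent to the (slow) OCR model."""
    # An explicit DICOM flag is the strongest signal.
    if metadata.get("BurnedInAnnotation", "").upper() == "YES":
        return True
    if metadata.get("Modality") in HIGH_RISK_MODALITIES:
        return True
    # Secondary captures are screenshots/exports and often contain text.
    return "Secondary Capture" in metadata.get("SOPClassDescription", "")

images = [
    {"Modality": "MR", "BurnedInAnnotation": "NO"},
    {"Modality": "US"},
    {"Modality": "CT", "SOPClassDescription": "Secondary Capture Image Storage"},
]
flagged = [m for m in images if needs_ocr(m)]
print(len(flagged))  # 2: only the US and secondary-capture images go to OCR
```

With a filter of this kind, OCR cost scales with the small flagged subset (1.5% of the dataset in the study above) rather than the full archive.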
2. High Accuracy Open-Source Clinical Data De-Identification: The CliniDeID Solution. Stud Health Technol Inform 2024; 310:1370-1371. PMID: 38270048; DOI: 10.3233/shti231199.
Abstract
Clinical data de-identification offers patient data privacy protection and eases reuse of clinical data. As an open-source solution to de-identify unstructured clinical text with high accuracy, CliniDeID applies an ensemble method combining deep and shallow machine learning with rule-based algorithms. It reached high recall and precision when recently evaluated with a selection of clinical text corpora.
3. Sensitive Data Detection with High-Throughput Machine Learning Models in Electrical Health Records. AMIA Annu Symp Proc 2024; 2023:814-823. PMID: 38222389; PMCID: PMC10785837.
Abstract
In the era of big data, there is an increasing need for healthcare providers, communities, and researchers to share data and collaborate to improve health outcomes, generate valuable insights, and advance research. The Health Insurance Portability and Accountability Act of 1996 (HIPAA) is a federal law designed to protect sensitive health information by defining regulations for protected health information (PHI). However, it does not provide efficient tools for detecting or removing PHI before data sharing. One of the challenges in this area of research is the heterogeneous nature of PHI fields in data across different parties. This variability makes rule-based sensitive variable identification systems that work on one database fail on another. To address this issue, our paper explores the use of machine learning algorithms to identify sensitive variables in structured data, thus facilitating the de-identification process. We made a key observation that the distributions of metadata of PHI fields and non-PHI fields are very different. Based on this novel finding, we engineered over 30 features from the metadata of the original features and used machine learning to build classification models to automatically identify PHI fields in structured Electronic Health Record (EHR) data. We trained the model on a variety of large EHR databases from different data sources and found that our algorithm achieves 99% accuracy when detecting PHI-related fields for unseen datasets. The implications of our study are significant and can benefit industries that handle sensitive data.
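The key observation above — that PHI and non-PHI fields differ sharply in their metadata distributions — can be illustrated with a toy feature extractor. The three features below (`distinct_ratio`, `avg_len`, `digit_frac`) are hypothetical examples, not the authors' engineered set of 30+ features:

```python
# Toy metadata features over a column's values: identifier-like columns tend
# to be near-unique, longer, and digit-heavy compared with clinical codes.
def column_features(values):
    n = len(values)
    total_chars = sum(len(str(v)) for v in values)
    return {
        "distinct_ratio": len(set(values)) / n,   # identifiers are near-unique
        "avg_len": total_chars / n,
        "digit_frac": sum(c.isdigit() for v in values for c in str(v))
                      / max(1, total_chars),
    }

ssn_like = ["123-45-6789", "987-65-4321", "111-22-3333"]   # PHI-like column
dx_codes = ["E11.9", "I10", "E11.9"]                       # non-PHI column

print(column_features(ssn_like)["distinct_ratio"])           # 1.0
print(column_features(dx_codes)["distinct_ratio"] < 1.0)     # True
```

In the study, features of this general kind feed a trained classifier; the point of the sketch is only that such features separate the two field types without inspecting field names.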
4. OBIA: An Open Biomedical Imaging Archive. Genomics Proteomics Bioinformatics 2023; 21:1059-1065. PMID: 37806555; PMCID: PMC10928373; DOI: 10.1016/j.gpb.2023.09.003.
Abstract
With the development of artificial intelligence (AI) technologies, biomedical imaging data play an important role in scientific research and clinical application, but the available resources are limited. Here we present Open Biomedical Imaging Archive (OBIA), a repository for archiving biomedical imaging and related clinical data. OBIA adopts five data objects (Collection, Individual, Study, Series, and Image) for data organization, and accepts the submission of biomedical images of multiple modalities, organs, and diseases. In order to protect personal privacy, OBIA has formulated a unified de-identification and quality control process. In addition, OBIA provides friendly and intuitive web interfaces for data submission, browsing, and retrieval, as well as image retrieval. As of September 2023, OBIA has housed data for a total of 937 individuals, 4136 studies, 24,701 series, and 1,938,309 images covering 9 modalities and 30 anatomical sites. Collectively, OBIA provides a reliable platform for biomedical imaging data management and offers free open access to all publicly available data to support research activities throughout the world. OBIA can be accessed at https://ngdc.cncb.ac.cn/obia.
5. Effects of de-facing software mri_reface on utility of imaging biomarkers used in Alzheimer's disease research. Neuroimage Clin 2023; 40:103507. PMID: 37703605; PMCID: PMC10502400; DOI: 10.1016/j.nicl.2023.103507.
Abstract
Brain imaging research studies increasingly use "de-facing" software to remove or replace facial imagery before public data sharing. Several works have studied the effects of de-facing software on brain imaging biomarkers by directly comparing automated measurements from unmodified vs. de-faced images, but most research brain images are used in analyses of correlations with cognitive measurements or clinical statuses, and the effects of de-facing on these imaging-to-cognition correlations have not been measured. In this work, we focused on brain imaging amyloid (A), tau (T), neurodegeneration (N), and vascular (V) measures used in Alzheimer's disease (AD) research. We created a retrospective sample of participants from three age- and sex-matched clinical groups (cognitively unimpaired, mild cognitive impairment, and AD dementia), and we performed region- and voxel-wise analyses of hippocampal volume (N), white matter hyperintensity volume (V), amyloid PET (A), and tau PET (T) measures, each from multiple software pipelines, on their ability to separate cognitively defined groups and their degrees of correlation with age and Clinical Dementia Rating-Sum of Boxes (CDR-SB). We performed each of these analyses twice: once with unmodified images and once with images de-faced with the leading de-facing software mri_reface, and we directly compared the findings and their statistical strengths between the original and de-faced images. Analyses with original and de-faced images had very high agreement. There were no significant differences in any voxel-wise comparisons. Among region-wise comparisons, only three of 55 correlations differed significantly between original and de-faced images, and these were not significant after correction for multiple comparisons. Overall, the statistical power of the imaging data for AD biomarkers was almost identical between unmodified and de-faced images, and their analysis results were extremely consistent.
6. A face-off of MRI research sequences by their need for de-facing. Neuroimage 2023; 276:120199. PMID: 37269958; PMCID: PMC10389782; DOI: 10.1016/j.neuroimage.2023.120199.
Abstract
It is now widely known that research brain MRI, CT, and PET images may potentially be re-identified using face recognition, and this potential can be reduced by applying face de-identification ("de-facing") software. However, for research MRI sequences beyond T1-weighted (T1-w) and T2-FLAIR structural images, the potential for re-identification and the quantitative effects of de-facing are both unknown, and the effects of de-facing T2-FLAIR are also unknown. In this work we examine these questions (where applicable) for T1-w, T2-w, T2*-w, T2-FLAIR, diffusion MRI (dMRI), functional MRI (fMRI), and arterial spin labelling (ASL) sequences. Among current-generation, vendor-product research-grade sequences, we found that 3D T1-w, T2-w, and T2-FLAIR were highly re-identifiable (96-98%). 2D T2-FLAIR and 3D multi-echo GRE (ME-GRE) were moderately re-identifiable (44-45%), and our derived T2* from ME-GRE (comparable to a typical 2D T2*) matched at only 10%. Finally, diffusion, functional, and ASL images were each minimally re-identifiable (0-8%). Applying de-facing with mri_reface version 0.3 reduced successful re-identification to ≤8%, while differential effects on popular quantitative pipelines for cortical volumes and thickness, white matter hyperintensities (WMH), and quantitative susceptibility mapping (QSM) measurements were all either comparable with or smaller than scan-rescan estimates. Consequently, high-quality de-facing software can greatly reduce the risk of re-identification for identifiable MRI sequences with only negligible effects on automated intracranial measurements. The current-generation echo-planar and spiral sequences (dMRI, fMRI, and ASL) each had minimal match rates, suggesting that they have a low risk of re-identification and can be shared without de-facing, but this conclusion should be re-evaluated if they are acquired without fat suppression or with full-face scan coverage, or if newer developments reduce the current levels of artifacts and distortion around the face.
7. De-Identification Technique with Facial Deformation in Head CT Images. Neuroinformatics 2023; 21:575-587. PMID: 37226013; PMCID: PMC10406725; DOI: 10.1007/s12021-023-09631-9.
Abstract
Head CT, which includes the facial region, can visualize faces using 3D reconstruction, raising concern that individuals may be identified. We developed a new de-identification technique that distorts the faces in head CT images. Head CT images to be distorted were labeled "original images" and the others "reference images." Reconstructed face models of both were created, with 400 control points on each facial surface. All voxel positions in the original image were then moved according to the deformation vectors required to bring each control point to its corresponding control point on the reference image. Three face detection and identification programs were used to determine face detection rates and match confidence scores. Intracranial volume equivalence tests were performed before and after deformation, and correlation coefficients between intracranial pixel value histograms were calculated. The output accuracy of a deep learning model for intracranial segmentation was determined using the Dice similarity coefficient before and after deformation. The face detection rate was 100%, and match confidence scores were < 90. Equivalence testing revealed that intracranial volume was statistically equivalent before and after deformation. The median correlation coefficient between intracranial pixel value histograms before and after deformation was 0.9965, indicating high similarity. Dice similarity coefficient values of original and deformed images were statistically equivalent. We thus developed a technique that de-identifies head CT images while maintaining the accuracy of deep learning models: images are deformed to prevent face identification, with minimal changes to the original information.
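The deformation step described above can be sketched roughly as follows. The abstract specifies only that voxels move according to the vectors taking control points to their counterparts on the reference face; the inverse-distance weighting used here to spread those vectors across voxels is an illustrative assumption, not the authors' method:

```python
# Sketch: displacement at each control point is (reference - original);
# a voxel is shifted by a distance-weighted blend of these displacements.
def control_displacements(original_pts, reference_pts):
    return [tuple(r - o for o, r in zip(op, rp))
            for op, rp in zip(original_pts, reference_pts)]

def deform_voxel(voxel, original_pts, displacements, eps=1e-6):
    # Inverse-distance weighting over control points (illustrative choice).
    weights = [1.0 / (sum((v - c) ** 2 for v, c in zip(voxel, cp)) + eps)
               for cp in original_pts]
    total = sum(weights)
    shift = [sum(w * d[i] for w, d in zip(weights, displacements)) / total
             for i in range(3)]
    return tuple(v + s for v, s in zip(voxel, shift))

orig = [(10.0, 0.0, 0.0), (0.0, 10.0, 0.0)]   # points on the original face
ref  = [(12.0, 0.0, 0.0), (0.0, 9.0, 0.0)]    # corresponding reference points
disp = control_displacements(orig, ref)
print(disp)  # [(2.0, 0.0, 0.0), (0.0, -1.0, 0.0)]
moved = deform_voxel((10.0, 0.0, 0.0), orig, disp)  # lands near (12, 0, 0)
```

A voxel sitting on a control point inherits (almost exactly) that point's displacement, while voxels between control points receive a smooth blend, which is the qualitative behavior the paper relies on.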
8. De-identified Bayesian personal identity matching for privacy-preserving record linkage despite errors: development and validation. BMC Med Inform Decis Mak 2023; 23:85. PMID: 37147600; PMCID: PMC10163749; DOI: 10.1186/s12911-023-02176-6.
Abstract
BACKGROUND Epidemiological research may require linkage of information from multiple organizations. This brings two problems: (1) the information governance desirability of linkage without sharing direct identifiers, and (2) a requirement to link databases without a common person-unique identifier. METHODS We developed a Bayesian matching technique to solve both. We provide an open-source software implementation capable of de-identified probabilistic matching despite discrepancies (via fuzzy representations) and complete mismatches, plus de-identified deterministic matching if required. We validated the technique by testing linkage between multiple medical records systems in a UK National Health Service Trust, examining the effects of decision thresholds on linkage accuracy, and we report demographic factors associated with correct linkage. RESULTS The system supports dates of birth (DOBs), forenames, surnames, three-state gender, and UK postcodes. Fuzzy representations are supported for all except gender, and there is support for additional transformations such as accent misrepresentation, variation for multi-part surnames, and name re-ordering. Calculated log odds predicted a proband's presence in the sample database with an area under the receiver operating characteristic curve of 0.997-0.999 for non-self database comparisons. Log odds were converted to a decision via a consideration threshold θ and a leader advantage threshold δ; defaults were chosen to penalize misidentification 20-fold versus linkage failure. By default, complete DOB mismatches were disallowed for computational efficiency. At these settings, for non-self database comparisons, the mean probability of a proband being correctly declared to be in the sample was 0.965 (range 0.931-0.994), and the misidentification rate was 0.00249 (range 0.00123-0.00429). Correct linkage was positively associated with male gender, Black or mixed ethnicity, and the presence of diagnostic codes for severe mental illnesses or other mental disorders, and negatively associated with birth year, unknown ethnicity, residential area deprivation, and the presence of a pseudopostcode (e.g. indicating homelessness). Accuracy would improve further if person-unique identifiers were also used, as the software supports. Our two largest databases were linked in 44 min via an interpreted programming language. CONCLUSIONS Fully de-identified matching with high accuracy is feasible without a person-unique identifier, and appropriate software is freely available.
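The decision rule described in the abstract — log odds accumulated per field, then compared against a consideration threshold θ and a leader advantage threshold δ — can be sketched as below. The per-field log-likelihood-ratio weights and threshold values are invented for illustration, not the paper's calibrated values:

```python
import math

# Hypothetical per-field log-likelihood ratios for an exact match.
FIELD_LLR = {"dob": math.log(1000), "surname": math.log(50),
             "forename": math.log(10), "postcode": math.log(200)}

def log_odds(proband, candidate):
    # Sum the evidence contributed by each exactly-matching field.
    return sum(llr for field, llr in FIELD_LLR.items()
               if proband.get(field) == candidate.get(field))

def decide(proband, candidates, theta=10.0, delta=5.0):
    scored = sorted((log_odds(proband, c), i) for i, c in enumerate(candidates))
    best, best_i = scored[-1]
    runner_up = scored[-2][0] if len(scored) > 1 else float("-inf")
    # Match only if the leader clears theta AND beats the runner-up by delta.
    if best >= theta and best - runner_up >= delta:
        return best_i
    return None  # linkage failure is preferred over misidentification

proband = {"dob": "1980-01-02", "surname": "SMITH", "forename": "ANN",
           "postcode": "CB2 0QQ"}
candidates = [
    {"dob": "1980-01-02", "surname": "SMITH", "forename": "ANN", "postcode": "CB2 0QQ"},
    {"dob": "1975-06-30", "surname": "SMITH", "forename": "JO", "postcode": "CB1 1AA"},
]
print(decide(proband, candidates))  # 0
```

The δ check is what makes the rule abstain when two candidates score similarly, which is how the paper trades linkage failures for a low misidentification rate.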
9. An evaluation of existing text de-identification tools for use with patient progress notes from Australian general practice. Int J Med Inform 2023; 173:105021. PMID: 36870249; DOI: 10.1016/j.ijmedinf.2023.105021.
Abstract
INTRODUCTION Digitized patient progress notes from general practice represent a significant resource for clinical and public health research but cannot feasibly and ethically be used for these purposes without automated de-identification. Internationally, several open-source natural language processing tools have been developed; however, given wide variations in clinical documentation practices, these cannot be utilized without appropriate review. We evaluated the performance of four de-identification tools and assessed their suitability for customization to Australian general practice progress notes. METHODS Four tools were selected: three rule-based (HMS Scrubber, MIT De-id, Philter) and one machine learning (MIST). 300 patient progress notes from three general practice clinics were manually annotated with personally identifying information. We conducted a pairwise comparison between the manual annotations and the patient identifiers automatically detected by each tool, measuring recall (sensitivity), precision (positive predictive value), F1-score (the harmonic mean of precision and recall), and F2-score (which weights recall twice as heavily as precision). An error analysis was also conducted to better understand each tool's structure and performance. RESULTS Manual annotation detected 701 identifiers in seven categories. The rule-based tools detected identifiers in six categories and MIST in three. Philter achieved the highest aggregate recall (67%) and the highest recall for NAME (87%). HMS Scrubber achieved the highest recall for DATE (94%), and all tools performed poorly on LOCATION. MIST achieved the highest precision for NAME and DATE while also achieving recall similar to the rule-based tools for DATE and the highest recall for LOCATION. Philter had the lowest aggregate precision (37%); however, preliminary adjustments to its rules and dictionaries yielded a substantial reduction in false positives. CONCLUSION Existing off-the-shelf solutions for automated de-identification of clinical text are not immediately suitable for our context without modification. Philter is the most promising candidate due to its high recall and flexibility; however, it will require extensive revision of its pattern-matching rules and dictionaries.
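For reference, the F-scores used in the evaluation above follow the general F-beta formula (F1 with beta = 1, F2 with beta = 2). A minimal sketch with toy counts:

```python
# F-beta from true-positive, false-positive, and false-negative counts.
# beta = 2 weights recall twice as heavily as precision.
def f_beta(tp, fp, fn, beta=1.0):
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)

# Toy counts: 80 identifiers found, 20 missed, 40 false alarms.
tp, fp, fn = 80, 40, 20
print(round(tp / (tp + fn), 2))              # recall: 0.8
print(round(f_beta(tp, fp, fn), 3))          # F1: 0.727
print(round(f_beta(tp, fp, fn, beta=2), 3))  # F2: 0.769
```

Because recall exceeds precision in this toy example, F2 comes out higher than F1 — which is why F2 is the more forgiving metric for a high-recall, low-precision tool like Philter.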
10
|
[Protecting the rights and freedoms of individuals with regard to health data processing: the risk approach of the EU General Data Protection Regulation (GDPR)]. Bundesgesundheitsblatt Gesundheitsforschung Gesundheitsschutz 2023; 66:143-153. [PMID: 36648500 PMCID: PMC9844932 DOI: 10.1007/s00103-022-03652-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/08/2022] [Accepted: 12/19/2022] [Indexed: 01/18/2023]
Abstract
Merging sensitive data and tracing their analysis results back to the data subjects is an essential part of data processing in the health sector. This challenges the protection of the data and thus its very purpose, the protection of the data subjects, since scientific and health findings are often based on certain characteristics in the datasets, which should be preserved as personal data in order to make the results of the data analysis fruitful. The EU General Data Protection Regulation (GDPR) establishes a risk-based approach that determines both the identifiability of data and the proportionality of their processing. This paper analyses how the risk-based approach opens the scope of the GDPR and relates it to the risks for the rights and freedoms of data subjects posed by the processing of personal data. Furthermore, it explores to what extent the risk-based approach of the GDPR influences the rules for international data transfer and how international data processing in the health sector is currently organised on this basis. Overall, the analysis sheds light on how the technical measures of data processing and the organisational measures for handling them can contribute to maintaining the proportionality of data processing under the GDPR, which can essentially be determined on a risk basis, while taking into account the specificity of data processing in the health sector.
11. Ensemble Approaches to Recognize Protected Health Information in Radiology Reports. J Digit Imaging 2022; 35:1694-1698. PMID: 35715655; PMCID: PMC9712864; DOI: 10.1007/s10278-022-00673-0.
Abstract
Natural language processing (NLP) techniques for electronic health records (EHRs) have shown great potential to improve the quality of medical care. The text of radiology reports frequently constitutes a large fraction of EHR data and can provide valuable information about patients' diagnoses, medical history, and imaging findings. The lack of a major public repository for radiology reports severely limits the development, testing, and application of new NLP tools. De-identification of protected health information (PHI) presents a major challenge to building such repositories, as many automated de-identification tools were trained or designed for clinical notes and do not perform well enough to build a public database of radiology reports. We developed and evaluated six ensemble models based on three publicly available de-identification tools: MIT de-id, NeuroNER, and Philter. A set of 1023 reports was set aside as the testing partition. Two individuals with medical training annotated the test set for PHI; differences were resolved by consensus. Ensemble methods included simple voting schemes (1-Vote, 2-Votes, and 3-Votes), a decision tree, a naïve Bayesian classifier, and Adaboost boosting. The 1-Vote ensemble achieved recall of 998/1043 (95.7%); the 3-Votes ensemble had precision of 1035/1043 (99.2%). F1 scores were 93.4% for the decision tree, 71.2% for the naïve Bayesian classifier, and 87.5% for the boosting method. Basic voting algorithms and machine learning classifiers incorporating the predictions of multiple tools can outperform each tool acting alone in de-identifying radiology reports. Ensemble methods hold substantial potential to improve automated de-identification tools for radiology reports, making such reports more available for research use to improve patient care and outcomes.
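The k-of-n voting schemes above (1-Vote, 2-Votes, 3-Votes) can be sketched at the token level as below; the three tools' flagged-token sets are made-up examples. Lowering k raises recall (union-like behavior), while raising k raises precision (intersection-like behavior), matching the 1-Vote/3-Votes trade-off the study reports:

```python
# k-of-n voting: a token position is flagged as PHI if at least k of the
# n tools flag it.
def vote_ensemble(predictions, k):
    """predictions: list of sets of flagged token positions, one per tool."""
    counts = {}
    for flagged in predictions:
        for tok in flagged:
            counts[tok] = counts.get(tok, 0) + 1
    return {tok for tok, c in counts.items() if c >= k}

mit_deid = {0, 4, 7}        # hypothetical outputs of the three tools
neuroner = {0, 4, 9}
philter  = {0, 4, 7, 9, 12}

print(sorted(vote_ensemble([mit_deid, neuroner, philter], k=1)))  # [0, 4, 7, 9, 12]
print(sorted(vote_ensemble([mit_deid, neuroner, philter], k=3)))  # [0, 4]
```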
12. Perceived Risk of Re-Identification in OMOP-CDM Database: A Cross-Sectional Survey. J Korean Med Sci 2022; 37:e205. PMID: 35790207; PMCID: PMC9259248; DOI: 10.3346/jkms.2022.37.e205.
Abstract
BACKGROUND The advancement of information technology has immensely increased the quality and volume of health data. This has led to an increase in observational studies, as well as to the threat of privacy invasion. Recently, a distributed research network based on the common data model (CDM) has emerged, enabling collaborative international medical research without sharing patient-level data. Although each institution's CDM database is built inside a firewall, the risk of re-identification requires management. Hence, this study aims to elucidate the perceptions CDM users have of CDM and of risk management for re-identification. METHODS The survey, designed to answer specific in-depth questions on CDM, was conducted from October to November 2020. We targeted experienced researchers who actively use CDM. Basic statistics (total number and percent) were computed for all covariates. RESULTS There were 33 valid respondents. Of these, 43.8% suggested that no additional anonymization was necessary beyond the "minimum cell count" policy, which obscures any cell with a value lower than a certain number (usually 5) in shared results, to minimize the liability of re-identification due to rare conditions. During extract-transform-load processes, 81.8% of respondents assumed structured data are protected from the risk of re-identification. However, respondents noted that dates of birth and death are highly re-identifiable information. The majority of respondents (n = 22, 66.7%) conceded the possibility of identifier-containing unstructured data in the NOTE table. CONCLUSION Overall, CDM users generally attributed high reliability for privacy protection to the intrinsic nature of CDM, and there was little demand for additional de-identification methods. However, unstructured data in the CDM were suspected to carry risks. The necessity of a coordinating consortium to define and manage the re-identification risk of CDM was urged.
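The "minimum cell count" policy mentioned by respondents can be sketched as a simple suppression pass over an aggregate results table; the threshold and mask label below are illustrative choices:

```python
# Suppress any aggregate cell whose count falls below the threshold
# (commonly 5) before results are shared.
def apply_min_cell_count(table, threshold=5, mask="suppressed"):
    return {group: (count if count >= threshold else mask)
            for group, count in table.items()}

counts = {"diabetes": 120, "rare_condition": 3, "hypertension": 98}
print(apply_min_cell_count(counts))
# {'diabetes': 120, 'rare_condition': 'suppressed', 'hypertension': 98}
```

Only aggregate counts leave the firewall in a CDM network, so this one pass is the last line of defense against re-identification via rare conditions.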
13. Face recognition from research brain PET: An unexpected PET problem. Neuroimage 2022; 258:119357. PMID: 35660089; PMCID: PMC9358410; DOI: 10.1016/j.neuroimage.2022.119357.
Abstract
It is well known that de-identified research brain images from MRI and CT can potentially be re-identified using face recognition; however, this has not been examined for PET images. We generated face reconstruction images of 182 volunteers using amyloid, tau, and FDG PET scans, and we measured how accurately commercial face recognition software (Microsoft Azure's Face API) automatically matched them with the individual participants' face photographs. We then compared this accuracy with the same experiments using the participants' CT and MRI. Face reconstructions from PET images from PET/CT scanners were correctly matched at rates of 42% (FDG), 35% (tau), and 32% (amyloid), while CT images were matched at 78% and MRI at 97-98%. We propose that these recognition rates are high enough that research studies should consider using face de-identification ("de-facing") software on PET images, in addition to CT and structural MRI, before data sharing. We also updated our mri_reface de-identification software with extended functionality to replace face imagery in PET and CT images. Rates of face recognition on de-faced images were reduced to 0-4% for PET, 5% for CT, and 8% for MRI. We measured the effects of de-facing on regional amyloid PET measurements from two different measurement pipelines (PETSurfer/FreeSurfer 6.0 and an in-house method based on SPM12 and ANTs), and these effects were small: ICC values between de-faced and original images were > 0.98, biases were < 2%, and median relative errors were < 2%. Effects on global amyloid PET SUVR measurements were even smaller: ICC values were 1.00, biases were < 0.5%, and median relative errors were also < 0.5%.
14. Impact Analysis of De-Identification in Clinical Notes Classification. Stud Health Technol Inform 2022; 293:189-196. PMID: 35592981; DOI: 10.3233/shti220368.
Abstract
BACKGROUND Clinical notes provide valuable data in telemonitoring systems for disease management. Such data must be converted into structured information to be useful in automated analysis; one way to achieve this is classification (e.g. into categories). However, to conform with privacy regulations and concerns, text is usually de-identified. OBJECTIVES This study investigated the effects of de-identification on classification. METHODS Two pseudonymisation and two classification algorithms were applied to clinical messages from a telehealth system, and divergence from clear-text classification was measured. RESULTS Overall, de-identification notably altered classification. The more delicate classification algorithm was severely impacted, with especially noticeable losses of sensitivity. The simpler classification method was more robust, however, and in combination with a more lenient pseudonymisation technique it showed only a negligible impact on classification. CONCLUSION The results indicate that de-identification can impact text classification and suggest that accounting for de-identification during development of classification methods could be beneficial.
15. DeIDNER Model: A Neural Network Named Entity Recognition Model for Use in the De-identification of Clinical Notes. Biomedical Engineering Systems and Technologies, International Joint Conference, BIOSTEC ... Revised Selected Papers 2022; 5:640-647. PMID: 35386186; PMCID: PMC8981408; DOI: 10.5220/0010884500003123.
Abstract
Clinical named entity recognition (NER) is an essential building block for many downstream natural language processing (NLP) applications such as information extraction and de-identification. Recently, deep learning (DL) methods that utilize word embeddings have become popular in clinical NLP tasks. However, there has been little work on evaluating and combining word embeddings trained on different domains. The goal of this study is to improve the performance of NER in clinical discharge summaries by developing a DL model that combines different embeddings, and to investigate the combination of standard and contextual embeddings from the general and clinical domains. We developed (1) a high-quality, human-annotated internal corpus of discharge summaries and (2) a NER model with an input embedding layer that combines different embeddings: standard word embeddings, context-based word embeddings, a character-level word embedding using a convolutional neural network (CNN), and external knowledge sources, along with word features as one-hot vectors. The embedding layer was followed by bidirectional long short-term memory (Bi-LSTM) and conditional random field (CRF) layers. The proposed model matches or exceeds state-of-the-art performance on two publicly available data sets and achieves an F1 score of 94.31% on the internal corpus. After incorporating mixed-domain, clinically pre-trained contextual embeddings, the F1 score further improved to 95.36% on the internal corpus. This study demonstrates an efficient way of combining different embeddings that improves recognition performance, aiding the downstream de-identification of clinical notes.
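The combined input layer described above — concatenating several embedding sources plus one-hot word features per token — can be illustrated with a toy sketch. The 2-dimensional embeddings and feature names are invented; a real model would use pretrained vectors and feed the result into the Bi-LSTM-CRF layers:

```python
# Toy illustration of a combined embedding input layer: each token's final
# representation is the concatenation of several embedding lookups plus
# one-hot word features.
def combine_embeddings(token, sources, one_hot_features, dim=2):
    vec = []
    for lookup in sources:                       # e.g. general-domain and
        vec.extend(lookup.get(token, [0.0] * dim))  # clinical-domain embeddings
    vec.extend(one_hot_features)                 # e.g. [is_capitalized, is_digit]
    return vec

general  = {"aspirin": [0.1, 0.2]}   # invented 2-d embedding tables
clinical = {"aspirin": [0.9, 0.8]}

v = combine_embeddings("aspirin", [general, clinical], [1.0, 0.0])
print(len(v), v)  # 6 [0.1, 0.2, 0.9, 0.8, 1.0, 0.0]
```

Out-of-vocabulary tokens fall back to zero vectors here; the concatenation keeps each source's signal in its own slice of the input, which is what lets the downstream layers weight domains independently.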
16. Data Pseudonymization in a Range That Does Not Affect Data Quality: Correlation with the Degree of Participation of Clinicians. J Korean Med Sci 2021; 36:e299. PMID: 34783216; PMCID: PMC8593412; DOI: 10.3346/jkms.2021.36.e299.
Abstract
Personal medical information is an essential resource for research; however, laws regulate its use, and it typically has to be pseudonymized or anonymized. When data are anonymized, the quantity and quality of extractable information decrease significantly. From the perspective of a clinical researcher, a method of achieving pseudonymized data without degrading data quality or incurring data loss is proposed herein. As the level of pseudonymization varies according to the research purpose, the pseudonymization method applied should be carefully chosen; the active participation of clinicians is therefore crucial to transforming the data according to the research purpose. This can contribute to data security simply by transforming the data through secondary processing. Case studies demonstrated that, compared with the initial baseline data, there was a clinically significant increase in the number of data points when a clinician participated (from 267,979 to 280,127 points, P < 0.001). Thus, depending on the degree of clinician participation, pseudonymization need not affect data quality or quantity, and proper data quality management along with data security is emphasized. Although the pseudonymization level and the clinical usability of the data have a trade-off relationship, it is possible to create pseudonymized data while maintaining the quality required for a given research purpose. Therefore, rather than relying solely on security guidelines, the active participation of clinicians is important.
|
17
|
Research Goal-Driven Data Model and Harmonization for De-Identifying Patient Data in Radiomics. J Digit Imaging 2021; 34:986-1004. [PMID: 34241789 DOI: 10.1007/s10278-021-00476-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2020] [Revised: 05/22/2021] [Accepted: 06/09/2021] [Indexed: 10/20/2022] Open
Abstract
There are various efforts to de-identify patients' radiation oncology data for use in advancing medical research. Although the task of de-identification must be defined in the context of research goals and objectives, existing systems lack the flexibility in data modeling and the normalization of attribute names needed to accomplish this. In this work, we describe a de-identification process for radiation and clinical oncology data that is guided by a data model and a schema for dynamically capturing domain ontology and normalizing terminologies, defined in line with the research goals in this area. The radiological images are obtained in DICOM format and consist of diagnostic, radiation therapy (RT) treatment planning, RT verification, and RT response images. During DICOM de-identification, a few crucial pieces of information about the dataset are recorded. The proposed model is generic in organizing information modeling in sync with the de-identification of a patient's clinical information. The treatment and clinical data are provided in comma-separated values (CSV) format, following a predefined data structure. The de-identified data is harmonized throughout the entire process. We present four case studies on four different types of cancer, namely glioblastoma multiforme, head and neck, breast, and lung, together with experimental validation on patient data in these four areas. Several aspects are addressed during de-identification, such as preservation of longitudinal date changes (LDC), incremental de-identification, referential integrity between the clinical and image data, de-identified data harmonization, and transformation of the data to an underlying database schema.
|
18
|
DeIDNER Corpus: Annotation of Clinical Discharge Summary Notes for Named Entity Recognition Using BRAT Tool. Stud Health Technol Inform 2021; 281:432-436. [PMID: 34042780 PMCID: PMC9019788 DOI: 10.3233/shti210195] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/17/2023]
Abstract
Named Entity Recognition (NER), which aims to identify and classify entities into predefined categories, is a critical pre-processing task in the Natural Language Processing (NLP) pipeline. Readily available off-the-shelf NER algorithms or programs are trained on a general corpus and often need to be retrained when applied to a different domain. The end model's performance depends on the quality of the named entities generated by the NER models used in the NLP task. To improve NER model accuracy, researchers build domain-specific corpora for both model training and evaluation. However, in the clinical domain there is a dearth of training data for privacy reasons, forcing many studies to use NER models trained on non-clinical text to generate the NER feature set, which in turn affects the performance of downstream NLP tasks such as information extraction and de-identification. In this paper, our objective is to create a high-quality annotated clinical corpus for training NER models that generalize easily and can be used in a downstream de-identification task to generate a named-entity feature set.
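BRAT stores its annotations in a standoff format alongside the raw text, so a corpus like this one is typically consumed by reading `.ann` files. A minimal parser for simple continuous-span entity lines, written as an illustration rather than taken from the paper (real BRAT files may also contain discontinuous spans like `0 5;10 15`, which this sketch ignores):

```python
def parse_brat_entity(line):
    """Parse one BRAT standoff entity line: ID <TAB> TYPE START END <TAB> TEXT."""
    ann_id, type_span, text = line.rstrip("\n").split("\t")
    etype, start, end = type_span.split(" ")
    return {"id": ann_id, "type": etype,
            "start": int(start), "end": int(end), "text": text}

# Example line as it would appear in a BRAT .ann file (contents hypothetical).
line = "T1\tPERSON 18 28\tJohn Smith"
print(parse_brat_entity(line))
# {'id': 'T1', 'type': 'PERSON', 'start': 18, 'end': 28, 'text': 'John Smith'}
```

The character offsets index into the companion `.txt` file, which is what lets the annotations be checked against, and projected onto, the original note.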
|
19
|
API Driven On-Demand Participant ID Pseudonymization in Heterogeneous Multi-Study Research. Healthc Inform Res 2021; 27:39-47. [PMID: 33611875 PMCID: PMC7921568 DOI: 10.4258/hir.2021.27.1.39] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/17/2020] [Revised: 09/23/2020] [Accepted: 10/18/2020] [Indexed: 11/29/2022] Open
Abstract
OBJECTIVES To facilitate clinical and translational research, imaging and non-imaging clinical data from multiple disparate systems must be aggregated for analysis. Study participant records from various sources are linked together, and to patient records when possible, to address research questions while ensuring patient privacy. This paper presents a novel tool that pseudonymizes participant identifiers (PIDs) using a researcher-driven automated process that takes advantage of an application programming interface (API) and the Perl Open-Source Digital Imaging and Communications in Medicine Archive (POSDA) to further de-identify PIDs. The tool, on-demand cohort and API participant identifier pseudonymization (O-CAPP), employs a pseudonymization method based on the type of incoming research data. METHODS For images, PIDs are pseudonymized via API calls that receive the PIDs present in Digital Imaging and Communications in Medicine (DICOM) headers and return the pseudonymized identifiers. For non-imaging clinical research data, PIDs provided by study principal investigators (PIs) are pseudonymized by a nightly automated process. The pseudonymized PIDs (P-PIDs), along with other protected health information, are further de-identified using POSDA. RESULTS A sample of 250 PIDs pseudonymized by O-CAPP was selected and successfully validated. Of those, 125 PIDs pseudonymized by the nightly automated process were validated by multiple clinical trial investigators (CTIs). For the other 125, CTIs validated radiologic image pseudonymization by API request based on the provided PID and P-PID mappings. CONCLUSIONS We developed a novel on-demand pseudonymization process that will aid researchers in obtaining a comprehensive and holistic view of study participant data without compromising patient privacy.
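The core of such a scheme can be sketched as a keyed, deterministic mapping from PID to P-PID. This is an illustrative sketch only; the paper does not specify O-CAPP's internals, and the secret key and digest truncation below are assumptions:

```python
import hashlib
import hmac

SECRET_KEY = b"replace-with-a-managed-secret"  # hypothetical key, not from the paper

def pseudonymize_pid(pid: str) -> str:
    """Return a deterministic pseudonym (P-PID) for a participant identifier.

    A keyed hash yields the same P-PID for the same PID on every call, so
    records from imaging and non-imaging sources still link up, while the
    original PID cannot be recovered without the key.
    """
    digest = hmac.new(SECRET_KEY, pid.encode("utf-8"), hashlib.sha256)
    return "P-" + digest.hexdigest()[:16]

print(pseudonymize_pid("STUDY01-0042"))
```

Determinism is what allows both the API path (images) and the nightly batch path (clinical data) to produce consistent pseudonyms for the same participant.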
|
20
|
De-identifying free text of Japanese electronic health records. J Biomed Semantics 2020; 11:11. [PMID: 32958039 PMCID: PMC7504663 DOI: 10.1186/s13326-020-00227-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/13/2019] [Accepted: 08/07/2020] [Indexed: 11/25/2022] Open
Abstract
Background Recently, more electronic data sources are becoming available in the healthcare domain. Electronic health records (EHRs), with their vast amounts of potentially available data, can greatly improve healthcare. Although EHR de-identification is necessary to protect personal information, automatic de-identification of Japanese-language EHRs has not been studied sufficiently. This study was conducted to raise de-identification performance for Japanese EHRs through classic machine learning, deep learning, and rule-based methods, depending on the dataset. Results Using three datasets, we implemented de-identification systems for Japanese EHRs and compared the performance of rule-based, Conditional Random Fields (CRF), and Long Short-Term Memory (LSTM)-based methods. Gold-standard de-identification tags were annotated manually for age, hospital, person, sex, and time. We used different combinations of our datasets to train and evaluate the three methods. Our best F1-scores were 84.23, 68.19, and 81.67 points, respectively, on the MedNLP dataset, a dummy EHR dataset written by a medical doctor, and a Pathology Report dataset. The LSTM-based method performed best on all but the MedNLP dataset, for which the rule-based method was best; even there, the LSTM-based method achieved a good score of 83.07 points, only 1.16 points below the best rule-based score. These results suggest that the LSTM adapted well to the different characteristics of our datasets. On the Pathology Report dataset, the LSTM-based method outperformed our CRF-based method by 7.41 F1 points. This is the first study to apply an LSTM-based method to a de-identification task for Japanese EHRs.
Conclusions Our LSTM-based machine learning method extracted named entities to be de-identified with generally better performance than our rule-based methods. However, machine learning methods are inadequate for processing expressions with low occurrence; our future work will examine combining LSTM and rule-based methods to achieve better performance. The level of performance achieved is substantially higher than that of publicly available Japanese de-identification tools, so our system will be applied to actual de-identification tasks in hospitals.
|
21
|
De-Identification of Radiomics Data Retaining Longitudinal Temporal Information. J Med Syst 2020; 44:99. [PMID: 32240368 DOI: 10.1007/s10916-020-01563-0] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2019] [Accepted: 03/17/2020] [Indexed: 11/30/2022]
Abstract
We propose a de-identification system that runs in standalone mode. The system handles the de-identification of radiation oncology patients' clinical and annotated imaging data, including RTSTRUCT, RTPLAN, and RTDOSE. The clinical data consist of the patient's diagnosis, stage, outcome, and treatment information; the imaging data may be diagnostic, therapy planning, or verification images. Archival of longitudinal radiation oncology verification images, such as cone-beam CT scans, along with the initial imaging and clinical data is preserved in the process. During de-identification, the system keeps a reference to the original data identity in encrypted form, which can be used for re-identification if necessary.
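One common way to retain longitudinal temporal information while removing real dates is to shift every date for a patient by the same per-patient offset, so intervals between visits survive de-identification. The sketch below illustrates that general idea under stated assumptions (a salted-hash offset derivation and a 365-day window); it is not the paper's implementation, which additionally keeps an encrypted reference to the original identity:

```python
import hashlib
from datetime import date, timedelta

SALT = b"site-specific-salt"  # hypothetical; a real system would manage this secret

def patient_offset_days(patient_id: str, max_days: int = 365) -> int:
    """Derive a stable per-patient shift (1..max_days) from a salted hash."""
    h = hashlib.sha256(SALT + patient_id.encode("utf-8")).digest()
    return int.from_bytes(h[:4], "big") % max_days + 1

def shift_date(patient_id: str, d: date) -> date:
    """Shift a date by the patient's fixed offset, preserving intervals."""
    return d - timedelta(days=patient_offset_days(patient_id))

planning = date(2019, 3, 1)
cbct = date(2019, 3, 15)  # later cone-beam CT verification scan
s1, s2 = shift_date("PAT-7", planning), shift_date("PAT-7", cbct)
print((s2 - s1).days)  # 14 -- the interval between visits is preserved
```

Because the offset is constant per patient, incremental de-identification of newly arriving images remains consistent with data processed earlier.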
|
22
|
Customization scenarios for de-identification of clinical notes. BMC Med Inform Decis Mak 2020; 20:14. [PMID: 32000770 PMCID: PMC6993314 DOI: 10.1186/s12911-020-1026-2] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/11/2019] [Accepted: 01/14/2020] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Automated machine-learning systems are able to de-identify electronic medical records, including free-text clinical notes. Use of such systems would greatly boost the amount of data available to researchers, yet their deployment has been limited due to uncertainty about their performance when applied to new datasets. OBJECTIVE We present practical options for clinical note de-identification, assessing performance of machine learning systems ranging from off-the-shelf to fully customized. METHODS We implement a state-of-the-art machine learning de-identification system, training and testing on pairs of datasets that match the deployment scenarios. We use clinical notes from two i2b2 competition corpora, the Physionet Gold Standard corpus, and parts of the MIMIC-III dataset. RESULTS Fully customized systems remove 97-99% of personally identifying information. Performance of off-the-shelf systems varies by dataset, with performance mostly above 90%. Providing a small labeled dataset or large unlabeled dataset allows for fine-tuning that improves performance over off-the-shelf systems. CONCLUSION Health organizations should be aware of the levels of customization available when selecting a de-identification deployment solution, in order to choose the one that best matches their resources and target performance level.
|
23
|
A study of deep learning methods for de-identification of clinical notes in cross-institute settings. BMC Med Inform Decis Mak 2019; 19:232. [PMID: 31801524 PMCID: PMC6894104 DOI: 10.1186/s12911-019-0935-4] [Citation(s) in RCA: 26] [Impact Index Per Article: 5.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/23/2022] Open
Abstract
Background De-identification is a critical technology to facilitate the use of unstructured clinical text while protecting patient privacy and confidentiality. The clinical natural language processing (NLP) community has invested great effort in developing methods and corpora for de-identification of clinical notes. These annotated corpora are valuable resources for developing automated systems to de-identify clinical text at local hospitals. However, existing studies have often utilized training and test data collected from the same institution, and few have explored automated de-identification in cross-institute settings. The goal of this study is to examine deep learning-based de-identification methods in a cross-institute setting, identify the bottlenecks, and provide potential solutions. Methods We created a de-identification corpus using a total of 500 clinical notes from University of Florida (UF) Health, developed deep learning-based de-identification models using the 2014 i2b2/UTHealth corpus, and evaluated their performance on the UF corpus. We compared five different word embeddings trained on general English text, clinical text, and biomedical literature, explored lexical and linguistic features, and compared two strategies for customizing the deep learning models using UF notes and resources. Results Pre-trained word embeddings from a general English corpus achieved better performance than embeddings from de-identified clinical text and biomedical literature. The performance of deep learning models trained using only the i2b2 corpus dropped significantly (strict and relaxed F1 scores fell from 0.9547 and 0.9646 to 0.8568 and 0.8958) when applied to another corpus annotated at UF Health. Linguistic features could further improve de-identification performance in cross-institute settings. After customizing the models using UF notes and resources, the best model achieved strict and relaxed F1 scores of 0.9288 and 0.9584, respectively.
Conclusions It is necessary to customize de-identification models using local clinical text and other resources when applying them in cross-institute settings. Fine-tuning is a potential solution for re-using pre-trained parameters and reducing the training time needed to customize deep learning-based de-identification models trained on a clinical corpus from a different institution.
|
24
|
Abstract
Reusable, publicly available data is a pillar of open science and of rapid advancement in cancer imaging research. Sharing data from completed research studies not only saves the research dollars required to collect data, but also helps ensure that studies are both replicable and reproducible. The Cancer Imaging Archive (TCIA) is a global shared repository for imaging data related to cancer. Ensuring the consistency, scientific utility, and anonymity of data stored in TCIA is of utmost importance. As the rate of submission to TCIA has increased, both in volume and in the complexity of the DICOM objects stored, curation of collections has become a bottleneck in the acquisition of data. To increase the rate of curation of image sets, improve the quality of curation, and better track the provenance of changes made to submitted DICOM image sets, a custom set of tools was developed using novel methods for the analysis of DICOM datasets. These tools are written in the programming language Perl, use the open-source database PostgreSQL, make use of the Perl DICOM routines in the open-source package Posda, and incorporate DICOM diagnostic tools from other open-source packages such as dicom3tools. These tools are referred to as the "Posda Tools." The Posda Tools are open source and available via git at https://github.com/UAMS-DBMI/PosdaTools.
In this paper, we briefly describe the Posda Tools and discuss the novel methods they employ to facilitate rapid analysis of DICOM data, including the following: (1) a database schema that is more permissive, and differently normalized, than traditional DICOM databases; (2) automatic integrity checks performed in bulk; (3) bulk revisions to DICOM datasets, applied either through a web-based interface or via command-line executable Perl scripts; (4) tracking of all such edits in a revision tracker, with the ability to roll them back; (5) a UI for inspecting the results of such edits to verify that they are as intended; (6) identification of DICOM Studies, Series, and SOP instances using "nicknames" that are persistent and have well-defined scope, making reported DICOM errors easier to manage; and (7) rapid identification of potential duplicate DICOM datasets by pixel data, which can be used, e.g., to identify submission subjects that may relate to the same individual, without identifying the individual.
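Point (7), duplicate detection by pixel data, can be illustrated with a content hash over the pixel bytes. This sketch is hypothetical: the Posda Tools themselves are written in Perl and operate on actual DICOM Pixel Data elements, whereas here raw byte strings stand in for decoded frames:

```python
import hashlib

def pixel_fingerprint(pixel_bytes: bytes) -> str:
    """Fingerprint pixel data so byte-identical frames can be grouped.

    A real pipeline would read the (7FE0,0010) Pixel Data element with a
    DICOM library; hashing raw bytes is enough to show the grouping idea.
    """
    return hashlib.sha256(pixel_bytes).hexdigest()

# Two submissions with identical pixel data collide on the fingerprint,
# flagging a potential duplicate without identifying the individual.
frame_a = bytes(range(256)) * 4
frame_b = bytes(range(256)) * 4   # same content as frame_a
frame_c = b"\x00" * 1024

index = {}
for name, frame in [("sub1/img1", frame_a), ("sub2/img9", frame_b),
                    ("sub3/img2", frame_c)]:
    index.setdefault(pixel_fingerprint(frame), []).append(name)

duplicates = [names for names in index.values() if len(names) > 1]
print(duplicates)  # [['sub1/img1', 'sub2/img9']]
```

Because only hashes are compared, duplicate candidates can be surfaced without ever exposing identifying content.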
|
25
|
Protecting and Utilizing Health and Medical Big Data: Policy Perspectives from Korea. Healthc Inform Res 2019; 25:239-247. [PMID: 31777667 PMCID: PMC6859269 DOI: 10.4258/hir.2019.25.4.239] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/15/2019] [Revised: 10/06/2019] [Accepted: 10/13/2019] [Indexed: 11/23/2022] Open
Abstract
Objectives We analyzed Korea's data privacy regime in the context of protecting and utilizing health and medical big data and drew policy implications from the analyses. Methods We conducted comparative analyses of the legal and regulatory environments governing health and medical big data with a view to drawing policy implications for Korea. The jurisdictions considered were the European Union, the United Kingdom, France, the United States, and Japan. We reviewed relevant statutory materials as well as various non-statutory materials and guidelines issued by public authorities and, where available, examined policy measures implemented by government agencies. Results We investigated how the various jurisdictions deal with legal and regulatory issues that may arise from the use of health and medical information, with regard to the protection of data subjects' rights and of personal information. We compared and analyzed legislation across these jurisdictions and also considered technical methods such as de-identification. The main findings include the following: there is a need to streamline the relationship between the general data privacy regime and the regulatory regime governing health and medical big data; the regulatory and institutional structure for data governance should be more clearly delineated; and regulation should encourage the development of suitable methodologies for the de-identification of data, taking a principle-based and risk-based approach. Conclusions
The main conclusion of our comparative legal analyses is that the relationship between the legal requirements imposed for personal information protection and the regulatory requirements governing the use of health and medical data is complicated and multi-faceted; as such, it should be more clearly streamlined and delineated.
|
26
|
[Adaptation of the General Data Protection Regulation (GDPR) to a smartphone app for rhinitis and asthma (MASK-air®)]. Rev Mal Respir 2019; 36:1019-1031. [PMID: 31611024 DOI: 10.1016/j.rmr.2019.08.003] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/10/2019] [Accepted: 08/16/2019] [Indexed: 12/27/2022]
Abstract
The General Data Protection Regulation (GDPR) regulates the processing of personal data in the European Union. The legal context is adapted to follow the evolution of technologies and of society. This new European regulation became mandatory, especially for connected devices, on May 25, 2018. An app originally known as "The Allergy Diary" is available for Android phones and iPhones. Its name was recently changed to MASK-air. The downloading and use of this app are free of charge and there are no adverts. It enables users to record their symptoms and their medications to better track the progress of their allergic rhinitis and/or asthma. It has been developed by public (Foundation FMC VIA-LR, University of Montpellier) and private (KYomed INNOV) organizations based in France and therefore falls under French jurisdiction. This article summarizes the five main principles of personal data protection to be respected during the development of the app: purpose, proportionality and relevance, limited retention period, security and confidentiality, as well as the rights of the people who are involved in the management of the personal data (including withdrawal and modification).
|
27
|
Sharing De-identified Medical Images Electronically for Research: A Survey of Patients' Opinion Regarding Data Management. Can Assoc Radiol J 2019; 70:212-218. [PMID: 31376884 DOI: 10.1016/j.carj.2019.04.002] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/12/2018] [Revised: 04/04/2019] [Accepted: 04/08/2019] [Indexed: 10/26/2022] Open
Abstract
PURPOSE Secondary use of patient data has recently become of increasing interest for the development and application of computer analytic techniques. Strict oversight of these data is required, and patients themselves are integral to providing guidance. We sought to understand patients' attitudes toward sharing their imaging data for research purposes, as these images could provide a wealth of information for researchers. METHODS Patients from the Greater Toronto Area attending Sunnybrook Health Sciences Centre for imaging examinations (magnetic resonance imaging, computed tomography, or ultrasound) were invited to participate in an electronic survey. RESULTS Of the 1083 patients approached (computed tomography 609, ultrasound 314, and magnetic resonance imaging 160), 798 (74%) agreed to take the survey. The overall median age was 60 (interquartile range = 18; Q1 = 52, Q3 = 70), 52% were women, 42% had a university degree, and 7% had no high school diploma. In terms of willingness to share de-identified medical images for research, 76% were willing (agreed or strongly agreed), while 7% refused. Most participants gave their family physicians (73%) and other physicians (57%) unconditional data access. Participants chose hospitals/research institutions to regulate electronic image databases (70%); 89% wanted safeguards against unauthorized access to their data; and over 70% wanted control over who would be permitted access and for how long, along with the ability to revoke that permission. CONCLUSIONS Our study found that people are willing to share their clinically acquired de-identified medical images for research studies provided that they have control over permissions and duration of access.
|
28
|
A Study of Deep Learning Methods for De-identification of Clinical Notes at Cross Institute Settings. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS. IEEE INTERNATIONAL CONFERENCE ON HEALTHCARE INFORMATICS 2019; 2019:10.1109/ICHI.2019.8904544. [PMID: 31879734 PMCID: PMC6932867 DOI: 10.1109/ichi.2019.8904544] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/10/2023]
Abstract
In this study, we examined a deep learning method for de-identification of clinical notes at UF Health under a cross-institute setting. We developed deep learning models using 2014 i2b2/UTHealth corpus and evaluated the performance using clinical notes collected from UF Health. We compared four pre-trained word embeddings, including two embeddings from the general domain and two embeddings from the clinical domain. We also explored linguistic features (i.e., word shape and part-of-speech) to further improve the performance of de-identification. The experimental results show that the performance of deep learning models trained using i2b2/UTHealth corpus significantly dropped (strict and relax F1 scores dropped from 0.9547 and 0.9646 to 0.8360 and 0.8870) when applied to another corpus from a different institution (UF Health). Linguistic features, including word shapes and part-of-speech, could further improve the performance of de-identification in cross-institute settings (improved to 0.8527 and 0.9052).
|
29
|
Identification and classification of DICOM files with burned-in text content. Int J Med Inform 2019; 126:128-137. [PMID: 31029254 DOI: 10.1016/j.ijmedinf.2019.02.011] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/09/2018] [Revised: 02/12/2019] [Accepted: 02/19/2019] [Indexed: 11/23/2022]
Abstract
BACKGROUND DICOM files do not indicate, for various reasons, whether protected health information is burned into the pixel data, which complicates the secondary use of such data. In recent years there have been several attempts to anonymize or de-identify DICOM files, but existing approaches have different constraints and no completely reliable solution exists. Especially for large datasets, it is necessary to quickly analyse files and identify those potentially violating privacy. METHODS Classification is based on an adaptive-iterative algorithm designed to assign each file to one of three classes. Several image transformations, optical character recognition, and filters are applied, after which a local decision is made; a confirmed local decision is the final one. The classifier was trained on a dataset of 15,334 images of various modalities. RESULTS The false positive rates are below 4.00% in all cases, and 1.81% for the mission-critical problem of detecting protected health information. The classifier's weighted average recall was 94.85%, its weighted average inverse recall was 97.42%, and Cohen's kappa coefficient was 0.920. CONCLUSION The proposed novel approach to classifying burned-in text is highly configurable and able to analyse images from different modalities with noisy backgrounds. The solution was validated and is intended to identify DICOM files that need restricted access or thorough de-identification due to privacy issues. Unlike existing tools, the recognised text, including its coordinates, can be used for subsequent de-identification.
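A cheap pre-filter can decide which files merit the more expensive OCR stage. The heuristic below, counting bright pixels in the image border where burned-in annotations often sit, is an illustrative assumption of our own and not the paper's adaptive-iterative algorithm:

```python
import numpy as np

def border_text_score(img: np.ndarray, band: int = 16, thresh: float = 0.9) -> float:
    """Fraction of near-saturated pixels in the border band of a grayscale image.

    A crude, hypothetical pre-filter: burned-in annotations are often bright
    text near the edges, so a high score marks an image for the OCR stage.
    """
    img = img.astype(float) / img.max() if img.max() > 0 else img.astype(float)
    mask = np.zeros(img.shape, dtype=bool)
    mask[:band, :] = mask[-band:, :] = True   # top and bottom bands
    mask[:, :band] = mask[:, -band:] = True   # left and right bands
    return float((img[mask] > thresh).mean())

# Synthetic example: a dark image with a bright "annotation" strip on top.
img = np.zeros((128, 128), dtype=np.uint8)
img[2:10, 4:60] = 255
print(border_text_score(img) > 0.01)  # True -> send to OCR
```

Thresholds like `band` and `thresh` would need tuning per modality; the point is only that a fast filter can spare most images the full OCR pass.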
|
30
|
A machine learning based approach to identify protected health information in Chinese clinical text. Int J Med Inform 2018; 116:24-32. [PMID: 29887232 DOI: 10.1016/j.ijmedinf.2018.05.010] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/18/2017] [Revised: 04/19/2018] [Accepted: 05/17/2018] [Indexed: 11/24/2022]
Abstract
BACKGROUND With the increasing application of electronic health records (EHRs) worldwide, protecting private information in clinical text has drawn extensive attention from healthcare providers and researchers. De-identification, the process of identifying and removing protected health information (PHI) from clinical text, has been central to the discourse on medical privacy since 2006. While de-identification is becoming the global norm for handling medical records, there is a paucity of studies on its application to Chinese clinical text. Without efficient and effective privacy protection algorithms in place, the use of indispensable clinical information would be constrained. OBJECTIVES We aimed to (i) describe the current process for handling PHI in China, (ii) propose a machine learning based approach to identify PHI in Chinese clinical text, and (iii) validate the effectiveness of this approach for de-identification of Chinese clinical text. METHODS Based on 14,719 discharge summaries from regional health centers in Ya'an City, Sichuan Province, China, we built a conditional random fields (CRF) model to identify PHI in clinical text, and then used regular expressions to refine the recognition results for the PHI categories with fewer samples. RESULTS We constructed a Chinese clinical text corpus with PHI tags through substantial manual annotation; descriptive statistics showed that the PHI spanned a wide range of diverse categories. The evaluation showed, with a high F-measure of 0.9878, that our CRF-based model performed well at identifying PHI in Chinese clinical text. CONCLUSION The rapid adoption of EHRs in the health sector has created an urgent need for tools that can parse patient-specific information from Chinese clinical text. Our application of CRF algorithms for de-identification has shown the potential to meet this need by offering a highly accurate and flexible solution for analyzing Chinese clinical text.
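The regular-expression refinement step can be pictured as a second pass that catches rigidly formatted PHI the statistical model may miss. The patterns below are hypothetical stand-ins for illustration, not the paper's actual expressions:

```python
import re

# Hypothetical post-processing rules for PHI categories with few training
# samples; the patterns are illustrative, not taken from the paper.
PHI_PATTERNS = {
    "PHONE": re.compile(r"\b1[3-9]\d{9}\b"),    # mainland CN mobile number
    "ID_NUM": re.compile(r"\b\d{17}[\dXx]\b"),  # 18-character resident ID
}

def regex_phi_spans(text):
    """Return (category, start, end) spans that pattern matching recovers."""
    spans = []
    for cat, pat in PHI_PATTERNS.items():
        for m in pat.finditer(text):
            spans.append((cat, m.start(), m.end()))
    return sorted(spans, key=lambda s: s[1])

text = "联系电话 13812345678，身份证号 11010519900307123X。"
print(regex_phi_spans(text))
```

Merging these spans with the CRF output gives higher recall on sparse categories without retraining the model.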
|
31
|
Abstract
BACKGROUND In electronic medical records, de-identification is the first step toward using the records for data processing or further medical investigation. Consequently, a reliable automated de-identification system would be of high value. METHODS In this paper, a method combining text skeletons and recurrent neural networks is proposed to solve the de-identification problem. A text skeleton is the general structure of a medical record, which can help neural networks learn better. RESULTS We evaluated our method on three datasets: two English datasets from the i2b2 de-identification challenge and a Chinese dataset that we annotated. Empirical results show that the proposed text skeleton based method helps the network recognize protected health information. CONCLUSIONS Comparison between our method and state-of-the-art frameworks indicates that our method achieves high performance on medical record de-identification.
|
32
|
Abstract
Data sharing of large genomic databases and biorepositories provides researchers adequately powered samples to advance the goals of precision medicine. Data sharing may also introduce, however, participant privacy concerns, including possible reidentification. This study compares views of research participants, genetic researchers, and institutional review board (IRB) professionals regarding concerns about the use of de-identified data. An online survey was completed by cancer patients, their relatives, and controls from the Northwest Cancer Genetics Registry (n = 450), querying views about potential harms from the use of de-identified data. This was compared to our previous online national survey of human genetic researchers (n = 351) and IRB professionals (n = 208). Researchers were less likely to feel that participants would be personally identified or harmed by a study involving de-identified data, or that a federal agency might compel researchers to disclose information about research participants. Compared to genetic researchers, IRB professionals and participants were significantly more likely to express that personal identification or harm was likely or that researchers might be forced to disclose information by a federal agency. An understanding of the differences in views regarding possible harm from the use of de-identified data between these three important stakeholder groups is necessary to move forward with genomic research.
|
33
|
The research participant perspective related to the conduct of genomic cohort studies: A systematic review of the quantitative literature. Transl Behav Med 2018; 8:119-129. [PMID: 29385589 PMCID: PMC6065547 DOI: 10.1093/tbm/ibx056] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/29/2022] Open
Abstract
Observational genome-wide association studies require large sample sizes. Evaluating the interplay between genomic, environmental, and lifestyle factors can require even larger sample sizes. The All of Us Research Program will recruit 1 million participants to facilitate research on genomic, environmental, and lifestyle factors. Integrating participant preferences into the research process is a new paradigm and a necessary component of the All of Us Research Program. The purpose of the study is to summarize quantitative studies of participant preferences related to participation in observational genomic research studies, starting with consent through return of results. Integrating this information into the conduct of genomic studies may benefit participants, and improve participant satisfaction, recruitment, and retention. We conducted a systematic review of the literature regarding participant views related to reconsent and broad consent, use of de-identified data, contribution of data to a biorepository, risk of identification, return of individual genetic results, and motivation for participation in genomic studies. Twenty-three articles met our inclusion and exclusion criteria. Study results found that most participants support broad consent; however, significant differences related to reconsent preferences have been shown by gender and age. Most participants support the return of individual genomic results and do not feel it is necessary to maintain a link to their de-identified data. Reasons given for joining research studies varied by population source. These findings, in addition to the knowledge that participants are more accepting of broad informed consent methods when the rationale is explained, can assist in developing guidelines for future observational genomic research.
|
34
|
De-Identification of German Medical Admission Notes. Stud Health Technol Inform 2018; 253:165-169. [PMID: 30147065] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
Medical texts are a vast resource for medical and computational research. In contrast to newswire or Wikipedia texts, medical texts need to be de-identified before being made accessible to the wider NLP research community. We created a prototype for German medical text de-identification and named entity recognition using a three-step approach. First, we used well-known rule-based models based on regular expressions and gazetteers; second, we used a spelling variant detector based on Levenshtein distance, exploiting the fact that the medical texts contain semi-structured headers including sensitive personal data; and third, we trained a named entity recognition model on out-of-domain data to add statistical capabilities to our prototype. Relative to a baseline based on regular expressions and gazetteers, we improved the de-identification F2-score from 78% to 85%. Our prototype is a first step for further research on German medical text de-identification and shows that spelling variant detection and statistical models trained on out-of-domain data can improve de-identification performance significantly.
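A minimal sketch of the spelling-variant step: a plain dynamic-programming Levenshtein distance matched against a gazetteer harvested from the record header. The names and edit-distance threshold below are illustrative assumptions, not the prototype's actual configuration.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

def find_variants(tokens, gazetteer, max_dist=1):
    """Flag body tokens within max_dist edits of a header-derived name."""
    hits = []
    for tok in tokens:
        for name in gazetteer:
            if levenshtein(tok.lower(), name.lower()) <= max_dist:
                hits.append((tok, name))
                break
    return hits

# "Mueler" is a misspelling of the header name "Mueller" (hypothetical example).
print(find_variants(["Mueler", "Diagnose", "Anna"], ["Mueller", "Anna"]))
```

This catches PHI that exact gazetteer lookup misses because of typos or inflection in the note body.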
|
35
|
Controlled searching in reversibly de-identified medical imaging archives. J Biomed Inform 2017; 77:81-90. [PMID: 29224856 DOI: 10.1016/j.jbi.2017.12.002] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/24/2017] [Revised: 11/22/2017] [Accepted: 12/04/2017] [Indexed: 11/17/2022]
Abstract
Digital medical imaging has become a fundamental tool for medical diagnosis in healthcare. This growth has been accompanied by the development of technologies and standards, such as the DICOM standard and PACS. This environment has led to collaborative projects in which medical data must be shared between institutions for research and educational purposes. In this context, it is necessary to maintain patient data privacy while providing an easy and secure access mechanism for authorized personnel. This paper presents a solution that fully de-identifies standard medical imaging objects, including metadata and pixel data, while providing a reversible de-identification mechanism that retains search capabilities over the original data. The latter feature is important in some scenarios, for instance in collaborative platforms where data is anonymized when shared with the community but remains searchable for data custodians or authorized entities. The solution was integrated into an open source PACS archive and validated in a multidisciplinary collaborative scenario.
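One way such a reversible, search-preserving scheme can work is deterministic keyed pseudonymization: equal identifiers always map to equal tokens, so the de-identified archive stays searchable, while a custodian-held lookup table permits authorized reversal. This is a hedged sketch of the general idea, not the paper's exact mechanism.

```python
import hmac, hashlib

class ReversiblePseudonymizer:
    """Deterministic keyed pseudonyms (HMAC-SHA256, truncated). The same
    input always yields the same token, enabling equality search; the
    private reverse table lets the data custodian re-identify on demand."""

    def __init__(self, key: bytes):
        self._key = key
        self._reverse = {}  # pseudonym -> original, held only by the custodian

    def pseudonymize(self, value: str) -> str:
        token = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()[:16]
        self._reverse[token] = value
        return token

    def reidentify(self, token: str) -> str:
        return self._reverse[token]

p = ReversiblePseudonymizer(b"custodian-secret")
t1 = p.pseudonymize("PAT-000123")
t2 = p.pseudonymize("PAT-000123")
print(t1 == t2, p.reidentify(t1) == "PAT-000123")  # prints: True True
```

Because the mapping is deterministic, a query for a pseudonymized patient ID matches every record of that patient without ever exposing the original identifier.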
|
36
|
A hybrid approach to automatic de-identification of psychiatric notes. J Biomed Inform 2017; 75S:S19-S27. [PMID: 28602904 PMCID: PMC5705430 DOI: 10.1016/j.jbi.2017.06.006] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.9] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2017] [Revised: 06/02/2017] [Accepted: 06/05/2017] [Indexed: 11/17/2022]
Abstract
De-identification, or identifying and removing protected health information (PHI) from clinical data, is a critical step in making clinical data available for clinical applications and research. This paper presents a natural language processing system for automatic de-identification of psychiatric notes, which was designed to participate in the 2016 CEGS N-GRID shared task Track 1. The system has a hybrid structure that combines machine learning techniques and rule-based approaches. The rule-based components exploit the structure of the psychiatric notes as well as characteristic surface patterns of PHI mentions. The machine learning components utilize supervised learning with rich features. In addition, system performance was boosted by integrating additional data into the training set through domain adaptation. The hybrid system achieved an overall micro-averaged F-score of 90.74 on the test set, the second-best result among all participants of the CEGS N-GRID task.
|
37
|
De-identification of medical records using conditional random fields and long short-term memory networks. J Biomed Inform 2017; 75S:S43-S53. [PMID: 29032162 DOI: 10.1016/j.jbi.2017.10.003] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/31/2017] [Revised: 09/30/2017] [Accepted: 10/03/2017] [Indexed: 10/18/2022]
Abstract
The CEGS N-GRID 2016 Shared Task 1 in Clinical Natural Language Processing focuses on the de-identification of psychiatric evaluation records. This paper describes two participating systems of our team, based on conditional random fields (CRFs) and long short-term memory networks (LSTMs). A pre-processing module was introduced for sentence detection and tokenization before de-identification. For CRFs, manually extracted rich features were utilized to train the model. For LSTMs, a character-level bi-directional LSTM network was applied to represent tokens and classify tags for each token, following which a decoding layer was stacked to decode the most probable protected health information (PHI) terms. The LSTM-based system attained an i2b2 strict micro-F1 measure of 0.8986, which was higher than that of the CRF-based system.
|
38
|
A cascaded approach for Chinese clinical text de-identification with less annotation effort. J Biomed Inform 2017; 73:76-83. [PMID: 28756160 PMCID: PMC5583002 DOI: 10.1016/j.jbi.2017.07.017] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/13/2017] [Revised: 07/09/2017] [Accepted: 07/25/2017] [Indexed: 11/28/2022]
Abstract
With the rapid adoption of Electronic Health Records (EHRs) in China, an increasing amount of clinical data has become available to support clinical research. Secondary use of clinical data usually requires de-identification of personal information to protect patient privacy. Since manual de-identification of free clinical text requires a significant amount of human work, developing an automated de-identification system is necessary. While many de-identification systems are available for English clinical text, designing one for Chinese clinical text faces many challenges, such as the unavailability of necessary lexical resources and the sparsity of patient health information (PHI) in Chinese clinical text. In this paper, we designed a de-identification pipeline taking advantage of both rule-based and machine learning techniques. Our method, in particular, can effectively construct a dataset with dense PHI information, which significantly saves annotation time for subsequent supervised learning. We experimented on a dataset of 3000 heterogeneous clinical documents to evaluate the annotation cost and the de-identification performance. Our approach increases the efficiency of the annotation effort by over 60% while reaching performance above 90% as measured by F-score. We demonstrate that combining rule-based and machine learning techniques is an effective way to reduce annotation cost and achieve high performance in the Chinese clinical text de-identification task.
|
39
|
Learning to identify Protected Health Information by integrating knowledge- and data-driven algorithms: A case study on psychiatric evaluation notes. J Biomed Inform 2017; 75S:S28-S33. [PMID: 28602908 DOI: 10.1016/j.jbi.2017.06.005] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/02/2017] [Revised: 06/01/2017] [Accepted: 06/05/2017] [Indexed: 10/19/2022]
Abstract
De-identification of clinical narratives is one of the main obstacles to making healthcare free text available for research. In this paper we describe our experience in expanding and tailoring two existing tools as part of the 2016 CEGS N-GRID Shared Tasks Track 1, which evaluated de-identification methods on a set of psychiatric evaluation notes for up to 25 different types of Protected Health Information (PHI). The methods we used rely on machine learning over either a large or small feature space, with additional strategies, including two-pass tagging and multi-class models, both of which proved beneficial. The results show that the integration of the proposed methods can identify Health Insurance Portability and Accountability Act (HIPAA)-defined PHI with overall F1-scores of ∼90% and above. Yet, some classes (Profession, Organization) again proved challenging, given the variability of expressions used to reference such information.
|
40
|
De-identification of clinical notes via recurrent neural network and conditional random field. J Biomed Inform 2017; 75S:S34-S42. [PMID: 28579533 DOI: 10.1016/j.jbi.2017.05.023] [Citation(s) in RCA: 55] [Impact Index Per Article: 7.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/29/2017] [Revised: 05/26/2017] [Accepted: 05/30/2017] [Indexed: 11/26/2022]
Abstract
De-identification, the removal of identifying information such as protected health information (PHI) from clinical data, is a critical step in enabling data to be shared or published. The 2016 Centers of Excellence in Genomic Science (CEGS) Neuropsychiatric Genome-scale and RDOC Individualized Domains (N-GRID) clinical natural language processing (NLP) challenge includes a track (track 1) on de-identifying electronic medical records (EMRs). The challenge organizers provided 1000 annotated mental health records for this track, 600 of which were used as a training set and 400 as a test set. We developed a hybrid system for the de-identification task. First, four individual subsystems are used to identify PHI instances: a subsystem based on bidirectional LSTM (long short-term memory, a variant of recurrent neural network), a subsystem based on bidirectional LSTM with features, a subsystem based on conditional random fields (CRF), and a rule-based subsystem. Then, an ensemble-learning-based classifier is deployed to combine the PHI instances predicted by the three machine-learning-based subsystems. Finally, the results of the ensemble classifier and the rule-based subsystem are merged. Experiments conducted on the official test set show that our system achieves the highest micro F1-scores of 93.07%, 91.43% and 95.23% under the "token", "strict" and "binary token" criteria respectively, ranking first in the 2016 CEGS N-GRID NLP challenge. In addition, on the dataset of the 2014 i2b2 NLP challenge, our system achieves the highest micro F1-scores of 96.98%, 95.11% and 98.28% under the same criteria, outperforming other state-of-the-art systems. These experiments demonstrate the effectiveness of our proposed method.
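The merging step can be approximated by a simple per-token majority vote over the machine-learning subsystems, with rule-based hits taking precedence. The paper trains a learned ensemble classifier, so this is a deliberately simplified stand-in with hypothetical BIO tag names.

```python
from collections import Counter

def ensemble_merge(token_predictions, rule_tags):
    """Per-token majority vote over several ML subsystems, then merge
    with rule-based output (a non-O rule tag overrides the vote)."""
    merged = []
    for votes, rule in zip(token_predictions, rule_tags):
        tag, _ = Counter(votes).most_common(1)[0]
        merged.append(rule if rule != "O" else tag)
    return merged

# Three hypothetical subsystems tag four tokens; the rule layer catches a DATE.
ml = [("B-NAME", "B-NAME", "O"), ("O", "O", "O"), ("O", "B-DATE", "O"), ("O", "O", "O")]
rules = ["O", "O", "B-DATE", "O"]
print(ensemble_merge(ml, rules))  # ['B-NAME', 'O', 'B-DATE', 'O']
```

Voting suppresses idiosyncratic errors of any single model, while the rule override preserves high-precision pattern matches the statistical models miss.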
|
41
|
Clinical records anonymisation and text extraction (CRATE): an open-source software system. BMC Med Inform Decis Mak 2017; 17:50. [PMID: 28441940 PMCID: PMC5405523 DOI: 10.1186/s12911-017-0437-1] [Citation(s) in RCA: 30] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/06/2016] [Accepted: 03/30/2017] [Indexed: 11/24/2022] Open
Abstract
Background Electronic medical records contain information of value for research, but they also contain identifiable and often highly sensitive confidential information. Patient-identifiable information cannot in general be shared outside clinical care teams without explicit consent, but anonymisation/de-identification allows research uses of clinical data without explicit consent. Results This article presents CRATE (Clinical Records Anonymisation and Text Extraction), an open-source software system with separable functions: (1) it anonymises or de-identifies arbitrary relational databases, with sensitivity and precision similar to previous comparable systems; (2) it uses public secure cryptographic methods to map patient identifiers to research identifiers (pseudonyms); (3) it connects relational databases to external tools for natural language processing; (4) it provides a web front end for research and administrative functions; and (5) it supports a specific model through which patients may consent to be contacted about research. Conclusions Creation and management of a research database from sensitive clinical records, with secure pseudonym generation, full-text indexing, and a consent-to-contact process, is possible and practical using entirely free and open-source software.
|
42
|
A De-Identification Pipeline for Ultrasound Medical Images in DICOM Format. J Med Syst 2017; 41:89. [PMID: 28405948 DOI: 10.1007/s10916-017-0736-1] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/16/2016] [Accepted: 04/03/2017] [Indexed: 11/24/2022]
Abstract
Clinical data sharing between healthcare institutions and between practitioners is often hindered by privacy protection requirements. This problem is critical in collaborative scenarios where data sharing is fundamental to establishing a workflow among parties. Anonymizing patient information burned into DICOM images requires elaborate processes somewhat more complex than simple de-identification of textual information. Usually, before sharing, specific areas of the images containing sensitive information must be manually removed. In this paper, we present a pipeline for ultrasound medical image de-identification, provided as a free anonymization REST service for medical image applications and as a Software-as-a-Service to streamline automatic de-identification of medical images, which is freely available to end-users. The proposed approach applies image processing functions and machine-learning models to build an automatic system for anonymizing medical images. To perform character recognition, we evaluated several machine-learning models, with Convolutional Neural Networks (CNNs) selected as the best approach. To assess the system's quality, 500 processed images were manually inspected, showing an anonymization rate of 89.2%. The tool can be accessed at https://bioinformatics.ua.pt/dicom/anonymizer and is compatible with the most recent versions of Google Chrome, Mozilla Firefox and Safari. A Docker image containing the proposed service is also publicly available to the community.
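The final redaction step, blanking pixels inside detected character boxes, can be sketched over a plain 2-D pixel grid. The box format and fill value here are assumptions for illustration, not the pipeline's actual API.

```python
def mask_regions(image, boxes, fill=0):
    """Blank out detected text bounding boxes in a 2-D pixel grid
    (list of row lists). Each box is (top, left, bottom, right),
    with bottom/right exclusive."""
    for top, left, bottom, right in boxes:
        for y in range(top, bottom):
            for x in range(left, right):
                image[y][x] = fill
    return image

img = [[255] * 6 for _ in range(4)]
mask_regions(img, [(0, 0, 2, 3)])  # mask a 2x3 burned-in text region
print(img[0][:4], img[2][0])  # prints: [0, 0, 0, 255] 255
```

In the real pipeline the boxes would come from the character-recognition model; the masking itself is deliberately destructive, since redacted pixels must not be recoverable.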
|
43
|
De-identified genomic data sharing: the research participant perspective. J Community Genet 2017; 8:173-181. [PMID: 28382417 DOI: 10.1007/s12687-017-0300-1] [Citation(s) in RCA: 41] [Impact Index Per Article: 5.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/06/2016] [Accepted: 03/24/2017] [Indexed: 02/06/2023] Open
Abstract
Combining datasets into larger, separate datasets is becoming increasingly common, and personal identifiers are often removed in order to maintain participant anonymity. The views of research participants on the use of de-identified data in large research datasets are important for future projects, such as the Precision Medicine Initiative and the Cancer Moonshot Initiative. This quantitative study, set in the USA, examines participant preferences and evaluates differences by demographics and cancer history. Study participants were recruited from the Northwest Cancer Genetics Registry and included cancer patients, their relatives, and controls. A secure online survey was administered to 450 participants. While the majority of participants were not concerned about personal identification when participating in a genetic study using de-identified data, they expected researchers to protect their privacy and information. Most participants wanted their data to be available for as many research studies as possible, in part to increase their chance of receiving personal health information. About 20% of participants felt that a link should not be maintained between the participant and their de-identified data. Reasons to maintain a link included the ability to return individual health results and the ability to support further research. Knowledge of participants' attitudes regarding the contribution of data to a research repository and the maintenance of a link to de-identified data is critical to the success of recruitment into future genomic research projects.
|
44
|
A unified framework for evaluating the risk of re-identification of text de-identification tools. J Biomed Inform 2016; 63:174-183. [PMID: 27426236 DOI: 10.1016/j.jbi.2016.07.015] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/10/2015] [Revised: 06/14/2016] [Accepted: 07/13/2016] [Indexed: 10/21/2022]
Abstract
OBJECTIVES It has become regular practice to de-identify unstructured medical text for use in research using automatic methods, the goal of which is to remove patient identifying information to minimize re-identification risk. The metrics commonly used to determine if these systems are performing well do not accurately reflect the risk of a patient being re-identified. We therefore developed a framework for measuring the risk of re-identification associated with textual data releases. METHODS We apply the proposed evaluation framework to a data set from the University of Michigan Medical School. Our risk assessment results are then compared with those that would be obtained using a typical contemporary micro-average evaluation of recall in order to illustrate the difference between the proposed evaluation framework and the current baseline method. RESULTS We demonstrate how this framework compares against common measures of the re-identification risk associated with an automated text de-identification process. For the probability of re-identification using our evaluation framework we obtained a mean value for direct identifiers of 0.0074 and a mean value for quasi-identifiers of 0.0022. The 95% confidence intervals for these estimates were below the relevant thresholds. The threshold for direct identifier risk was based on previously used approaches in the literature. The threshold for quasi-identifiers was determined based on the context of the data release following commonly used de-identification criteria for structured data. DISCUSSION Our framework attempts to correct for poorly distributed evaluation corpora, accounts for the data release context, and avoids the often optimistic assumptions that are made using the more traditional evaluation approach. It therefore provides a more realistic estimate of the true probability of re-identification.
CONCLUSIONS This framework should be used as a basis for computing re-identification risk in order to more realistically evaluate future text de-identification tools.
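The intuition behind risk-based evaluation can be illustrated with the common prosecutor model, in which a record belonging to an equivalence class of k indistinguishable records is re-identified with probability 1/k. This toy estimator is not the paper's framework, which additionally accounts for release context and identifier types.

```python
def mean_reid_risk(equivalence_class_sizes):
    """Mean per-record re-identification probability under the prosecutor
    model: each record in a class of size k carries risk 1/k, so summing
    k * (1/k) over classes gives one expected hit per class, and the mean
    equals (number of classes) / (number of records)."""
    return len(equivalence_class_sizes) / sum(equivalence_class_sizes)

# Four equivalence classes over quasi-identifiers, 20 records in total.
print(mean_reid_risk([1, 2, 5, 12]))  # prints: 0.2
```

The singleton class (size 1) is the dangerous one: its record is re-identified with certainty, which is why risk frameworks weight small classes so heavily rather than averaging recall over all tokens.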
|
45
|
Abstract
BACKGROUND Greater transparency and, in particular, sharing of patient-level data for further scientific research is an increasingly important topic for the pharmaceutical industry and other organisations who sponsor and conduct clinical trials as well as generally in the interests of patients participating in studies. A concern remains, however, over how to appropriately prepare and share clinical trial data with third party researchers, whilst maintaining patient confidentiality. Clinical trial datasets contain very detailed information on each participant. Risk to patient privacy can be mitigated by data reduction techniques. However, retention of data utility is important in order to allow meaningful scientific research. In addition, for clinical trial data, an excessive application of such techniques may pose a public health risk if misleading results are produced. After considering existing guidance, this article makes recommendations with the aim of promoting an approach that balances data utility and privacy risk and is applicable across clinical trial data holders. DISCUSSION Our key recommendations are as follows: 1. Data anonymisation/de-identification: Data holders are responsible for generating de-identified datasets which are intended to offer increased protection for patient privacy through masking or generalisation of direct and some indirect identifiers. 2. Controlled access to data, including use of a data sharing agreement: A legally binding data sharing agreement should be in place, including agreements not to download or further share data and not to attempt to seek to identify patients. Appropriate levels of security should be used for transferring data or providing access; one solution is use of a secure 'locked box' system which provides additional safeguards. 
SUMMARY This article provides recommendations on best practices to de-identify/anonymise clinical trial data for sharing with third-party researchers, as well as controlled access to data and data sharing agreements. The recommendations are applicable to all clinical trial data holders. Further work will be needed to identify and evaluate competing possibilities as regulations, attitudes to risk and technologies evolve.
|
46
|
Efficient and effective pruning strategies for health data de-identification. BMC Med Inform Decis Mak 2016; 16:49. [PMID: 27130179 PMCID: PMC4851781 DOI: 10.1186/s12911-016-0287-2] [Citation(s) in RCA: 16] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/17/2015] [Accepted: 04/21/2016] [Indexed: 11/10/2022] Open
Abstract
Background Privacy must be protected when sensitive biomedical data is shared, e.g. for research purposes. Data de-identification is an important safeguard, where datasets are transformed to meet two conflicting objectives: minimizing re-identification risks while maximizing data quality. Typically, de-identification methods search a solution space of possible data transformations to find a good solution to a given de-identification problem. In this process, parts of the search space must be excluded to maintain scalability. Objectives The set of transformations which are solution candidates is typically narrowed down by storing the results obtained during the search process and then using them to predict properties of the output of other transformations in terms of privacy (first objective) and data quality (second objective). However, due to the exponential growth of the size of the search space, previous implementations of this method are not well-suited when datasets contain many attributes which need to be protected. As this is often the case with biomedical research data, e.g. as a result of longitudinal collection, we have developed a novel method. Methods Our approach combines the mathematical concept of antichains with a data structure inspired by prefix trees to represent properties of a large number of data transformations while requiring only a minimal amount of information to be stored. To analyze the improvements which can be achieved by adopting our method, we have integrated it into an existing algorithm and we have also implemented a simple best-first branch and bound search (BFS) algorithm as a first step towards methods which fully exploit our approach. We have evaluated these implementations with several real-world datasets and the k-anonymity privacy model. 
Results When integrated into existing de-identification algorithms for low-dimensional data, our approach reduced memory requirements by up to one order of magnitude and execution times by up to 25 %. This allowed us to increase the size of solution spaces which could be processed by almost a factor of 10. When using the simple BFS method, we were able to further increase the size of the solution space by a factor of three. When used as a heuristic strategy for high-dimensional data, the BFS approach outperformed a state-of-the-art algorithm by up to 12 % in terms of the quality of output data. Conclusions This work shows that implementing methods of data de-identification for real-world applications is a challenging task. Our approach solves a problem often faced by data custodians: a lack of scalability of de-identification software when used with datasets having realistic schemas and volumes. The method described in this article has been implemented into ARX, an open source de-identification software for biomedical data.
|
47
|
Image De-Identification Methods for Clinical Research in the XDS Environment. J Med Syst 2016; 40:83. [PMID: 26811074 PMCID: PMC4728177 DOI: 10.1007/s10916-016-0431-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/16/2014] [Accepted: 01/07/2016] [Indexed: 12/05/2022]
Abstract
To investigate possible de-identification methodologies within the Cross-Enterprise Document Sharing for imaging (XDS-I) environment in order to provide strengthened support for image data exchange as part of clinical research projects. De-identification, using anonymization or pseudonymization, is the most common method of removing information from DICOM data; however, it is not a standard part of the XDS-I profiles. Different methodologies were examined to define how and where de-identification should take place within an XDS environment used for scientific research. A de-identification service can be placed in three locations within the XDS-I framework: (1) within the Document Source, (2) between the Document Source and Document Consumer, and (3) within the Document Consumer. The first method has a potential advantage with respect to limiting the exposure of the images to outside systems, but has drawbacks with respect to additional hardware and configuration requirements. The second and third methods raise a major concern: original documents leave the Document Source with all identifiable data intact. De-identification within the Document Source therefore has more advantages than the other methods. Conversely, de-identification within the Document Consumer is the least recommended, since it carries the highest risk of exposing patient identity, because images are transferred without de-identification.
|
48
|
Hidden Markov model using Dirichlet process for de-identification. J Biomed Inform 2015; 58 Suppl:S60-S66. [PMID: 26407642 DOI: 10.1016/j.jbi.2015.09.004] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2015] [Revised: 09/02/2015] [Accepted: 09/03/2015] [Indexed: 10/23/2022]
Abstract
For the 2014 i2b2/UTHealth de-identification challenge, we introduced a new non-parametric Bayesian hidden Markov model using a Dirichlet process (HMM-DP). The model aims to reduce task-specific feature engineering and to generalize well to new data. For the challenge we developed a variational method to learn the model and an efficient approximation algorithm for prediction. To accommodate out-of-vocabulary words, we designed a number of feature functions to model such words. The results show the model is capable of using local context cues to make correct predictions without manual feature engineering and performs as accurately as state-of-the-art conditional random field models in a number of categories. To incorporate long-range and cross-document context cues, we developed a skip-chain conditional random field model to align the results produced by HMM-DP, which further improved performance.
|
49
|
The cost of quality: Implementing generalization and suppression for anonymizing biomedical data with minimal information loss. J Biomed Inform 2015; 58:37-48. [PMID: 26385376 DOI: 10.1016/j.jbi.2015.09.007] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/15/2015] [Revised: 08/07/2015] [Accepted: 09/04/2015] [Indexed: 12/01/2022]
Abstract
OBJECTIVE With the ARX data anonymization tool, structured biomedical data can be de-identified using syntactic privacy models, such as k-anonymity. Data is transformed with two methods: (a) generalization of attribute values, followed by (b) suppression of data records. The former method results in data that is well suited for analyses by epidemiologists, while the latter method significantly reduces loss of information. Our tool uses an optimal anonymization algorithm that maximizes output utility according to a given measure. To achieve scalability, existing optimal anonymization algorithms exclude parts of the search space by predicting the outcome of data transformations regarding privacy and utility without explicitly applying them to the input dataset. These optimizations cannot be used if data is transformed with generalization and suppression. As optimal data utility and scalability are both important for anonymizing biomedical data, we had to develop a novel method. METHODS In this article, we first confirm experimentally that combining generalization with suppression significantly increases data utility. Next, we prove that, within this coding model, the outcome of data transformations regarding privacy and utility cannot be predicted. As a consequence, existing algorithms fail to deliver optimal data utility. We confirm this finding experimentally. The limitation of previous work can be overcome at the cost of increased computational complexity. However, scalability is important for anonymizing data with user feedback. Consequently, we identify properties of datasets that may be predicted in our context and propose a novel and efficient algorithm. Finally, we evaluate our solution with multiple datasets and privacy models. RESULTS This work presents the first thorough investigation of which properties of datasets can be predicted when data is anonymized with generalization and suppression. Our novel approach adapts existing optimization strategies to our context and combines different search methods. The experiments show that our method is able to efficiently solve a broad spectrum of anonymization problems. CONCLUSION Our work shows that implementing syntactic privacy models is challenging and that existing algorithms are not well suited for anonymizing data with transformation models that are more complex than generalization alone. As such models have been recommended for use in the biomedical domain, our results are of general relevance for de-identifying structured biomedical data.
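The two-step coding model described above (generalize quasi-identifier values, then suppress records whose equivalence class is still smaller than k) can be sketched in a few lines. This is a minimal illustration of the transformation model, not ARX's actual search algorithm; the attribute choices and binning rules are arbitrary examples.

```python
from collections import Counter

def generalize(record):
    """Generalize quasi-identifiers: age to a decade range,
    ZIP code to its first three digits."""
    age, zip_code = record
    lo = age // 10 * 10
    return (f"{lo}-{lo + 9}", zip_code[:3] + "**")

def anonymize(records, k=2):
    """Apply generalization to every record, then suppress records
    whose equivalence class still contains fewer than k members."""
    generalized = [generalize(r) for r in records]
    sizes = Counter(generalized)
    return [g for g in generalized if sizes[g] >= k]

# The third record is alone in its class after generalization,
# so suppression removes it to satisfy 2-anonymity.
data = [(23, "12345"), (27, "12399"), (61, "98765")]
print(anonymize(data, k=2))
```

The paper's point is visible even here: whether a record survives depends on class sizes *after* suppression, so the utility of a generalization scheme cannot be predicted without materializing the transformation.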
|
50
|
Annotating longitudinal clinical narratives for de-identification: The 2014 i2b2/UTHealth corpus. J Biomed Inform 2015; 58 Suppl:S20-S29. [PMID: 26319540 DOI: 10.1016/j.jbi.2015.07.020] [Citation(s) in RCA: 70] [Impact Index Per Article: 7.8] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/10/2015] [Revised: 07/24/2015] [Accepted: 07/26/2015] [Indexed: 11/18/2022]
Abstract
The 2014 i2b2/UTHealth natural language processing shared task featured a track focused on the de-identification of longitudinal medical records. For this track, we de-identified a set of 1304 longitudinal medical records describing 296 patients. The corpus was de-identified under a broad interpretation of the HIPAA guidelines using double annotation followed by arbitration, rounds of sanity checking, and proofreading. The average token-based F1 measure for the annotators compared with the gold standard was 0.927. The resulting annotations were used both to de-identify the data and to set the gold standard for the de-identification track of the 2014 i2b2/UTHealth shared task. All annotated private health information was replaced with realistic surrogates automatically and then read over and corrected manually. The resulting corpus is the first of its kind made available for de-identification research. It was first used for the 2014 i2b2/UTHealth shared task, during which the systems achieved a mean F-measure of 0.872 and a maximum F-measure of 0.964 using entity-based micro-averaged evaluations.
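A token-based F1 measure of the kind used to compare annotators against the gold standard can be computed as below. The label scheme and the exact matching convention (non-'O' labels count as positives, mismatches count as both a false positive and a false negative) are illustrative assumptions, not necessarily the corpus authors' exact scoring script.

```python
def token_f1(gold, pred):
    """Micro-averaged token-level F1 over two aligned label sequences,
    treating every non-'O' label as a positive."""
    tp = sum(1 for g, p in zip(gold, pred) if g == p and g != "O")
    fp = sum(1 for g, p in zip(gold, pred) if p != "O" and g != p)
    fn = sum(1 for g, p in zip(gold, pred) if g != "O" and g != p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

gold = ["NAME", "O", "DATE", "O", "NAME"]
pred = ["NAME", "O", "O",    "O", "DATE"]
print(round(token_f1(gold, pred), 3))  # precision 0.5, recall 1/3 -> F1 0.4
```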
|