1
Kovačević A, Bašaragin B, Milošević N, Nenadić G. De-identification of clinical free text using natural language processing: A systematic review of current approaches. Artif Intell Med 2024; 151:102845. [PMID: 38555848 DOI: 10.1016/j.artmed.2024.102845] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/20/2023] [Revised: 03/13/2024] [Accepted: 03/18/2024] [Indexed: 04/02/2024]
Abstract
BACKGROUND: Electronic health records (EHRs) are a valuable resource for data-driven medical research. However, the presence of protected health information (PHI) makes EHRs unsuitable to be shared for research purposes. De-identification, i.e. the process of removing PHI, is a critical step in making EHR data accessible. Natural language processing has repeatedly demonstrated its feasibility in automating the de-identification process.
OBJECTIVES: Our study aims to provide systematic evidence on how the de-identification of clinical free text written in English has evolved in the last thirteen years, and to report on the performances and limitations of the current state-of-the-art systems for the English language. In addition, we aim to identify challenges and potential research opportunities in this field.
METHODS: A systematic search in PubMed, Web of Science, and the DBLP was conducted for studies published between January 2010 and February 2023. Titles and abstracts were examined to identify the relevant studies. Selected studies were then analysed in-depth, and information was collected on de-identification methodologies, data sources, and measured performance.
RESULTS: A total of 2125 publications were identified for the title and abstract screening. 69 studies were found to be relevant. Machine learning (37 studies) and hybrid (26 studies) approaches are predominant, while six studies relied only on rules. The majority of the approaches were trained and evaluated on public corpora. The 2014 i2b2/UTHealth corpus is the most frequently used (36 studies), followed by the 2006 i2b2 (18 studies) and 2016 CEGS N-GRID (10 studies) corpora.
CONCLUSION: Earlier de-identification approaches aimed at English were mainly rule and machine learning hybrids with extensive feature engineering and post-processing, while more recent performance improvements are due to feature-inferring recurrent neural networks. Current leading performance is achieved using attention-based neural models. Recent studies report state-of-the-art F1-scores (over 98%) when evaluated in the manner usually adopted by the clinical natural language processing community. However, their performance needs to be more thoroughly assessed with different measures to judge their reliability to safely de-identify data in a real-world setting. Without additional manually labeled training data, state-of-the-art systems fail to generalise well across a wide range of clinical sub-domains.
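The F1-scores quoted in this abstract are typically computed per PHI entity. The following is a minimal sketch of such a calculation, assuming annotations are stored as (start, end, type) tuples and that only exact matches count as correct. The function name, the span representation, and the sample spans are hypothetical and are not taken from any of the reviewed systems.

```python
from typing import Set, Tuple

# Hypothetical representation: (start offset, end offset, PHI type)
Span = Tuple[int, int, str]


def entity_prf(gold: Set[Span], predicted: Set[Span]) -> Tuple[float, float, float]:
    """Return (precision, recall, F1) under strict exact-match scoring."""
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Illustrative annotations only: two of three gold entities are found,
# and one predicted entity is spurious.
gold = {(0, 8, "NAME"), (20, 30, "DATE"), (45, 57, "PHONE")}
predicted = {(0, 8, "NAME"), (20, 30, "DATE"), (60, 65, "ID")}
precision, recall, f1 = entity_prf(gold, predicted)
```

Reporting recall alongside F1 may be useful in this setting, because every missed entity is a piece of PHI left in the text.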
Affiliation(s)
- Aleksandar Kovačević
  - The University of Novi Sad, Faculty of Technical Sciences, Trg Dositeja Obradovića 6, 21002 Novi Sad, Serbia
- Bojana Bašaragin
  - The Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia
- Nikola Milošević
  - The Institute for Artificial Intelligence Research and Development of Serbia, Fruškogorska 1, 21000 Novi Sad, Serbia; Bayer A.G., Research and Development, Mullerstrasse 173, Berlin 13342, Germany
- Goran Nenadić
  - The University of Manchester, Department of Computer Science, Manchester, United Kingdom
2
Negash B, Katz A, Neilson CJ, Moni M, Nesca M, Singer A, Enns JE. De-identification of free text data containing personal health information: a scoping review of reviews. Int J Popul Data Sci 2023; 8:2153. [PMID: 38414537 PMCID: PMC10898315 DOI: 10.23889/ijpds.v8i1.2153] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/29/2024]
Abstract
Introduction: Using data in research often requires that the data first be de-identified, particularly in the case of health data, which often include Personal Identifiable Information (PII) and/or Personal Health Identifying Information (PHII). There are established procedures for de-identifying structured data, but de-identifying clinical notes, electronic health records, and other records that include free text data is more complex. Several different ways to achieve this are documented in the literature. This scoping review identifies categories of de-identification methods that can be used for free text data.
Methods: We adopted an established scoping review methodology to examine review articles published up to May 9, 2022, in Ovid MEDLINE; Ovid Embase; Scopus; the ACM Digital Library; IEEE Xplore; and Compendex. Our research question was: What methods are used to de-identify free text data? Two independent reviewers conducted title and abstract screening and full-text article screening using the online review management tool Covidence.
Results: The initial literature search retrieved 3,312 articles, most of which focused primarily on structured data. Eighteen publications describing methods of de-identification of free text data met the inclusion criteria for our review. The majority of the included articles focused on removing categories of personal health information identified by the Health Insurance Portability and Accountability Act (HIPAA). The de-identification methods they described combined rule-based methods or machine learning with other strategies such as deep learning.
Conclusion: Our review identifies and categorises de-identification methods for free text data as rule-based methods, machine learning, deep learning and a combination of these and other approaches. Most of the articles we found in our search refer to de-identification methods that target some or all categories of PHII. Our review also highlights how de-identification systems for free text data have evolved over time and points to hybrid approaches as the most promising approach for the future.
Affiliation(s)
- Bekelu Negash
  - Manitoba Centre for Health Policy, Department of Community Health Sciences, Rady Faculty of Health Sciences, University of Manitoba
- Alan Katz
  - Manitoba Centre for Health Policy, Department of Community Health Sciences, Rady Faculty of Health Sciences, University of Manitoba
  - Department of Family Medicine, Rady Faculty of Health Sciences, University of Manitoba
- Moniruzzaman Moni
  - George & Fay Yee Centre for Healthcare Innovation, Department of Community Health Sciences, Rady Faculty of Health Sciences, University of Manitoba
- Marcello Nesca
  - Manitoba Centre for Health Policy, Department of Community Health Sciences, Rady Faculty of Health Sciences, University of Manitoba
- Alexander Singer
  - Department of Family Medicine, Rady Faculty of Health Sciences, University of Manitoba
- Jennifer E. Enns
  - Manitoba Centre for Health Policy, Department of Community Health Sciences, Rady Faculty of Health Sciences, University of Manitoba
3
Radhakrishnan L, Schenk G, Muenzen K, Oskotsky B, Ashouri Choshali H, Plunkett T, Israni S, Butte AJ. A certified de-identification system for all clinical text documents for information extraction at scale. JAMIA Open 2023; 6:ooad045. [PMID: 37416449 PMCID: PMC10320112 DOI: 10.1093/jamiaopen/ooad045] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/20/2022] [Revised: 03/25/2023] [Accepted: 06/27/2023] [Indexed: 07/08/2023]
Abstract
Objectives: Clinical notes are a veritable treasure trove of information on a patient's disease progression, medical history, and treatment plans, yet are locked in secured databases accessible for research only after extensive ethics review. Removing personally identifying and protected health information (PII/PHI) from the records can reduce the need for additional Institutional Review Board (IRB) reviews. In this project, our goals were to: (1) develop a robust and scalable clinical text de-identification pipeline that is compliant with the Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule for de-identification standards and (2) share routinely updated de-identified clinical notes with researchers.
Materials and Methods: Building on our open-source de-identification software called Philter, we added features to: (1) make the algorithm and the de-identified data HIPAA compliant, which also implies type 2 error-free redaction, as certified via external audit; (2) reduce over-redaction errors; and (3) normalize and shift date PHI. We also established a streamlined de-identification pipeline using MongoDB to automatically extract clinical notes and provide truly de-identified notes to researchers with periodic monthly refreshes at our institution.
Results: To the best of our knowledge, the Philter V1.0 pipeline is currently the first and only certified, de-identified redaction pipeline that makes clinical notes available to researchers for nonhuman subjects' research, without further IRB approval needed. To date, we have made over 130 million certified de-identified clinical notes available to over 600 UCSF researchers. These notes were collected over the past 40 years, and represent data from 2,757,016 UCSF patients.
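The abstract mentions normalizing and shifting date PHI but does not say how this is done. One common approach is to move every date of a patient by the same pseudo-random offset, so that intervals between events are preserved. The sketch below illustrates that idea only: it is not Philter's implementation, and the function names, the keyed-hash derivation, the `secret` value, and the 365-day bound are all assumptions.

```python
import hashlib
from datetime import date, timedelta


def patient_offset_days(patient_id: str, secret: str, max_shift: int = 365) -> int:
    """Derive a stable, non-zero backward shift in [-max_shift, -1] from a keyed hash."""
    digest = hashlib.sha256(f"{secret}:{patient_id}".encode("utf-8")).digest()
    return -(int.from_bytes(digest[:4], "big") % max_shift + 1)


def shift_date(original: date, patient_id: str, secret: str) -> date:
    """Apply the patient's fixed offset so that within-patient intervals are preserved."""
    return original + timedelta(days=patient_offset_days(patient_id, secret))


# Hypothetical patient and key, for illustration only.
admitted = date(2020, 3, 1)
discharged = date(2020, 3, 5)
shifted_admitted = shift_date(admitted, "patient-001", "example-secret")
shifted_discharged = shift_date(discharged, "patient-001", "example-secret")
length_of_stay = (shifted_discharged - shifted_admitted).days  # unchanged by the shift
```

Deriving the offset from a secret rather than storing it per patient is one possible design choice; it keeps repeated runs consistent, at the cost of having to protect the secret.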
Affiliation(s)
- Lakshmi Radhakrishnan
  - Corresponding Author: Lakshmi Radhakrishnan, MS, Academic Research Services, Information Technology, University of California, San Francisco, UCSF Mission Bay, 490 Illinois St, Floor 2, Box 2933, San Francisco, CA 94143, USA
- Gundolf Schenk
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA
- Kathleen Muenzen
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA
- Boris Oskotsky
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA
- Habibeh Ashouri Choshali
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA
- Sharat Israni
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA
- Atul J Butte
  - Bakar Computational Health Sciences Institute, University of California, San Francisco, San Francisco, California, USA
  - Department of Pediatrics, University of California, San Francisco, San Francisco, California, USA
  - Center for Data-Driven Insights and Innovation, University of California Health, Oakland, California, USA
4
Pilgram L, Schäffner E, Eckardt KU, Prasser F. Utility-Preserving Anonymization in a Real-World Scenario: Evidence from the German Chronic Kidney Disease (GCKD) Study. Stud Health Technol Inform 2023; 302:28-32. [PMID: 37203603 DOI: 10.3233/shti230058] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/20/2023]
Abstract
Data sharing provides benefits in terms of transparency and innovation. Privacy concerns in this context can be addressed by anonymization techniques. In our study, we evaluated anonymization approaches which transform structured data in a real-world scenario of a chronic kidney disease cohort study and checked for replicability of research results via 95% CI overlap in two differently anonymized datasets with different protection degrees. Calculated 95% CI overlapped in both applied anonymization approaches and visual comparison presented similar results. Thus, in our use case scenario, research results were not relevantly impacted by anonymization, which adds to the growing evidence of utility-preserving anonymization techniques.
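The replicability check described here compares 95% confidence intervals between differently anonymized datasets. The sketch below shows what such an overlap test can look like, here between an original and a generalized version of one variable. It assumes a normal-approximation interval for a mean; the study's actual estimators are not given in the abstract, and the function names and sample values are hypothetical.

```python
import math
from statistics import mean, stdev
from typing import Sequence, Tuple


def ci95(values: Sequence[float]) -> Tuple[float, float]:
    """Normal-approximation 95% confidence interval for the mean (an assumption)."""
    half_width = 1.96 * stdev(values) / math.sqrt(len(values))
    centre = mean(values)
    return centre - half_width, centre + half_width


def intervals_overlap(a: Tuple[float, float], b: Tuple[float, float]) -> bool:
    """True when the two closed intervals share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up values: the second list mimics generalization to steps of five.
original = [61.0, 58.5, 70.2, 65.3, 59.8, 63.1]
anonymized = [60.0, 60.0, 70.0, 65.0, 60.0, 65.0]
replicated = intervals_overlap(ci95(original), ci95(anonymized))
```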
Affiliation(s)
- Lisa Pilgram
  - Department of Nephrology and Medical Intensive Care, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Junior Digital Clinician Scientist Program, Berlin, Germany
- Elke Schäffner
  - Institute of Public Health, Charité - Universitätsmedizin Berlin, Berlin, Germany
- Kai-Uwe Eckardt
  - Department of Nephrology and Medical Intensive Care, Charité - Universitätsmedizin Berlin, Berlin, Germany
  - Department of Nephrology and Hypertension, Friedrich-Alexander Universität Erlangen-Nürnberg, Erlangen, Germany
- Fabian Prasser
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, BIH Biomedical Innovation Academy, BIH Charité Junior Digital Clinician Scientist Program, Berlin, Germany
5
Willmott C, Bryant J. Genomics is here: what can we do with it, and what ethical issues has it brought along for the ride? New Bioeth 2023; 29:1-9. [PMID: 36871201 DOI: 10.1080/20502877.2023.2180839] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/06/2023]
Affiliation(s)
- Chris Willmott
  - Department of Molecular and Cell Biology, University of Leicester, UK
6
Tichopád A, Augustynek M, Beneš J, Dlouhý M, Doležal T, Horáková D, Kršek M, Lhotska L, Panzner P, Penhaker M, Petr M, Piťha J, Popesko B, Rožánek M, Táborský M, Vrablík M. The way to data: opinions and recommendations for the provision of health data for secondary use. Cas Lek Cesk 2023; 162:61-66. [PMID: 37474288] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/22/2023]
Abstract
Healthcare data held by state-run organisations is a valuable intangible asset for society. Its use should be a priority for its administrators and the state. A completely paternalistic approach by administrators and the state is undesirable, however much it aims to protect the privacy rights of persons registered in databases. In line with European policies and the global trend, these measures should not outweigh the social benefit that arises from the analysis of these data if the technical possibilities exist to sufficiently protect the privacy rights of individuals. Czech society is having an intense discussion on the topic, but according to the authors, it is insufficiently based on facts and lacks clearly articulated opinions of the expert public. The aim of this article is to fill these gaps. Data anonymization techniques provide a solution to protect individuals' privacy rights while preserving the scientific value of the data. The risk of identifying individuals in anonymised data sets is scalable and can be minimised depending on the type and content of the data and its use by the specific applicant. Finding the optimal form and scope of deidentified data requires competence and knowledge on the part of both the applicant and the administrator. It is in the interest of the applicant, the administrator, as well as the protected persons in the databases that both parties show willingness and have the ability and expertise to communicate during the application and its processing.
7
Sepas A, Bangash AH, Alraoui O, El Emam K, El-Hussuna A. Algorithms to anonymize structured medical and healthcare data: A systematic review. Front Bioinform 2022; 2:984807. [PMID: 36619476 PMCID: PMC9815524 DOI: 10.3389/fbinf.2022.984807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022]
Abstract
Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird's eye view of algorithms for SMHD anonymization. Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes. Results: Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm; e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes. Conclusion: This study reviewed and categorized different anonymization approaches for MHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches. Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].
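As a concrete illustration of k-anonymity as named in this abstract: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The sketch below computes the k a table actually achieves. The function name, column names, and records are hypothetical, and the reviewed tools use far more elaborate generalization and search strategies.

```python
from collections import Counter
from typing import Dict, List, Sequence


def k_of(records: List[Dict[str, str]], quasi_identifiers: Sequence[str]) -> int:
    """Size of the smallest equivalence class over the chosen quasi-identifiers."""
    classes = Counter(tuple(record[q] for q in quasi_identifiers) for record in records)
    return min(classes.values()) if classes else 0


# Made-up, already generalized demographic data with a diagnosis code column.
records = [
    {"age": "30-39", "zip": "841**", "dx": "E11"},
    {"age": "30-39", "zip": "841**", "dx": "I10"},
    {"age": "40-49", "zip": "842**", "dx": "E11"},
    {"age": "40-49", "zip": "842**", "dx": "N18"},
]
k_demographics = k_of(records, ["age", "zip"])
k_with_diagnosis = k_of(records, ["age", "zip", "dx"])
```

In this toy table the demographic columns alone give k = 2, while treating the diagnosis code as a further quasi-identifier drops k to 1, which hints at why diagnosis codes need dedicated handling.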
Affiliation(s)
- Ali Sepas
  - Open Source Research Collaboration, Aalborg, Denmark
  - Department of Materials and Production, Aalborg University, Aalborg, Denmark
- Ali Haider Bangash
  - Open Source Research Collaboration, Aalborg, Denmark
  - STMU Shifa College of Medicine, Islamabad, Pakistan
- Omar Alraoui
  - Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- Khaled El Emam
  - Canada Research Chair in Medical AI, University of Ottawa, Ottawa, ON, Canada
8
Bashir SR, Raza S, Kocaman V, Qamar U. Clinical Application of Detecting COVID-19 Risks: A Natural Language Processing Approach. Viruses 2022; 14:v14122761. [PMID: 36560764 PMCID: PMC9781729 DOI: 10.3390/v14122761] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/28/2022] [Accepted: 12/08/2022] [Indexed: 12/14/2022]
Abstract
The clinical application of detecting COVID-19 factors is a challenging task. The existing named entity recognition models are usually trained on a limited set of named entities. Besides clinical, the non-clinical factors, such as social determinant of health (SDoH), are also important to study the infectious disease. In this paper, we propose a generalizable machine learning approach that improves on previous efforts by recognizing a large number of clinical risk factors and SDoH. The novelty of the proposed method lies in the subtle combination of a number of deep neural networks, including the BiLSTM-CNN-CRF method and a transformer-based embedding layer. Experimental results on a cohort of COVID-19 data prepared from PubMed articles show the superiority of the proposed approach. When compared to other methods, the proposed approach achieves a performance gain of about 1-5% in terms of macro- and micro-average F1 scores. Clinical practitioners and researchers can use this approach to obtain accurate information regarding clinical risks and SDoH factors, and use this pipeline as a tool to end the pandemic or to prepare for future pandemics.
Affiliation(s)
- Syed Raza Bashir
  - Department of Computer Science, Toronto Metropolitan University, Toronto, ON M5B 2K3, Canada
- Shaina Raza
  - Dalla Lana School of Public Health, University of Toronto, Toronto, ON M5T 3M7, Canada
- Urooj Qamar
  - Institute of Business & Information Technology, University of the Punjab, Lahore 54590, Pakistan
9
Wang P, Li Y, Yang L, Li S, Li L, Zhao Z, Long S, Wang F, Wang H, Li Y, Wang C. An Efficient Method for Deidentifying Protected Health Information in Chinese Electronic Health Records: Algorithm Development and Validation. JMIR Med Inform 2022; 10:e38154. [PMID: 36040774 PMCID: PMC9472063 DOI: 10.2196/38154] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/21/2022] [Revised: 07/19/2022] [Accepted: 07/31/2022] [Indexed: 11/13/2022]
Abstract
Background: With the popularization of electronic health records in China, the utilization of digitalized data has great potential for the development of real-world medical research. However, the data usually contains a great deal of protected health information and the direct usage of this data may cause privacy issues. The task of deidentifying protected health information in electronic health records can be regarded as a named entity recognition problem. Existing rule-based, machine learning–based, or deep learning–based methods have been proposed to solve this problem. However, these methods still face the difficulties of insufficient Chinese electronic health record data and the complex features of the Chinese language.
Objective: This paper proposes a method to overcome the difficulties of overfitting and a lack of training data for deep neural networks to enable Chinese protected health information deidentification.
Methods: We propose a new model that merges TinyBERT (bidirectional encoder representations from transformers) as a text feature extraction module and the conditional random field method as a prediction module for deidentifying protected health information in Chinese medical electronic health records. In addition, a hybrid data augmentation method that integrates a sentence generation strategy and a mention-replacement strategy is proposed for overcoming insufficient Chinese electronic health records.
Results: We compare our method with 5 baseline methods that utilize different BERT models as their feature extraction modules. Experimental results on the Chinese electronic health records that we collected demonstrate that our method had better performance (microprecision: 98.7%, microrecall: 99.13%, and micro-F1 score: 98.91%) and higher efficiency (40% faster) than all the BERT-based baseline methods.
Conclusions: Compared to baseline methods, the efficiency advantage of TinyBERT on our proposed augmented data set was kept while the performance improved for the task of Chinese protected health information deidentification.
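The mention-replacement strategy named in the Methods can be pictured as swapping each annotated PHI mention for another surrogate of the same type while keeping the labels intact. The sketch below is a simplified, token-level illustration of that idea, not the authors' method: the function name, the single-token mentions, the label set, and the surrogate lexicon are all hypothetical.

```python
import random
from typing import Dict, List, Tuple

# Hypothetical representation: (text, label), where "O" marks non-PHI tokens.
Token = Tuple[str, str]


def replace_mentions(sentence: List[Token], lexicon: Dict[str, List[str]],
                     rng: random.Random) -> List[Token]:
    """Swap each labelled PHI token for a random surrogate of the same type."""
    augmented = []
    for text, label in sentence:
        if label != "O" and lexicon.get(label):
            text = rng.choice(lexicon[label])
        augmented.append((text, label))
    return augmented


# Made-up surrogate lists and sentence, for illustration only.
rng = random.Random(0)
lexicon = {"NAME": ["Alice", "Bob"], "HOSPITAL": ["Central", "Northside"]}
sentence = [("Patient", "O"), ("Carol", "NAME"), ("visited", "O"), ("Riverside", "HOSPITAL")]
augmented = replace_mentions(sentence, lexicon, rng)
```

Because the labels are carried over unchanged, each augmented sentence can be added to the training set without further annotation.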
Affiliation(s)
- Peng Wang
  - College of Computer Science, Chongqing University, Chongqing, China
- Yong Li
  - School of Computer Science, South China Normal University, Guangzhou, China
- Liang Yang
  - Yidu Cloud Technology Inc, Beijing, China
- Simin Li
  - Yidu Cloud Technology Inc, Beijing, China
- Linfeng Li
  - Yidu Cloud Technology Inc, Beijing, China
- Zehan Zhao
  - School of Software & Microelectronics, Peking University, Beijing, China
- Shaopei Long
  - School of Computer Science, South China Normal University, Guangzhou, China
- Fei Wang
  - Medical Big Data Center of Southwest Hospital, Chongqing, China
- Hongqian Wang
  - Medical Big Data Center of Southwest Hospital, Chongqing, China
- Ying Li
  - Medical Big Data Center of Southwest Hospital, Chongqing, China
- Chengliang Wang
  - College of Computer Science, Chongqing University, Chongqing, China
10
Shahid A, Bazargani MH, Banahan P, Mac Namee B, Kechadi T, Treacy C, Regan G, MacMahon P. A Two-Stage De-Identification Process for Privacy-Preserving Medical Image Analysis. Healthcare (Basel) 2022; 10:755. [PMID: 35627892 PMCID: PMC9141493 DOI: 10.3390/healthcare10050755] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/22/2022] [Revised: 04/12/2022] [Accepted: 04/14/2022] [Indexed: 11/17/2022]
Abstract
Identification and re-identification are two major security and privacy threats to medical imaging data. De-identification in DICOM medical data is essential to preserve the privacy of patients' Personally Identifiable Information (PII) and requires a systematic approach. However, there is a lack of sufficient detail regarding the de-identification process of DICOM attributes, for example, what needs to be considered before removing a DICOM attribute. In this paper, we first highlight and review the key challenges in the medical image data de-identification process. We then develop a two-stage de-identification process for CT scan images available in DICOM file format. In the first stage of the de-identification process, the patient's PII, including name and date of birth, is removed at the hospital facility using the export process available in their Picture Archiving and Communication System (PACS). The second stage employs the proposed DICOM de-identification tool for an exhaustive attribute-level investigation to further de-identify and ensure that all PII has been removed. Finally, we provide a roadmap for future considerations to build a semi-automated or automated tool for de-identifying DICOM datasets.
Affiliation(s)
- Arsalan Shahid
  - School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
- Mehran H. Bazargani
  - School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
- Paul Banahan
  - Department of Radiology, Mater Misericordiae University Hospital, D07 R2WY Dublin, Ireland
- Brian Mac Namee
  - School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
- Tahar Kechadi
  - School of Computer Science, University College Dublin, D04 V1W8 Dublin, Ireland
- Ceara Treacy
  - Regulated Software Research Centre, Dundalk Institute of Technology, A91 K584 Dundalk, Ireland
- Gilbert Regan
  - Regulated Software Research Centre, Dundalk Institute of Technology, A91 K584 Dundalk, Ireland
- Peter MacMahon
  - Department of Radiology, Mater Misericordiae University Hospital, D07 R2WY Dublin, Ireland
11
Chomutare T. Clinical Notes De-Identification: Scoping Recent Benchmarks for n2c2 Datasets. Stud Health Technol Inform 2022; 289:293-296. [PMID: 35062150 DOI: 10.3233/shti210917] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/14/2023]
Abstract
Publicly shared repositories play an important role in advancing performance benchmarks for some of the most important tasks in natural language processing (NLP) and healthcare in general. This study reviews most recent benchmarks based on the 2014 n2c2 de-identification dataset. Pre-processing challenges were uncovered, and attention brought to the discrepancies in reported number of Protected Health Information (PHI) entities among the studies. Improved reporting is required for greater transparency and reproducibility.
12
Brakel LAW. Self-constitution and "Infrastructural" Change: An Interdisciplinary Account of Psychoanalytic Action. Am J Psychoanal 2022; 82:618-630. [PMID: 36470990 PMCID: PMC9734568 DOI: 10.1057/s11231-022-09383-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/12/2022]
Abstract
Beyond revealing unconscious pathological identifications and traits (including their past usefulness but current toxicity), what techniques in our psychoanalytic practice can lead to change? Radically different from mainstream philosophical views advocating that such undesirable self-aspects should not be endorsed as Self, psychoanalysts hold that these negative traits must instead be understood as part of one's Self. But then what? Investigating concepts from classical conditioning, neuroscience, the philosophy of mind and action, and psychoanalytic practice itself, this article will suggest a preliminary account of the mechanism of action of psychoanalytic work after insight.
13
Lee K, Kayaalp M, Henry S, Uzuner Ö. A Context-Enhanced De-identification System. ACM Trans Comput Healthc 2021; 3. [PMID: 34676376 DOI: 10.1145/3470980] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/20/2022]
Abstract
Many modern entity recognition systems, including the current state-of-the-art de-identification systems, are based on bidirectional long short-term memory (biLSTM) units augmented by a conditional random field (CRF) sequence optimizer. These systems process the input sentence by sentence. This approach prevents the systems from capturing dependencies over sentence boundaries and makes accurate sentence boundary detection a prerequisite. Since sentence boundary detection can be problematic especially in clinical reports, where dependencies and co-references across sentence boundaries are abundant, these systems have clear limitations. In this study, we built a new system on the framework of one of the current state-of-the-art de-identification systems, NeuroNER, to overcome these limitations. This new system incorporates context embeddings through forward and backward n-grams without using sentence boundaries. Our context-enhanced de-identification (CEDI) system captures dependencies over sentence boundaries and bypasses the sentence boundary detection problem altogether. We enhanced this system with deep affix features and an attention mechanism to capture the pertinent parts of the input. The CEDI system outperforms NeuroNER on the 2006 i2b2 de-identification challenge dataset, the 2014 i2b2 shared task de-identification dataset, and the 2016 CEGS N-GRID de-identification dataset (p < 0.01). All datasets comprise narrative clinical reports in English but contain different note types varying from discharge summaries to psychiatric notes. Enhancing CEDI with deep affix features and the attention mechanism further increased performance.
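Taking context from forward and backward n-grams without sentence boundaries amounts to slicing a fixed number of tokens on either side of the target from the whole document. The sketch below shows only that window extraction; CEDI itself goes on to embed the context, which is not reproduced here. The function name and the sample tokens are hypothetical.

```python
from typing import List, Tuple


def context_windows(tokens: List[str], index: int, n: int) -> Tuple[List[str], List[str]]:
    """Return the n tokens before and the n tokens after position `index`.

    The slices run over the whole document, so they cross sentence
    boundaries freely and need no sentence splitter.
    """
    backward = tokens[max(0, index - n):index]
    forward = tokens[index + 1:index + 1 + n]
    return backward, forward


# Made-up note fragment; the target "Smith" sits at a sentence boundary.
tokens = ["Seen", "by", "Dr", ".", "Smith", ".", "He", "advised", "rest"]
backward, forward = context_windows(tokens, tokens.index("Smith"), 3)
```

In this example the forward window reaches into the next sentence ("He advised"), which a sentence-by-sentence tagger would not see.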
Affiliation(s)
- Kahyun Lee
  - George Mason University, Fairfax, VA, USA
- Sam Henry
  - George Mason University, Fairfax, VA, USA
|
14
|
Meurers T, Bild R, Do KM, Prasser F. A scalable software solution for anonymizing high-dimensional biomedical data. Gigascience 2021; 10:giab068. [PMID: 34605868 PMCID: PMC8489190 DOI: 10.1093/gigascience/giab068] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 09/30/2020] [Revised: 07/19/2021] [Accepted: 09/09/2021] [Indexed: 11/17/2022]
Abstract
BACKGROUND Data anonymization is an important building block for ensuring privacy and fosters the reuse of data. However, transforming the data in a way that preserves the privacy of subjects while maintaining a high degree of data quality is challenging and particularly difficult when processing complex datasets that contain a high number of attributes. In this article we present how we extended the open source software ARX to improve its support for high-dimensional, biomedical datasets. FINDINGS For improving ARX's capability to find optimal transformations when processing high-dimensional data, we implement 2 novel search algorithms. The first is a greedy top-down approach and is oriented on a formerly implemented bottom-up search. The second is based on a genetic algorithm. We evaluated the algorithms with different datasets, transformation methods, and privacy models. The novel algorithms mostly outperformed the previously implemented bottom-up search. In addition, we extended the GUI to provide a high degree of usability and performance when working with high-dimensional datasets. CONCLUSION With our additions we have significantly enhanced ARX's ability to handle high-dimensional data in terms of processing performance as well as usability and thus can further facilitate data sharing.
Affiliation(s)
- Thierry Meurers
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Medical Informatics, Charitéplatz 1, 10117 Berlin, Germany
- Raffael Bild
  - School of Medicine, Technical University of Munich, Ismaninger Str. 22, 81675 Munich, Germany
- Kieu-Mi Do
  - Faculty of Informatics, Technical University of Munich, Boltzmannstr. 3, 85748 Garching, Germany
- Fabian Prasser
  - Berlin Institute of Health at Charité–Universitätsmedizin Berlin, Medical Informatics, Charitéplatz 1, 10117 Berlin, Germany
|
15
|
Liao S, Kiros J, Chen J, Zhang Z, Chen T. Improving domain adaptation in de-identification of electronic health records through self-training. J Am Med Inform Assoc 2021; 28:2093-2100. [PMID: 34363664 PMCID: PMC8449604 DOI: 10.1093/jamia/ocab128] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2021] [Revised: 07/01/2021] [Accepted: 07/04/2021] [Indexed: 11/13/2022]
Abstract
OBJECTIVE De-identification is a fundamental task in electronic health records to remove protected health information entities. Deep learning models have proven to be promising tools to automate de-identification processes. However, when the target domain (where the model is applied) is different from the source domain (where the model is trained), the model often suffers a significant performance drop, commonly referred to as the domain adaptation issue. In de-identification, domain adaptation issues can make the model vulnerable for deployment. In this work, we aim to close the domain gap by leveraging unlabeled data from the target domain. MATERIALS AND METHODS We introduce a self-training framework to address the domain adaptation issue by leveraging unlabeled data from the target domain. We validate the effectiveness on 4 standard de-identification datasets. In each experiment, we use a pair of datasets: labeled data from the source domain and unlabeled data from the target domain. We compare the proposed self-training framework with supervised learning that directly deploys the model trained on the source domain. RESULTS In summary, our proposed framework improves the F1-score by 5.38 (on average) when compared with direct deployment. For example, using i2b2-2014 as the training dataset and i2b2-2006 as the test, the proposed framework increases the F1-score from 76.61 to 85.41 (+8.8). The method also increases the F1-score by 10.86 for mimic-radiology and mimic-discharge. CONCLUSION Our work demonstrates an effective self-training framework to boost the domain adaptation performance for the de-identification task for electronic health records.
Affiliation(s)
- Shun Liao
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Ontario, Canada
- Zhaolei Zhang
  - Department of Computer Science, University of Toronto, Toronto, Ontario, Canada
  - Donnelly Centre for Cellular and Biomolecular Research, University of Toronto, Ontario, Canada
|
16
|
Lee B, Dupervil B, Deputy NP, Duck W, Soroka S, Bottichio L, Silk B, Price J, Sweeney P, Fuld J, Weber JT, Pollock D. Protecting Privacy and Transforming COVID-19 Case Surveillance Datasets for Public Use. Public Health Rep 2021; 136:554-561. [PMID: 34139910 PMCID: PMC8216038 DOI: 10.1177/00333549211026817] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Indexed: 11/23/2022]
Abstract
Objectives Federal open-data initiatives that promote increased sharing of federally collected data are important for transparency, data quality, trust, and relationships with the public and state, tribal, local, and territorial partners. These initiatives advance understanding of health conditions and diseases by providing data to researchers, scientists, and policymakers for analysis, collaboration, and use outside the Centers for Disease Control and Prevention (CDC), particularly for emerging conditions such as COVID-19, for which data needs are constantly evolving. Since the beginning of the pandemic, CDC has collected person-level, de-identified data from jurisdictions and currently has more than 8 million records. We describe how CDC designed and produces 2 de-identified public datasets from these collected data. Methods We included data elements based on usefulness, public request, and privacy implications; we suppressed some field values to reduce the risk of re-identification and exposure of confidential information. We created datasets and verified them for privacy and confidentiality by using data management platform analytic tools and R scripts. Results Unrestricted data are available to the public through Data.CDC.gov, and restricted data, with additional fields, are available with a data-use agreement through a private repository on GitHub.com. Practice Implications Enriched understanding of the available public data, the methods used to create these data, and the algorithms used to protect the privacy of de-identified people allow for improved data use. Automating data-generation procedures improves the volume and timeliness of sharing data.
Affiliation(s)
- Brian Lee
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - Office of the Chief Operations Officer, Office of the Chief Information Officer, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - Brian Lee, MPH, Centers for Disease Control and Prevention, COVID-19 Response, 1600 Clifton Rd NE, MS TW-2, Atlanta, GA 30329, USA;
- Brandi Dupervil
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Nicholas P. Deputy
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for Birth Defects and Developmental Disabilities, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - US Public Health Service, Rockville, MD, USA
- Wil Duck
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - Center for Surveillance, Epidemiology, and Laboratory Services, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Stephen Soroka
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Lyndsay Bottichio
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Benjamin Silk
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - US Public Health Service, Rockville, MD, USA
  - National Center for Immunization and Respiratory Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Jason Price
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Patricia Sweeney
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for HIV/AIDS, Viral Hepatitis, STD, and TB Prevention, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Jennifer Fuld
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - Office of the Associate Director for Policy and Strategy, Centers for Disease Control and Prevention, Atlanta, GA, USA
- J. Todd Weber
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
- Dan Pollock
  - COVID-19 Response, Centers for Disease Control and Prevention, Atlanta, GA, USA
  - National Center for Emerging and Zoonotic Infectious Diseases, Centers for Disease Control and Prevention, Atlanta, GA, USA
|
17
|
Murugadoss K, Rajasekharan A, Malin B, Agarwal V, Bade S, Anderson JR, Ross JL, Faubion WA, Halamka JD, Soundararajan V, Ardhanari S. Building a best-in-class automated de-identification tool for electronic health records through ensemble learning. Patterns (N Y) 2021; 2:100255. [PMID: 34179842 PMCID: PMC8212138 DOI: 10.1016/j.patter.2021.100255] [Citation(s) in RCA: 10] [Impact Index Per Article: 3.3] [Received: 01/06/2021] [Revised: 02/24/2021] [Accepted: 04/07/2021] [Indexed: 10/29/2022]
Abstract
The presence of personally identifiable information (PII) in natural language portions of electronic health records (EHRs) constrains their broad reuse. Despite continuous improvements in automated detection of PII, residual identifiers require manual validation and correction. Here, we describe an automated de-identification system that employs an ensemble architecture, incorporating attention-based deep-learning models and rule-based methods, supported by heuristics for detecting PII in EHR data. Detected identifiers are then transformed into plausible, though fictional, surrogates to further obfuscate any leaked identifier. Our approach outperforms existing tools, with a recall of 0.992 and precision of 0.979 on the i2b2 2014 dataset and a recall of 0.994 and precision of 0.967 on a dataset of 10,000 notes from the Mayo Clinic. The de-identification system presented here enables the generation of de-identified patient data at the scale required for modern machine-learning applications to help accelerate medical discoveries.
Affiliation(s)
- Bradley Malin
  - Vanderbilt University Medical Center, Nashville, TN 37232, USA
- Jeff R. Anderson
  - Mayo Clinic, Rochester, MN 55905, USA
  - Mayo Clinic Platform, Rochester, MN 55905, USA
- John D. Halamka
  - Mayo Clinic, Rochester, MN 55905, USA
  - Mayo Clinic Platform, Rochester, MN 55905, USA
|
18
|
Farzanehfar A, Houssiau F, de Montjoye YA. The risk of re-identification remains high even in country-scale location datasets. Patterns (N Y) 2021; 2:100204. [PMID: 33748793 PMCID: PMC7961185 DOI: 10.1016/j.patter.2021.100204] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 08/24/2020] [Revised: 11/27/2020] [Accepted: 01/07/2021] [Indexed: 11/30/2022]
Abstract
Although anonymous data are not considered personal data, recent research has shown how individuals can often be re-identified. Scholars have argued that previous findings apply only to small-scale datasets and that privacy is preserved in large-scale datasets. Using 3 months of location data, we (1) show the risk of re-identification to decrease slowly with dataset size, (2) approximate this decrease with a simple model taking into account three population-wide marginal distributions, and (3) prove that unicity is convex and obtain a linear lower bound. Our estimates show that 93% of people would be uniquely identified in a dataset of 60M people using four points of auxiliary information, with a lower bound at 22%. This lower bound increases to 87% when five points are available. Taken together, our results show how the privacy of individuals is very unlikely to be preserved even in country-scale location datasets.
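The unicity measure named in this abstract can be illustrated with a minimal sketch. It assumes the common definition of unicity (the fraction of individuals whose trace is the only one containing p points drawn at random from it). The function name `estimate_unicity`, the toy traces, and the sampling scheme are hypothetical and are not taken from the paper.

```python
import random


def estimate_unicity(traces, p, seed=0):
    """Estimate unicity: the share of users singled out by p points of their own trace.

    traces: dict mapping a user id to a set of (location, hour) points.
    Illustrative sketch under the assumptions stated above, not the authors' code.
    """
    rng = random.Random(seed)
    unique = 0
    for user, points in traces.items():
        # Draw p auxiliary points that an adversary might know about this user.
        known = set(rng.sample(sorted(points), min(p, len(points))))
        # Count every trace (including the user's own) that contains those points.
        matches = sum(1 for other in traces.values() if known <= other)
        if matches == 1:
            unique += 1
    return unique / len(traces)


# Hypothetical toy data: users "a" and "b" share a trace, "c" does not.
toy_traces = {
    "a": {("cell_1", 8), ("cell_2", 9)},
    "b": {("cell_1", 8), ("cell_2", 9)},
    "c": {("cell_3", 8), ("cell_4", 9)},
}
print(estimate_unicity(toy_traces, p=1))
```

In this toy example only user "c" can be singled out, so the estimate is 1/3 whichever points are sampled. Real location datasets would need far larger samples and a time and space resolution chosen by the analyst.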
Affiliation(s)
- Ali Farzanehfar
  - Department of Computing, Imperial College London, London SW7 2AZ, UK
|
19
|
Theyers AE, Zamyadi M, O'Reilly M, Bartha R, Symons S, MacQueen GM, Hassel S, Lerch JP, Anagnostou E, Lam RW, Frey BN, Milev R, Müller DJ, Kennedy SH, Scott CJM, Strother SC, Arnott SR. Multisite Comparison of MRI Defacing Software Across Multiple Cohorts. Front Psychiatry 2021; 12:617997. [PMID: 33716819 PMCID: PMC7943842 DOI: 10.3389/fpsyt.2021.617997] [Citation(s) in RCA: 22] [Impact Index Per Article: 7.3] [Received: 10/19/2020] [Accepted: 02/03/2021] [Indexed: 01/26/2023]
Abstract
With improvements to both scan quality and facial recognition software, there is an increased risk of participants being identified by a 3D render of their structural neuroimaging scans, even when all other personal information has been removed. To prevent this, facial features should be removed before data are shared or openly released, but while there are several publicly available software algorithms to do this, there has been no comprehensive review of their accuracy within the general population. To address this, we tested multiple algorithms on 300 scans from three neuroscience research projects, funded in part by the Ontario Brain Institute, to cover a wide range of ages (3-85 years) and multiple patient cohorts. While skull stripping is more thorough at removing identifiable features, we focused mainly on defacing software, as skull stripping also removes potentially useful information, which may be required for future analyses. We tested six publicly available algorithms (afni_refacer, deepdefacer, mri_deface, mridefacer, pydeface, quickshear), with one skull stripper (FreeSurfer) included for comparison. Accuracy was measured through a pass/fail system with two criteria; one, that all facial features had been removed and two, that no brain tissue was removed in the process. A subset of defaced scans were also run through several preprocessing pipelines to ensure that none of the algorithms would alter the resulting outputs. We found that the success rates varied strongly between defacers, with afni_refacer (89%) and pydeface (83%) having the highest rates, overall. In both cases, the primary source of failure came from a single dataset that the defacer appeared to struggle with - the youngest cohort (3-20 years) for afni_refacer and the oldest (44-85 years) for pydeface, demonstrating that defacer performance not only depends on the data provided, but that this effect varies between algorithms. 
While there were some very minor differences between the preprocessing results for defaced and original scans, none of these were significant and were within the range of variation between using different NIfTI converters, or using raw DICOM files.
Affiliation(s)
- Athena E Theyers
  - Rotman Research Institute, Baycrest Health Sciences Centre, Toronto, ON, Canada
- Mojdeh Zamyadi
  - Rotman Research Institute, Baycrest Health Sciences Centre, Toronto, ON, Canada
- Robert Bartha
  - Department of Medical Biophysics, Robarts Research Institute, Western University, London, ON, Canada
- Sean Symons
  - Department of Medical Imaging, Sunnybrook Health Sciences Centre, Toronto, ON, Canada
- Glenda M MacQueen
  - Department of Psychiatry, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Stefanie Hassel
  - Department of Psychiatry, Cumming School of Medicine, University of Calgary, Calgary, AB, Canada
- Jason P Lerch
  - Mouse Imaging Centre, Hospital for Sick Children, Toronto, ON, Canada
- Evdokia Anagnostou
  - Bloorview Research Institute, Holland Bloorview Kids Rehabilitation Hospital, Toronto, ON, Canada
- Raymond W Lam
  - Department of Psychiatry, University of British Columbia, Vancouver, BC, Canada
- Benicio N Frey
  - Department of Psychiatry and Behavioural Neurosciences, McMaster University, Hamilton, ON, Canada
  - Mood Disorders Program, St. Joseph's Healthcare, Hamilton, ON, Canada
- Roumen Milev
  - Departments of Psychiatry and Psychology, Queen's University, Providence Care Hospital, Kingston, ON, Canada
- Daniel J Müller
  - Molecular Brain Science, Campbell Family Mental Health Research Institute, Centre for Addiction and Mental Health, Toronto, ON, Canada
  - Department of Psychiatry, University of Toronto, Toronto, ON, Canada
- Sidney H Kennedy
  - Department of Psychiatry, University of Toronto, Toronto, ON, Canada
  - Department of Psychiatry, Krembil Research Centre, University Health Network, Toronto, ON, Canada
  - Department of Psychiatry, St. Michael's Hospital, University of Toronto, Toronto, ON, Canada
  - Keenan Research Centre for Biomedical Science, Li Ka Shing Knowledge Institute, St. Michael's Hospital, Toronto, ON, Canada
- Christopher J M Scott
  - LC Campbell Cognitive Neurology Research Unit, Toronto, ON, Canada
  - Heart & Stroke Foundation Centre for Stroke Recovery, Toronto, ON, Canada
  - Sunnybrook Health Sciences Centre, Brain Sciences Research Program, Sunnybrook Research Institute, Toronto, ON, Canada
- Stephen C Strother
  - Rotman Research Institute, Baycrest Health Sciences Centre, Toronto, ON, Canada
  - Department of Medical Biophysics, University of Toronto, Toronto, ON, Canada
- Stephen R Arnott
  - Rotman Research Institute, Baycrest Health Sciences Centre, Toronto, ON, Canada
|
20
|
Jeong YU, Yoo S, Kim YH, Shim WH. De-Identification of Facial Features in Magnetic Resonance Images: Software Development Using Deep Learning Technology. J Med Internet Res 2020; 22:e22739. [PMID: 33208302 PMCID: PMC7759440 DOI: 10.2196/22739] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.8] [Received: 07/22/2020] [Revised: 09/09/2020] [Accepted: 11/12/2020] [Indexed: 12/14/2022]
Abstract
Background High-resolution medical images that include facial regions can be used to recognize the subject’s face when reconstructing 3-dimensional (3D)-rendered images from 2-dimensional (2D) sequential images, which might constitute a risk of infringement of personal information when sharing data. According to the Health Insurance Portability and Accountability Act (HIPAA) privacy rules, full-face photographic images and any comparable image are direct identifiers and considered as protected health information. Moreover, the General Data Protection Regulation (GDPR) categorizes facial images as biometric data and stipulates that special restrictions should be placed on the processing of biometric data. Objective This study aimed to develop software that can remove the header information from Digital Imaging and Communications in Medicine (DICOM) format files and facial features (eyes, nose, and ears) at the 2D sliced-image level to anonymize personal information in medical images. Methods A total of 240 cranial magnetic resonance (MR) images were used to train the deep learning model (144, 48, and 48 for the training, validation, and test sets, respectively, from the Alzheimer's Disease Neuroimaging Initiative [ADNI] database). To overcome the small sample size problem, we used a data augmentation technique to create 576 images per epoch. We used attention-gated U-net for the basic structure of our deep learning model. To validate the performance of the software, we adapted an external test set comprising 100 cranial MR images from the Open Access Series of Imaging Studies (OASIS) database. Results The facial features (eyes, nose, and ears) were successfully detected and anonymized in both test sets (48 from ADNI and 100 from OASIS). Each result was manually validated in both the 2D image plane and the 3D-rendered images. Furthermore, the ADNI test set was verified using Microsoft Azure's face recognition artificial intelligence service. 
By adding a user interface, we developed and distributed (via GitHub) software named “Deface program” for medical images as an open-source project. Conclusions We developed deep learning–based software for the anonymization of MR images that distorts the eyes, nose, and ears to prevent facial identification of the subject in reconstructed 3D images. It could be used to share medical big data for secondary research while making both data providers and recipients compliant with the relevant privacy regulations.
Affiliation(s)
- Yeon Uk Jeong
  - Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Soyoung Yoo
  - Human Research Protection Center, Asan Institute of Life Sciences, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Young-Hak Kim
  - Division of Cardiology, Department of Internal Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
  - Department of Information Medicine, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
- Woo Hyun Shim
  - Department of Medical Science, Asan Medical Institute of Convergence Science and Technology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
  - Department of Radiology, Asan Medical Center, University of Ulsan College of Medicine, Seoul, Republic of Korea
|
21
|
Jeon S, Seo J, Kim S, Lee J, Kim JH, Sohn JW, Moon J, Joo HJ. Proposal and Assessment of a De-Identification Strategy to Enhance Anonymity of the Observational Medical Outcomes Partnership Common Data Model (OMOP-CDM) in a Public Cloud-Computing Environment: Anonymization of Medical Data Using Privacy Models. J Med Internet Res 2020; 22:e19597. [PMID: 33177037 PMCID: PMC7728527 DOI: 10.2196/19597] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 04/26/2020] [Revised: 07/29/2020] [Accepted: 11/11/2020] [Indexed: 02/01/2023]
Abstract
Background De-identifying personal information is critical when using personal health data for secondary research. The Observational Medical Outcomes Partnership Common Data Model (CDM), defined by the nonprofit organization Observational Health Data Sciences and Informatics, has been gaining attention for its use in the analysis of patient-level clinical data obtained from various medical institutions. When analyzing such data in a public environment such as a cloud-computing system, an appropriate de-identification strategy is required to protect patient privacy. Objective This study proposes and evaluates a de-identification strategy that is comprised of several rules along with privacy models such as k-anonymity, l-diversity, and t-closeness. The proposed strategy was evaluated using the actual CDM database. Methods The CDM database used in this study was constructed by the Anam Hospital of Korea University. Analysis and evaluation were performed using the ARX anonymizing framework in combination with the k-anonymity, l-diversity, and t-closeness privacy models. Results The CDM database, which was constructed according to the rules established by Observational Health Data Sciences and Informatics, exhibited a low risk of re-identification: The highest re-identifiable record rate (11.3%) in the dataset was exhibited by the DRUG_EXPOSURE table, with a re-identification success rate of 0.03%. However, because all tables include at least one “highest risk” value of 100%, suitable anonymizing techniques are required; moreover, the CDM database preserves the “source values” (raw data), a combination of which could increase the risk of re-identification. Therefore, this study proposes an enhanced strategy to de-identify the source values to significantly reduce not only the highest risk in the k-anonymity, l-diversity, and t-closeness privacy models but also the overall possibility of re-identification. 
Conclusions Our proposed de-identification strategy effectively enhanced the privacy of the CDM database, thereby encouraging clinical research involving multiple centers.
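The k-anonymity figures in this abstract can be made concrete with a small sketch. It assumes a prosecutor-style reading in which records sharing the same quasi-identifier values form an equivalence class, k is the smallest class size, and a record's re-identification risk is 1 divided by its class size. The function `risk_summary`, the column names, and the records are hypothetical; this is not the ARX API and not the procedure used in the study.

```python
from collections import Counter


def risk_summary(records, quasi_identifiers):
    """Summarize k-anonymity-style risk for a list of dict records.

    Illustrative only: the risk definition (1 / equivalence-class size)
    is an assumption stated in the accompanying text.
    """
    # Group records by their combination of quasi-identifier values.
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    k = min(class_sizes.values())
    unique = sum(1 for key in keys if class_sizes[key] == 1)
    return {
        "k": k,                                   # smallest equivalence class
        "highest_risk": 1 / k,                    # risk of the most exposed record
        "unique_record_rate": unique / len(keys),  # share of records in classes of size 1
    }


# Hypothetical records; "birth_year" and "sex" stand in for quasi-identifiers.
rows = [
    {"birth_year": 1950, "sex": "F", "drug": "A"},
    {"birth_year": 1950, "sex": "F", "drug": "B"},
    {"birth_year": 1950, "sex": "F", "drug": "C"},
    {"birth_year": 1972, "sex": "M", "drug": "A"},
]
summary = risk_summary(rows, ["birth_year", "sex"])
print(summary)
```

Under this assumed definition, a single record with a unique quasi-identifier combination is enough to push the highest risk to 100% even when most records sit in larger classes, which is one way to read the abstract's contrast between a low overall re-identification rate and a "highest risk" value of 100%.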
Affiliation(s)
- Seungho Jeon
  - Division of Information Security, Graduate School of Information Security, Korea University, Seoul, Republic of Korea
- Jeongeun Seo
  - Division of Information Security, Graduate School of Information Security, Korea University, Seoul, Republic of Korea
- Sukyoung Kim
  - Division of Information Security, Graduate School of Information Security, Korea University, Seoul, Republic of Korea
- Jeongmoon Lee
  - Korea University Research Institute for Medical Bigdata Science, Korea University, Seoul, Republic of Korea
- Jong-Ho Kim
  - Department of Cardiology, Cardiovascular Center, Korea University, Seoul, Republic of Korea
- Jang Wook Sohn
  - Division of Infectious Diseases, Department of Internal Medicine, College of Medicine, Korea University, Seoul, Republic of Korea
- Jongsub Moon
  - Division of Information Security, Graduate School of Information Security, Korea University, Seoul, Republic of Korea
- Hyung Joon Joo
  - Department of Internal Medicine, Korea University College of Medicine, Korea University, Seoul, Republic of Korea
|
22
|
Abstract
Making data Findable, Accessible, Interoperable and Reusable (FAIR) is a good approach when data needs to be shared. However, security and privacy are still critical aspects. In the FAIRification process, there is a need both for de-identification of data and for license attribution. The paper analyses some of the issues related to this process when the objective is sharing genomic information. The main results are the identification of the already existing standards that could be used for this purpose and how to combine them. Nevertheless, the area is quickly evolving and more specific standards could be specified.
Affiliation(s)
- Jaime Delgado
  - Information Modeling and Processing (IMP) group - DMAG, Computer Architecture Dept. (DAC), Universitat Politècnica de Catalunya (UPC BarcelonaTECH)
- Silvia Llorente
  - Information Modeling and Processing (IMP) group - DMAG, Computer Architecture Dept. (DAC), Universitat Politècnica de Catalunya (UPC BarcelonaTECH)
|
23
|
El Emam K, Mosquera L, Bass J. Evaluating Identity Disclosure Risk in Fully Synthetic Health Data: Model Development and Validation. J Med Internet Res 2020; 22:e23139. [PMID: 33196453 PMCID: PMC7704280 DOI: 10.2196/23139] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 08/02/2020] [Revised: 09/02/2020] [Accepted: 10/10/2020] [Indexed: 01/13/2023]
Abstract
BACKGROUND There has been growing interest in data synthesis for enabling the sharing of data for secondary analysis; however, there is a need for a comprehensive privacy risk model for fully synthetic data: If the generative models have been overfit, then it is possible to identify individuals from synthetic data and learn something new about them. OBJECTIVE The purpose of this study is to develop and apply a methodology for evaluating the identity disclosure risks of fully synthetic data. METHODS A full risk model is presented, which evaluates both identity disclosure and the ability of an adversary to learn something new if there is a match between a synthetic record and a real person. We term this "meaningful identity disclosure risk." The model is applied on samples from the Washington State Hospital discharge database (2007) and the Canadian COVID-19 cases database. Both of these datasets were synthesized using a sequential decision tree process commonly used to synthesize health and social science data. RESULTS The meaningful identity disclosure risk for both of these synthesized samples was below the commonly used 0.09 risk threshold (0.0198 and 0.0086, respectively), and 4 times and 5 times lower than the risk values for the original datasets, respectively. CONCLUSIONS We have presented a comprehensive identity disclosure risk model for fully synthetic data. The results for this synthesis method on 2 datasets demonstrate that synthesis can reduce meaningful identity disclosure risks considerably. The risk model can be applied in the future to evaluate the privacy of fully synthetic data.
Affiliation(s)
- Khaled El Emam
  - School of Epidemiology and Public Health, Faculty of Medicine, University of Ottawa, Ottawa, ON, Canada
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, ON, Canada
  - Replica Analytics Ltd, Ottawa, ON, Canada
- Jason Bass
  - Replica Analytics Ltd, Ottawa, ON, Canada
|
24
|
Parker W, Jaremko JL, Cicero M, Azar M, El-Emam K, Gray BG, Hurrell C, Lavoie-Cardinal F, Desjardins B, Lum A, Sheremeta L, Lee E, Reinhold C, Tang A, Bromwich R. Canadian Association of Radiologists White Paper on De-Identification of Medical Imaging: Part 1, General Principles. Can Assoc Radiol J 2020; 72:13-24. [PMID: 33138621 DOI: 10.1177/0846537120967349] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Indexed: 11/15/2022]
Abstract
The application of big data, radiomics, machine learning, and artificial intelligence (AI) algorithms in radiology requires access to large data sets containing personal health information. Because machine learning projects often require collaboration between different sites or data transfer to a third party, precautions are required to safeguard patient privacy. Safety measures are required to prevent inadvertent access to and transfer of identifiable information. The Canadian Association of Radiologists (CAR) is the national voice of radiology committed to promoting the highest standards in patient-centered imaging, lifelong learning, and research. The CAR has created an AI Ethical and Legal standing committee with the mandate to guide the medical imaging community in terms of best practices in data management, access to health care data, de-identification, and accountability practices. Part 1 of this article will inform CAR members on principles of de-identification, pseudonymization, encryption, direct and indirect identifiers, k-anonymization, risks of reidentification, implementations, data set release models, and validation of AI algorithms, with a view to developing appropriate standards to safeguard patient information effectively.
Affiliation(s)
- William Parker
- Department of Radiology, University of British Columbia, Vancouver, British Columbia, Canada; SapienML Corp, Vancouver, British Columbia, Canada
- Jacob L Jaremko
- Department of Radiology & Diagnostic Imaging, University of Alberta, Edmonton, Canada
- Mark Cicero
- 16 Bit Inc, Toronto, Ontario, Canada; True North Imaging, Thornhill, Ontario, Canada
- Marleine Azar
- Department of Medicine, Université de Montréal, Montréal, Quebec, Canada
- Khaled El-Emam
- School of Epidemiology and Public Health, University of Ottawa, Ontario, Canada
- Bruce G Gray
- Department of Medical Imaging, University of Toronto, Toronto, Canada
- Casey Hurrell
- Canadian Association of Radiologists, Ottawa, Canada
- Andrea Lum
- Department of Medical Imaging, Western University, London, Ontario, Canada
- Lori Sheremeta
- Northern Alberta Institute of Technology, Alberta, Canada
- Emil Lee
- Fraser Health Authority, Vancouver, British Columbia, Canada
- Caroline Reinhold
- McGill University Health Center, McGill University, Montreal, Canada; Augmented Intelligence & Precision Health Laboratory of the Research Institute, McGill University Health Center, McGill University, Montreal, Canada
- An Tang
- Department of Radiology, Radio-oncology, and Nuclear Medicine, Université de Montréal, Montréal, Quebec, Canada
- Rebecca Bromwich
- Department of Law and Legal Studies, Carleton University, Ottawa, Canada
25
|
Parker W, Jaremko JL, Cicero M, Azar M, El-Emam K, Gray BG, Hurrell C, Lavoie-Cardinal F, Desjardins B, Lum A, Sheremeta L, Lee E, Reinhold C, Tang A, Bromwich R. Canadian Association of Radiologists White Paper on De-identification of Medical Imaging: Part 2, Practical Considerations. Can Assoc Radiol J 2020; 72:25-34. [PMID: 33140663 DOI: 10.1177/0846537120967345] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Indexed: 11/16/2022]
Abstract
The application of big data, radiomics, machine learning, and artificial intelligence (AI) algorithms in radiology requires access to large data sets containing personal health information. Because machine learning projects often require collaboration between different sites or data transfer to a third party, precautions are required to safeguard patient privacy. Safety measures are required to prevent inadvertent access to and transfer of identifiable information. The Canadian Association of Radiologists (CAR) is the national voice of radiology committed to promoting the highest standards in patient-centered imaging, lifelong learning, and research. The CAR has created an AI Ethical and Legal standing committee with the mandate to guide the medical imaging community in terms of best practices in data management, access to health care data, de-identification, and accountability practices. Part 2 of this article will inform CAR members on the practical aspects of medical imaging de-identification, strengths and limitations of de-identification approaches, list of de-identification software and tools available, and perspectives on future directions.
Affiliation(s)
- William Parker
- Department of Radiology, University of British Columbia, Vancouver, British Columbia, Canada; SapienML Corp, Vancouver, British Columbia, Canada
- Jacob L Jaremko
- Department of Radiology & Diagnostic Imaging, University of Alberta, Edmonton, Canada
- Mark Cicero
- 16 Bit Inc, Toronto, Ontario, Canada; True North Imaging, Thornhill, Ontario, Canada
- Marleine Azar
- Department of Medicine, Université de Montréal, Montréal, Quebec, Canada
- Khaled El-Emam
- School of Epidemiology and Public Health, University of Ottawa, Ontario, Canada
- Bruce G Gray
- Department of Medical Imaging, University of Toronto, Toronto, Canada
- Casey Hurrell
- Canadian Association of Radiologists, Ottawa, Canada
- Andrea Lum
- Department of Medical Imaging, Western University, London, Ontario, Canada
- Lori Sheremeta
- Northern Alberta Institute of Technology, Edmonton, Alberta, Canada
- Emil Lee
- Fraser Health Authority, Vancouver, British Columbia, Canada
- Caroline Reinhold
- McGill University Health Center, McGill University, Montréal, Canada; Augmented Intelligence & Precision Health Laboratory of the Research Institute of McGill University Health Centre, Montréal, Quebec, Canada
- An Tang
- Department of Radiology, Radio-oncology, and Nuclear Medicine, Université de Montréal, Montréal, Quebec, Canada
- Rebecca Bromwich
- Department of Law and Legal Studies, Carleton University, Ottawa, Canada
26
|
Ahn NY, Park JE, Lee DH, Hong PC. Balancing Personal Privacy and Public Safety During COVID-19: The Case of South Korea. IEEE Access 2020; 8:171325-171333. [PMID: 34786290 PMCID: PMC8545276 DOI: 10.1109/access.2020.3025971] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.8] [Received: 08/25/2020] [Accepted: 09/20/2020] [Indexed: 05/09/2023]
Abstract
There has been vigorous debate on how different countries responded to the COVID-19 pandemic. To secure public safety, South Korea actively used personal information at the risk of personal privacy whereas France encouraged voluntary cooperation at the risk of public safety. In this article, after a brief comparison of contextual differences with France, we focus on South Korea's approaches to epidemiological investigations. To evaluate the issues pertaining to personal privacy and public health, we examine the usage patterns of original data, de-identification data, and encrypted data. Our specific proposal discusses the COVID index, which considers collective infection, outbreak intensity, availability of medical infrastructure, and the death rate. Finally, we summarize the findings and lessons for future research and the policy implications.
Affiliation(s)
- Na Young Ahn
- Institute of Cyber Security and Privacy, Korea University, Seoul 02841, South Korea
- Jun Eun Park
- Department of Pediatrics, Korea University College of Medicine, Seoul 02842, South Korea
- Dong Hoon Lee
- Institute of Cyber Security and Privacy and The Graduate School of Information Security, Korea University, Seoul 02841, South Korea
- Paul C. Hong
- Information, Operations, and Technology Management, College of Business and Innovation, The University of Toledo, Toledo, OH 43606, USA
27
|
Elbers DC, Fillmore NR, Sung FC, Ganas SS, Prokhorenkov A, Meyer C, Hall RB, Ajjarapu SJ, Chen DC, Meng F, Grossman RL, Brophy MT, Do NV. The Veterans Affairs Precision Oncology Data Repository, a Clinical, Genomic, and Imaging Research Database. Patterns (N Y) 2020; 1:100083. [PMID: 33205130 PMCID: PMC7660389 DOI: 10.1016/j.patter.2020.100083] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 01/13/2020] [Revised: 06/15/2020] [Accepted: 07/10/2020] [Indexed: 02/06/2023]
Abstract
The Veterans Affairs Precision Oncology Data Repository (VA-PODR) is a large, nationwide repository of de-identified data on patients diagnosed with cancer at the Department of Veterans Affairs (VA). Data include longitudinal clinical data from the VA's nationwide electronic health record system and the VA Central Cancer Registry, targeted tumor sequencing data, and medical imaging data including computed tomography (CT) scans and pathology slides. A subset of the repository is available at the Genomic Data Commons (GDC) and The Cancer Imaging Archive (TCIA), and the full repository is available through the Veterans Precision Oncology Data Commons (VPODC). By releasing this de-identified dataset, we aim to advance Veterans' health care through enabling translational research on the Veteran population by a wide variety of researchers.
Affiliation(s)
- Danne C Elbers
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; University of Vermont, Complex Systems Center, Burlington, VT 05405, USA
- Nathanael R Fillmore
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; Harvard Medical School, Boston, MA 02115, USA; Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Feng-Chi Sung
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA
- Spyridon S Ganas
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA
- Andrew Prokhorenkov
- University of Chicago, Center for Data Intensive Science, Chicago, IL 60615, USA
- Christopher Meyer
- University of Chicago, Center for Data Intensive Science, Chicago, IL 60615, USA
- Robert B Hall
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA
- Samuel J Ajjarapu
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; Dana-Farber Cancer Institute, Boston, MA 02215, USA
- Daniel C Chen
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; Boston University School of Medicine, Boston, MA 02118, USA
- Frank Meng
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; Boston University School of Medicine, Boston, MA 02118, USA
- Robert L Grossman
- University of Chicago, Center for Data Intensive Science, Chicago, IL 60615, USA
- Mary T Brophy
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; Boston University School of Medicine, Boston, MA 02118, USA
- Nhan V Do
- VA Cooperative Studies Program, VA Boston Healthcare System (151MAV), 150 S. Huntington Ave, Jamaica Plain, MA 02130, USA; Boston University School of Medicine, Boston, MA 02118, USA
28
|
Carrell DS, Malin BA, Cronkite DJ, Aberdeen JS, Clark C, Li MR, Bastakoty D, Nyemba S, Hirschman L. Resilience of clinical text de-identified with "hiding in plain sight" to hostile reidentification attacks by human readers. J Am Med Inform Assoc 2020; 27:1374-1382. [PMID: 32930712 DOI: 10.1093/jamia/ocaa095] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 01/02/2020] [Revised: 04/02/2020] [Accepted: 05/26/2020] [Indexed: 11/14/2022]
Abstract
OBJECTIVE Effective, scalable de-identification of personally identifying information (PII) for information-rich clinical text is critical to support secondary use, but no method is 100% effective. The hiding-in-plain-sight (HIPS) approach attempts to solve this "residual PII problem." HIPS replaces PII tagged by a de-identification system with realistic but fictitious (resynthesized) content, making it harder to detect remaining unredacted PII. MATERIALS AND METHODS Using 2000 representative clinical documents from 2 healthcare settings (4000 total), we used a novel method to generate 2 de-identified 100-document corpora (200 documents total) in which PII tagged by a typical automated machine-learned tagger was replaced by HIPS-resynthesized content. Four readers conducted aggressive reidentification attacks to isolate leaked PII: 2 readers from within the originating institution and 2 external readers. RESULTS Overall, mean recall of leaked PII was 26.8% and mean precision was 37.2%. Mean recall was 9% (mean precision = 37%) for patient ages, 32% (mean precision = 26%) for dates, 25% (mean precision = 37%) for doctor names, 45% (mean precision = 55%) for organization names, and 23% (mean precision = 57%) for patient names. Recall was 32% (precision = 40%) for internal and 22% (precision = 33%) for external readers. DISCUSSION AND CONCLUSIONS Approximately 70% of leaked PII "hiding" in a corpus de-identified with HIPS resynthesis is resilient to detection by human readers in a realistic, aggressive reidentification attack scenario, more than double the rate reported in previous studies but less than the rate reported for an attack assisted by machine learning methods.
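The HIPS idea summarized in this abstract can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the paper: the tagger output is assumed to be a list of `(start, end, type)` character spans, `SURROGATES` is a toy replacement pool, and `precision_recall` shows one plausible way a reader's guesses could be scored against the PII that actually leaked.

```python
import random

# Hypothetical surrogate pools for illustration only; not from the paper.
SURROGATES = {
    "NAME": ["Alex Morgan", "Jamie Lee", "Chris Patel"],
    "DATE": ["03/14/2011", "11/02/2009", "07/21/2013"],
}


def hips_resynthesize(text, spans, rng):
    """Replace each tagged span with a fictitious value of the same type.

    spans: list of (start, end, pii_type) character offsets from a tagger.
    PII the tagger missed stays in the text, but it is now surrounded by
    realistic surrogates, which is what makes it harder to spot.
    """
    pieces = []
    cursor = 0
    for start, end, pii_type in sorted(spans):
        pieces.append(text[cursor:start])                   # untouched text
        pieces.append(rng.choice(SURROGATES[pii_type]))     # surrogate value
        cursor = end
    pieces.append(text[cursor:])
    return "".join(pieces)


def precision_recall(flagged, leaked):
    """Score a reader's guesses (flagged) against truly leaked PII (leaked)."""
    flagged, leaked = set(flagged), set(leaked)
    hits = len(flagged & leaked)
    precision = hits / len(flagged) if flagged else 0.0
    recall = hits / len(leaked) if leaked else 0.0
    return precision, recall
```

For example, if a tagger marks only the first name and the date in a note that mentions two clinicians, the second name survives resynthesis; a reader who flags that name plus one surrogate name would be scored at 50% precision and 100% recall under this scheme.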
Affiliation(s)
- David S Carrell
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
- Bradley A Malin
- Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
- David J Cronkite
- Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
- John S Aberdeen
- Human Language Technology, MITRE Corporation, Bedford, Massachusetts, USA
- Cheryl Clark
- Human Language Technology, MITRE Corporation, Bedford, Massachusetts, USA
- Dikshya Bastakoty
- Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
- Steve Nyemba
- Department of Biomedical Informatics, Vanderbilt University, Nashville, Tennessee, USA
- Lynette Hirschman
- Human Language Technology, MITRE Corporation, Bedford, Massachusetts, USA
29
|
Chomutare T, Yigzaw KY, Budrionis A, Makhlysheva A, Godtliebsen F, Dalianis H. De-Identifying Swedish EHR Text Using Public Resources in the General Domain. Stud Health Technol Inform 2020; 270:148-152. [PMID: 32570364 DOI: 10.3233/shti200140] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
Sensitive data is normally required to develop rule-based or train machine learning-based models for de-identifying electronic health record (EHR) clinical notes; and this presents important problems for patient privacy. In this study, we add non-sensitive public datasets to EHR training data; (i) scientific medical text and (ii) Wikipedia word vectors. The data, all in Swedish, is used to train a deep learning model using recurrent neural networks. Tests on pseudonymized Swedish EHR clinical notes showed improved precision and recall from 55.62% and 80.02% with the base EHR embedding layer, to 85.01% and 87.15% when Wikipedia word vectors are added. These results suggest that non-sensitive text from the general domain can be used to train robust models for de-identifying Swedish clinical text; and this could be useful in cases where the data is both sensitive and in low-resource languages.
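The abstract reports precision and recall separately. As a rough aid for comparing the two configurations, the sketch below combines each pair into an F1 score (the harmonic mean of precision and recall). F1 is not reported in the abstract; it is computed here only as an illustration, and the function name is a placeholder.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (same units as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision/recall pairs quoted in the abstract, in percent.
baseline = f1_score(55.62, 80.02)    # base EHR embedding layer
with_wiki = f1_score(85.01, 87.15)   # with Wikipedia word vectors added

print(f"baseline F1 = {baseline:.2f}%, with Wikipedia vectors F1 = {with_wiki:.2f}%")
```

Under this derived measure the gain comes mostly from precision, which rises by about 29 points while recall rises by about 7.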
Affiliation(s)
- Fred Godtliebsen
- Norwegian Centre for E-health Research, Tromsø, Norway
- Faculty of Science & Technology, UiT - The Arctic University of Norway
- Hercules Dalianis
- Norwegian Centre for E-health Research, Tromsø, Norway
- Department of Computer and Systems Sciences, Stockholm University, Sweden
30
|
Demuro PR, Petersen C. Managing Privacy and Data Sharing Through the Use of Health Care Information Fiduciaries. Stud Health Technol Inform 2019; 265:157-162. [PMID: 31431592 DOI: 10.3233/shti190156] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Indexed: 11/15/2022]
Abstract
Policy and regulation seldom keep up with advances in technology. Although data de-identification is seen as a key to protecting one's data, re-identification is often possible. Whether one's data is to be used for care, research, or commercial purposes, individuals are concerned about the use of their information. The authors propose the concept of an information fiduciary for holders of data, describe how it might be applied in a health care context, and outline considerations to determine whether a holder of health care-related information should be regarded as an information fiduciary.
Affiliation(s)
- Paul R Demuro
- Nelson Mullins Broad and Cassel, Ft. Lauderdale, FL, USA
31
|
Cha HS, Jung JM, Shin SY, Jang YM, Park P, Lee JW, Chung SH, Choi KS. The Korea Cancer Big Data Platform (K-CBP) for Cancer Research. Int J Environ Res Public Health 2019; 16:E2290. [PMID: 31261630 DOI: 10.3390/ijerph16132290] [Citation(s) in RCA: 13] [Impact Index Per Article: 2.6] [Received: 04/08/2019] [Revised: 05/31/2019] [Accepted: 06/24/2019] [Indexed: 12/23/2022]
Abstract
Data warehousing is the most important technology to address recent advances in precision medicine. However, a generic clinical data warehouse does not address unstructured and insufficient data. In precision medicine, it is essential to develop a platform that can collect and utilize data. Data were collected from electronic medical records, genomic sequences, tumor biopsy specimens, and national cancer control initiative databases in the National Cancer Center (NCC), Korea. Data were de-identified and stored in a safe and independent space. Unstructured clinical data were standardized and incorporated into cancer registries and linked to cancer genome sequences and tumor biopsy specimens. Finally, national cancer control initiative data from the public domain were independently organized and linked to cancer registries. We constructed a system for integrating and providing various cancer data called the Korea Cancer Big Data Platform (K-CBP). Although the K-CBP could be used for cancer research, the legal and regulatory aspects of data distribution and usage need to be addressed first. Nonetheless, the system will continue collecting data from cancer-related resources that will hopefully facilitate precision-based research.
32
|
Chevrier R, Foufi V, Gaudet-Blavignac C, Robert A, Lovis C. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. J Med Internet Res 2019; 21:e13484. [PMID: 31152528 PMCID: PMC6658290 DOI: 10.2196/13484] [Citation(s) in RCA: 39] [Impact Index Per Article: 7.8] [Received: 01/24/2019] [Revised: 03/29/2019] [Accepted: 04/26/2019] [Indexed: 01/19/2023]
Abstract
BACKGROUND The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients’ privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects’ privacy on one side, and the benefit of scientific advances on the other. OBJECTIVE This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers. METHODS Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently. RESULTS After searching 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32%, each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data. CONCLUSIONS Interest is growing for privacy-enhancing techniques in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine. The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several legislations, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. Using the definitions they provide could help address the variable use of these two concepts in the research community.
Affiliation(s)
- Raphaël Chevrier
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Vasiliki Foufi
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christophe Gaudet-Blavignac
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Arnaud Robert
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christian Lovis
- Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland; Faculty of Medicine, University of Geneva, Geneva, Switzerland
33
|
Caetano SJ, Dawe D, Ellis P, Earle CC, Pond GR. Methods to improve the estimation of time-to-event outcomes when data is de-identified. Stat Med 2019; 38:625-635. [PMID: 30311241 DOI: 10.1002/sim.7990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/30/2017] [Revised: 08/30/2018] [Accepted: 09/06/2018] [Indexed: 11/07/2022]
Abstract
Technological advancements in recent years have sparked the use of large databases for research. The availability of these large databases has created a need for anonymization and de-identification techniques, prior to publishing the data. This de-identification alters the data, which in turn can impact the results derived post de-identification and potentially lead to false conclusions. The objective of this study is to investigate if alterations to a de-identified time-to-event data set may improve the accuracy of the estimates. In this data set, a missing time bias was present among censored patients as a means to preserve patient confidentiality. This study investigates five methods intended to reduce the bias of time-to-event estimates. A simulation study was conducted to evaluate the effectiveness of each method in reducing bias. In situations where there was a large number of censored patients, the results of the simulation showed that Method 4 yielded the most accurate estimates. This method adjusted the survival times of censored patients by adding a random uniform component such that the modified survival time would occur within the final year of the study. Alternatively, when there was only a small number of censored patients, the method that did not alter the de-identified data set (Method 1) provided the most accurate estimates.
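The adjustment described for Method 4 can be sketched as follows. This is one possible reading of the abstract, not the authors' implementation: it assumes times are in years on a common study clock, that each record carries a hypothetical `entry` time alongside its follow-up `time` and `event` flag, and that the new censoring point is drawn uniformly from the final year of the study (never earlier than the patient's recorded last contact).

```python
import random


def adjust_censored_times(records, study_end, rng):
    """Move censored follow-up times into the final study year.

    records: list of dicts with 'entry' (years from study start),
             'time' (follow-up in years), and 'event' (1 = event, 0 = censored).
    Records with an observed event are returned unchanged.
    """
    adjusted = []
    for rec in records:
        new_rec = dict(rec)
        if rec["event"] == 0:
            last_seen = rec["entry"] + rec["time"]
            # Assumed rule: draw the new censoring point uniformly within the
            # last year of the study, but not before the recorded last contact.
            window_start = max(last_seen, study_end - 1.0)
            new_end = rng.uniform(window_start, study_end)
            new_rec["time"] = new_end - rec["entry"]
        adjusted.append(new_rec)
    return adjusted
```

With a five-year study, a censored patient who entered at time 0 and was recorded with two years of follow-up would be reassigned a follow-up between four and five years, while patients with an observed event keep their original times.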
Affiliation(s)
- Samantha-Jo Caetano
- Department of Mathematics and Statistics, McMaster University, Hamilton, Canada
- David Dawe
- Department of Internal Medicine, Faculty of Health Sciences, University of Manitoba, Winnipeg, Canada
- Department of Hematology and Medical Oncology, Cancer Care Manitoba, Winnipeg, Canada
- Peter Ellis
- Department of Oncology, Faculty of Health Sciences, McMaster University, Hamilton, Canada
- Craig C Earle
- Cancer Care Ontario, Toronto, Canada
- Ontario Institute for Cancer Research, Toronto, Canada
- Institute for Clinical Evaluative Sciences, Toronto, Canada
- Gregory R Pond
- Department of Oncology, Faculty of Health Sciences, McMaster University, Hamilton, Canada
34
|
Kuo SIC, Wheeler LA, Updegraff KA, McHale SM, Umaña-Taylor AJ, Perez-Brena NJ. Parental Modeling and Deidentification in Romantic Relationships Among Mexican-origin Youth. J Marriage Fam 2017; 79:1388-1403. [PMID: 29033465 PMCID: PMC5637550 DOI: 10.1111/jomf.12411] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 05/17/2023]
Abstract
This study investigated youth's modeling of and de-identification from parents in romantic relationships, using two phases of data from adolescent siblings, mothers, and fathers in 246 Mexican-origin families. Each parent reported his/her marital satisfaction and conflict, and youth reported on parent-adolescent warmth and conflict at Time 1. Youth's reports of modeling of and de-identification from their mothers and fathers and three romantic relationship outcomes were assessed at Time 2. Findings revealed that higher parental marital satisfaction, lower marital conflict, and higher warmth and lower conflict in parent-adolescent relationships were associated with more modeling and less de-identification from parents. Moreover, higher de-identification was linked to a greater likelihood of youth being involved in a romantic relationship and cohabitation, whereas more modeling was linked to a lower likelihood of cohabitation and older age of first sex. Discussion underscores the importance of assessing parental modeling and de-identification and understanding correlates of these processes.
Affiliation(s)
- Sally I-Chun Kuo
- Department of Psychology, Virginia Commonwealth University, Richmond, VA 23284
- Lorey A Wheeler
- Nebraska Center for Research on Children, Youth, Families and Schools, University of Nebraska-Lincoln, Lincoln, NE 68583
- Kimberly A Updegraff
- T. Denny Sanford School of Social and Family Dynamics, Arizona State University, Tempe, AZ 85287-3701
- Susan M McHale
- Social Science Research Institute, The Pennsylvania State University, University Park, PA 16802
- Adriana J Umaña-Taylor
- T. Denny Sanford School of Social and Family Dynamics, Arizona State University, Tempe, AZ 85287-3701
- Norma J Perez-Brena
- Department of Family and Child Development, Texas State University-San Marcos, San Marcos, TX 78666
35
|
Abstract
Measures for ensuring that epidemiologic studies are reproducible include making data sets and software available to other researchers so they can verify published findings, conduct alternative analyses of the data, and check for statistical errors or programming errors. Recent developments related to the reproducibility and transparency of epidemiologic studies include the creation of a global platform for sharing data from clinical trials and the anticipated future extension of the global platform to non-clinical trial data. Government agencies and departments such as the US Department of Veterans Affairs Cooperative Studies Program have also enhanced their data repositories and data sharing resources. The Institute of Medicine and the International Committee of Medical Journal Editors released guidance on sharing clinical trial data. The US National Institutes of Health has updated their data-sharing policies. In this issue of the Journal, Shepherd et al. (Am J Epidemiol. 2017;186:387-392) outline a pragmatic approach for reproducible research with sensitive data for studies for which data cannot be shared because of legal or ethical restrictions. Their proposed quasi-reproducible approach facilitates the dissemination of statistical methods and codes to independent researchers. Both reproducibility and quasi-reproducibility can increase transparency for critical evaluation, further dissemination of study methods, and expedite the exchange of ideas among researchers.
36
Shepherd BE, Blevins Peratikos M, Rebeiro PF, Duda SN, McGowan CC. A Pragmatic Approach for Reproducible Research With Sensitive Data. Am J Epidemiol 2017; 186:387-392. [PMID: 28830079 DOI: 10.1093/aje/kwx066] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.0] [Received: 07/30/2016] [Accepted: 02/24/2017] [Indexed: 11/13/2022]
Abstract
Reproducible research is important for assessing the integrity of findings and disseminating methods, but it requires making original study data sets publicly available. This requirement is difficult to meet in settings with sensitive data, which can mean that resulting studies are not reproducible. For studies in which data cannot be shared, we propose a pragmatic approach to make research quasi-reproducible. On a publicly available website without restriction, researchers should post 1) analysis code used in the published study, 2) simulated data, and 3) results obtained by applying the analysis code used in the published study to the simulated data. Although it is not a perfect solution, such an approach makes analyses transparent for critical evaluation and dissemination and is therefore a significant improvement over current practice.
37
Dernoncourt F, Lee JY, Uzuner O, Szolovits P. De-identification of patient notes with recurrent neural networks. J Am Med Inform Assoc 2017; 24:596-606. [PMID: 28040687 PMCID: PMC7787254 DOI: 10.1093/jamia/ocw156] [Citation(s) in RCA: 106] [Impact Index Per Article: 15.1] [Received: 06/25/2016] [Revised: 09/06/2016] [Accepted: 10/06/2016] [Indexed: 01/16/2023]
Abstract
OBJECTIVE Patient notes in electronic health records (EHRs) may contain critical information for medical investigations. However, the vast majority of medical investigators can only access de-identified notes, in order to protect the confidentiality of patients. In the United States, the Health Insurance Portability and Accountability Act (HIPAA) defines 18 types of protected health information that needs to be removed to de-identify patient notes. Manual de-identification is impractical given the size of electronic health record databases, the limited number of researchers with access to non-de-identified notes, and the frequent mistakes of human annotators. A reliable automated de-identification system would consequently be of high value. MATERIALS AND METHODS We introduce the first de-identification system based on artificial neural networks (ANNs), which requires no handcrafted features or rules, unlike existing systems. We compare the performance of the system with state-of-the-art systems on two datasets: the i2b2 2014 de-identification challenge dataset, which is the largest publicly available de-identification dataset, and the MIMIC de-identification dataset, which we assembled and is twice as large as the i2b2 2014 dataset. RESULTS Our ANN model outperforms the state-of-the-art systems. It yields an F1-score of 97.85 on the i2b2 2014 dataset, with a recall of 97.38 and a precision of 98.32, and an F1-score of 99.23 on the MIMIC de-identification dataset, with a recall of 99.25 and a precision of 99.21. CONCLUSION Our findings support the use of ANNs for de-identification of patient notes, as they show better performance than previously published systems while requiring no manual feature engineering.
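The reported F1-scores can be cross-checked against the reported precision and recall. The following is a minimal sketch, assuming the standard definition of F1 as the harmonic mean of precision and recall; the function name `f1_score` is illustrative and not taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (same scale as the inputs)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Precision and recall as reported in the abstract (percent scale).
i2b2_f1 = f1_score(precision=98.32, recall=97.38)
mimic_f1 = f1_score(precision=99.21, recall=99.25)

print(round(i2b2_f1, 2), round(mimic_f1, 2))  # 97.85 99.23
```

Both values agree with the F1-scores stated in the abstract, so the three reported metrics are internally consistent for each dataset.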
Affiliation(s)
- Franck Dernoncourt
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ji Young Lee
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
- Ozlem Uzuner
- Computer Science Department, University at Albany, SUNY, Albany, NY, USA
- Peter Szolovits
- Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA
38
Kulynych J, Greely HT. Clinical genomics, big data, and electronic medical records: reconciling patient rights with research when privacy and science collide. J Law Biosci 2017; 4:94-132. [PMID: 28852559 PMCID: PMC5570692 DOI: 10.1093/jlb/lsw061] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.4] [Indexed: 06/07/2023]
Abstract
Widespread use of medical records for research, without consent, attracts little scrutiny compared to biospecimen research, where concerns about genomic privacy prompted recent federal proposals to mandate consent. This paper explores an important consequence of the proliferation of electronic health records (EHRs) in this permissive atmosphere: with the advent of clinical gene sequencing, EHR-based secondary research poses genetic privacy risks akin to those of biospecimen research, yet regulators still permit researchers to call gene sequence data 'de-identified', removing such data from the protection of the federal Privacy Rule and federal human subjects regulations. Medical centers and other providers seeking to offer genomic 'personalized medicine' now confront the problem of governing the secondary use of clinical genomic data as privacy risks escalate. We argue that regulators should no longer permit HIPAA-covered entities to treat dense genomic data as de-identified health information. Even with this step, the Privacy Rule would still permit disclosure of clinical genomic data for research, without consent, under a data use agreement, so we also urge that providers give patients specific notice before disclosing clinical genomic data for research, permitting (where possible) some degree of choice and control. To aid providers who offer clinical gene sequencing, we suggest both general approaches and specific actions to reconcile patients' rights and interests with genomic research.
Affiliation(s)
- Jennifer Kulynych
- Legal Department, The Johns Hopkins Hospital and Health System, 1812 Ashland Ave., Suite 300, Baltimore, MD 21205, USA
- Henry T. Greely
- Law School, Stanford University, 559 Nathan Abbott Way, Stanford, CA 94305-8610, USA
39
Lu Y, Sinnott RO, Verspoor K. A Semantic-Based K-Anonymity Scheme for Health Record Linkage. Stud Health Technol Inform 2017; 239:84-90. [PMID: 28756441] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Record linkage is a technique for integrating data from sources or providers where direct access to the data is not possible due to security and privacy considerations. This is a very common scenario for medical data, as patient privacy is a significant concern. To avoid privacy leakage, researchers have adopted k-anonymity to protect raw data from re-identification; however, they cannot avoid the associated information loss, e.g. due to generalisation. Given that individual-level data is often not disclosed in linkage cases, yet remains potentially re-discoverable, we propose semantic-based linkage k-anonymity to de-identify record linkage with fewer generalisations and eliminate inference disclosure through semantic reasoning.
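To make the trade-off between k-anonymity and generalisation concrete, here is a minimal sketch of plain k-anonymity (not the semantic-based scheme the authors propose). The records, field names, and the 10-year age banding are all made up for illustration.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Size of the smallest group of records sharing the same quasi-identifier values."""
    groups = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(groups.values()) if groups else 0


def generalise_age(record):
    """Replace an exact age with a 10-year band, e.g. 34 -> '30-39'."""
    low = (record["age"] // 10) * 10
    return {**record, "age": f"{low}-{low + 9}"}


# Toy records; every value here is hypothetical.
records = [
    {"age": 34, "postcode": "3052", "diagnosis": "asthma"},
    {"age": 37, "postcode": "3052", "diagnosis": "diabetes"},
    {"age": 52, "postcode": "3010", "diagnosis": "asthma"},
    {"age": 58, "postcode": "3010", "diagnosis": "flu"},
]
qi = ["age", "postcode"]

k_raw = k_anonymity(records, qi)  # every record is unique
k_generalised = k_anonymity([generalise_age(r) for r in records], qi)
print(k_raw, k_generalised)  # 1 2
```

Generalising age raises k from 1 to 2, but the exact ages are lost, which is the kind of information loss the abstract refers to.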
Affiliation(s)
- Yang Lu
- Department of Computing and Information System, The University of Melbourne, Melbourne, Australia
- Richard O Sinnott
- Department of Computing and Information System, The University of Melbourne, Melbourne, Australia
- Karin Verspoor
- Department of Computing and Information System, The University of Melbourne, Melbourne, Australia
40
Foufi V, Gaudet-Blavignac C, Chevrier R, Lovis C. De-Identification of Medical Narrative Data. Stud Health Technol Inform 2017; 244:23-27. [PMID: 29039370] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Maintaining data security and privacy in an era of cybersecurity is a challenge. The enormous and rapidly growing amount of health-related data available today raises numerous questions about data collection, storage, analysis, comparability and interoperability but also about data protection. The US Health Insurance Portability and Accountability Act (HIPAA) of 1996 provides a legal framework and guidance for using and disclosing health data. In practice, the approach proposed by HIPAA is the de-identification of medical documents by removing certain Protected Health Information (PHI). In this work, a rule-based method for the de-identification of French free-text medical data using Natural Language Processing (NLP) tools is presented.
Affiliation(s)
- Vasiliki Foufi
- Division of Medical Information Sciences, Geneva University Hospitals and University of Geneva
- Raphaël Chevrier
- Division of Medical Information Sciences, Geneva University Hospitals and University of Geneva
- Christian Lovis
- Division of Medical Information Sciences, Geneva University Hospitals and University of Geneva
41
Henriksson A, Kvist M, Dalianis H. Prevalence Estimation of Protected Health Information in Swedish Clinical Text. Stud Health Technol Inform 2017; 235:216-220. [PMID: 28423786] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/07/2023]
Abstract
Obscuring protected health information (PHI) in the clinical text of health records facilitates the secondary use of healthcare data in a privacy-preserving manner. Although automatic de-identification of clinical text using machine learning holds much promise, little is known about the relative prevalence of PHI in different types of clinical text and whether there is a need for domain adaptation when learning predictive models from one particular domain and applying it to another. In this study, we address these questions by training a predictive model and using it to estimate the prevalence of PHI in clinical text written (1) in different clinical specialties, (2) in different types of notes (i.e., under different headings), and (3) by persons in different professional roles. It is demonstrated that the overall PHI density is 1.57%; however, substantial differences exist across domains.
Affiliation(s)
- Aron Henriksson
- Department of Computer and Systems Sciences, Stockholm University, Sweden
- Maria Kvist
- Department of Computer and Systems Sciences, Stockholm University, Sweden
- Hercules Dalianis
- Department of Computer and Systems Sciences, Stockholm University, Sweden
42
Spidlen J, Brinkman RR. Use FlowRepository to share your clinical data upon study publication. Cytometry B Clin Cytom 2016; 94:196-198. [PMID: 27342384 DOI: 10.1002/cyto.b.21393] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Received: 05/20/2016] [Accepted: 06/23/2016] [Indexed: 01/01/2023]
Abstract
A fundamental tenet of scientific research is that published results including underlying data should be open to independent validation and refutation. Data sharing encourages collaboration, facilitates quality and reduces redundancy in data production. Authors submitting manuscripts to several journals have already adopted the habit of sharing their underlying flow cytometry data by deposition to FlowRepository, a data repository that is jointly supported by the International Society for Advancement of Cytometry, the International Clinical Cytometry Society and the European Society for Clinical Cell Analysis. De-identification is required for publishing data from clinical studies and we discuss ways to satisfy data sharing requirements and patient privacy requirements simultaneously. Scientific communities in the fields of microarray, proteomics, and sequencing have been benefiting from reuse and re-exploration of data in public repositories for over a decade. We believe it is time that clinicians follow suit and that de-identified clinical data also become routinely available along with published cytometry-based findings. © 2016 International Clinical Cytometry Society.
Affiliation(s)
- Josef Spidlen
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, British Columbia, Canada
- Ryan R Brinkman
- Terry Fox Laboratory, BC Cancer Agency, Vancouver, British Columbia, Canada; Department of Medical Genetics, University of British Columbia, Vancouver, British Columbia, Canada
43
Abstract
A health record database contains structured data fields that identify the patient, such as patient ID, patient name, e-mail and phone number. These data are fairly easy to de-identify, that is, replace with other identifiers. However, these data also occur in fields with doctors' free-text notes written in an abbreviated style that cannot be analyzed grammatically. If we replace a word that looks like a name, but isn't, we degrade readability and medical correctness. If we fail to replace it when we should, we degrade confidentiality. We de-identified an existing Danish electronic health record database, ending up with 323,122 patient health records. We had to invent many methods for de-identifying potential identifiers in the free-text notes. The de-identified health records should be used with caution for statistical purposes because we removed health records that were so special that they couldn't be de-identified. Furthermore, we distorted geography by replacing zip codes with random zip codes.
44
Song X, Wang J, Wang A, Meng Q, Prescott C, Tsu L, Eckert MA. DeID - a data sharing tool for neuroimaging studies. Front Neurosci 2015; 9:325. [PMID: 26441500 PMCID: PMC4585207 DOI: 10.3389/fnins.2015.00325] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/12/2015] [Accepted: 08/31/2015] [Indexed: 11/25/2022]
Abstract
Funding institutions and researchers increasingly expect that data will be shared to increase scientific integrity and provide other scientists with the opportunity to use the data with novel methods that may advance understanding in a particular field of study. In practice, sharing human subject data can be complicated because data must be de-identified prior to sharing. Moreover, integrating varied data types collected in a study can be challenging and time consuming. For example, sharing data from structural imaging studies of a complex disorder requires the integration of imaging, demographic and/or behavioral data in a way that no subject identifiers are included in the de-identified dataset and with new subject labels or identification values that cannot be tracked back to the original ones. We have developed a Java program that users can use to remove identifying information in neuroimaging datasets, while still maintaining the association among different data types from the same subject for further studies. This software provides a series of user interaction wizards to allow users to select data variables to be de-identified, implements functions for auditing and validation of de-identified data, and enables the user to share the de-identified data in a single compressed package through various communication protocols, such as FTPS and SFTP. DeID runs with Windows, Linux, and Mac operating systems and its open architecture allows it to be easily adapted to support a broader array of data types, with the goal of facilitating data sharing. DeID can be obtained at http://www.nitrc.org/projects/deid.
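The relabelling idea described above (new subject labels that cannot be traced back, while data types from the same subject stay linked) can be sketched as follows. This is an illustrative Python sketch, not DeID's actual Java implementation; the function names, the `SUBJ-` label format, and the toy tables are assumptions.

```python
import secrets


def build_label_map(subject_ids):
    """Assign each original subject ID a random label that is unrelated to the ID."""
    mapping, used = {}, set()
    for sid in set(subject_ids):
        label = "SUBJ-" + secrets.token_hex(4)
        while label in used:  # regenerate on the (unlikely) collision
            label = "SUBJ-" + secrets.token_hex(4)
        used.add(label)
        mapping[sid] = label
    return mapping


def relabel(rows, mapping, id_field="subject_id"):
    """Return copies of the rows with the subject ID replaced by its new label."""
    return [{**row, id_field: mapping[row[id_field]]} for row in rows]


# Hypothetical imaging and demographic tables that share subject IDs.
imaging = [{"subject_id": "MUSC001", "scan": "t1.nii"},
           {"subject_id": "MUSC002", "scan": "t1.nii"}]
demographics = [{"subject_id": "MUSC001", "age_band": "30-39"},
                {"subject_id": "MUSC002", "age_band": "50-59"}]

mapping = build_label_map(r["subject_id"] for r in imaging + demographics)
deid_imaging = relabel(imaging, mapping)
deid_demographics = relabel(demographics, mapping)
```

Because one mapping is applied to every table, a subject's scan and demographic rows still share a label after relabelling. The mapping itself must be discarded or stored securely, since anyone holding it can reverse the relabelling.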
Affiliation(s)
- Xuebo Song
- School of Computing, Clemson University Clemson, SC, USA
- James Wang
- School of Computing, Clemson University Clemson, SC, USA
- Anlin Wang
- School of Computing, Clemson University Clemson, SC, USA
- Qingping Meng
- School of Computing, Clemson University Clemson, SC, USA
- Loretta Tsu
- Department of Otolaryngology - Head and Neck Surgery, Medical University of South Carolina Charleston, SC, USA
- Mark A Eckert
- Department of Otolaryngology - Head and Neck Surgery, Medical University of South Carolina Charleston, SC, USA
45
Xia W, Heatherly R, Ding X, Li J, Malin BA. R-U policy frontiers for health data de-identification. J Am Med Inform Assoc 2015; 22:1029-41. [PMID: 25911674 PMCID: PMC4986667 DOI: 10.1093/jamia/ocv004] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.4] [Received: 07/14/2014] [Revised: 12/27/2014] [Accepted: 01/09/2015] [Indexed: 11/12/2022]
Abstract
OBJECTIVE The Health Insurance Portability and Accountability Act Privacy Rule enables healthcare organizations to share de-identified data via two routes. They can either 1) show re-identification risk is small (e.g., via a formal model, such as k-anonymity) with respect to an anticipated recipient or 2) apply a rule-based policy (i.e., Safe Harbor) that enumerates attributes to be altered (e.g., dates to years). The latter is often invoked because it is interpretable, but it fails to tailor protections to the capabilities of the recipient. The paper shows rule-based policies can be mapped to a utility (U) and re-identification risk (R) space, which can be searched for a collection, or frontier, of policies that systematically trade off between these goals. METHODS We extend an algorithm to efficiently compose an R-U frontier using a lattice of policy options. Risk is proportional to the number of patients to which a record corresponds, while utility is proportional to similarity of the original and de-identified distribution. We allow our method to search 20 000 rule-based policies (out of 2^700) and compare the resulting frontier with k-anonymous solutions and Safe Harbor using the demographics of 10 U.S. states. RESULTS The results demonstrate the rule-based frontier 1) consists, on average, of 5000 policies, 2% of which enable better utility with less risk than Safe Harbor and 2) the policies cover a broader spectrum of utility and risk than k-anonymity frontiers. CONCLUSIONS R-U frontiers of de-identification policies can be discovered efficiently, allowing healthcare organizations to tailor protections to anticipated needs and trustworthiness of recipients.
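The notion of a frontier in risk-utility space can be illustrated with a small sketch. It assumes each candidate policy has already been scored with a risk value (lower is better) and a utility value (higher is better), and it uses a brute-force dominance check rather than the lattice search the authors describe; the policy names and scores are invented.

```python
def ru_frontier(policies):
    """Return the policies that no other policy dominates, sorted by risk.

    A policy dominates another if its risk is no higher and its utility is
    no lower, and it is strictly better on at least one of the two.
    """
    frontier = []
    for name, risk, utility in policies:
        dominated = any(
            r <= risk and u >= utility and (r < risk or u > utility)
            for _, r, u in policies
        )
        if not dominated:
            frontier.append((name, risk, utility))
    return sorted(frontier, key=lambda p: p[1])


# Hypothetical (name, risk, utility) scores for five candidate policies.
policies = [
    ("A", 0.10, 0.90),
    ("B", 0.05, 0.70),
    ("C", 0.08, 0.60),  # dominated by B: higher risk and lower utility
    ("D", 0.02, 0.40),
    ("E", 0.06, 0.65),  # dominated by B as well
]
frontier_names = [name for name, _, _ in ru_frontier(policies)]
print(frontier_names)  # ['D', 'B', 'A']
```

Each frontier policy represents a different trade-off, so an organization could pick a point on the frontier according to how much it trusts the data recipient.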
Affiliation(s)
- Weiyi Xia
- Department of Electrical Engineering & Computer Science, Vanderbilt University, Nashville, TN, USA
- Raymond Heatherly
- Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Xiaofeng Ding
- Huazhong University of Science and Technology, Wuhan, China
- Jiuyong Li
- School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes, South Australia, Australia
- Bradley A Malin
- Department of Electrical Engineering & Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
46
Moffet HH, Warton EM, Parker MM, Liu JY, Lyles CR, Karter AJ. The DISTANCE model for collaborative research: distributing analytic effort using scrambled data sets. ACTA ACUST UNITED AC 2014; 2:33-8. [PMID: 25584364 DOI: 10.12691/iscf-2-3-1] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
BACKGROUND Data-sharing is encouraged to fulfill the ethical responsibility to transform research data into public health knowledge, but data sharing carries risks of improper disclosure and potential harm from release of individually identifiable data. METHODS The study objective was to develop and implement a novel method for scientific collaboration and data sharing which distributes the analytic burden while protecting patient privacy. A procedure was developed wherein an investigator who is external to an analytic coordinating center (ACC) can conduct original research following a protocol governed by a Publications and Presentations (P&P) Committee. The collaborating investigator submits a study proposal and, if approved, develops the analytic specifications using existing data dictionaries and templates. An original data set is prepared according to the specifications and the external investigator is provided with a complete but de-identified and shuffled data set which retains all key data fields but which obfuscates individually identifiable data and patterns; this "scrambled data set" provides a "sandbox" for the external investigator to develop and test analytic code for analyses. The analytic code is then run against the original data at the ACC to generate output which is used by the external investigator in preparing a manuscript for journal submission. RESULTS The method has been successfully used with collaborators to produce many published papers and conference reports. CONCLUSION By distributing the analytic burden, this method can facilitate collaboration and expand analytic capacity, resulting in more science for less money.
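One way a "scrambled data set" could be produced is sketched below. The abstract does not specify the shuffling procedure, so this assumes the simplest variant, independently permuting each column; the function name and the toy rows are hypothetical.

```python
import random


def scramble_columns(rows, seed=None):
    """Shuffle every column independently.

    Each field keeps its original set of values (so code written against the
    scrambled data sees realistic types and ranges), but values in one row no
    longer belong to the same individual.
    """
    if not rows:
        return []
    rng = random.Random(seed)
    columns = {key: [row[key] for row in rows] for key in rows[0]}
    for values in columns.values():
        rng.shuffle(values)
    return [{key: columns[key][i] for key in columns} for i in range(len(rows))]


# Toy records with made-up fields and values.
original = [
    {"age": 61, "hba1c": 7.2, "sex": "F"},
    {"age": 48, "hba1c": 9.1, "sex": "M"},
    {"age": 73, "hba1c": 6.5, "sex": "F"},
    {"age": 55, "hba1c": 8.0, "sex": "M"},
]
scrambled = scramble_columns(original, seed=0)
```

Under this assumption, associations between columns are destroyed, so results computed on the scrambled data are meaningless by design. That is consistent with the workflow in the abstract, where the final analytic code is run against the original data at the coordinating center.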
47
Li D, Mojarad MR, Li Y, Sohn S, Mehrabi S, Elayavilli RK, Yu Y, Liu H. A Frequency-based Strategy of Obtaining Sentences from Clinical Data Repository for Crowdsourcing. Stud Health Technol Inform 2015; 216:1033-4. [PMID: 26262333 PMCID: PMC5859924] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/04/2022]
Abstract
In clinical NLP, one major barrier to adopting crowdsourcing for NLP annotation is the issue of confidentiality for protected health information (PHI) in clinical narratives. In this paper, we investigated the use of a frequency-based approach to extract sentences without PHI. Our approach is based on the assumption that sentences appearing frequently tend to contain no PHI. Both manual and automatic evaluations on 500 sentences out of the 7.9 million sentences with frequencies higher than one show that no PHI can be found among them. These promising results suggest the potential of releasing those sentences to obtain sentence-level NLP annotations via crowdsourcing.
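The frequency-based filter can be sketched in a few lines. This is a minimal illustration of the stated assumption (sentences that recur are unlikely to contain PHI), not the authors' pipeline; the normalisation step, the function name, and the example sentences are assumptions.

```python
from collections import Counter


def frequent_sentences(sentences, min_count=2):
    """Keep normalised sentences that occur at least `min_count` times."""
    counts = Counter(s.strip().lower() for s in sentences)
    return {s: c for s, c in counts.items() if c >= min_count}


# Invented sentences; the last one mimics a PHI-bearing sentence.
corpus = [
    "No acute distress.",
    "No acute distress.",
    "Follow up in 2 weeks.",
    "no acute distress.",
    "Follow up in 2 weeks.",
    "Mr. Smith was seen on 3/4 by Dr. Jones.",
]
candidates = frequent_sentences(corpus)
print(candidates)  # {'no acute distress.': 3, 'follow up in 2 weeks.': 2}
```

The sentence naming a patient and clinician appears only once and is filtered out. A frequency threshold alone gives no guarantee, which is why the abstract reports manual and automatic checks on a sample of the retained sentences.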
Affiliation(s)
- Yue Yu
- Mayo Clinic, Rochester, MN, USA; Department of Biomedical Informatics, University of Jilin, Jilin, China
48
McGraw D. Building public trust in uses of Health Insurance Portability and Accountability Act de-identified data. J Am Med Inform Assoc 2013; 20:29-34. [PMID: 22735615 PMCID: PMC3555317 DOI: 10.1136/amiajnl-2012-000936] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.0] [Received: 03/26/2012] [Accepted: 05/31/2012] [Indexed: 11/04/2022]
Abstract
OBJECTIVES The aim of this paper is to summarize concerns with the de-identification standard and methodologies established under the Health Insurance Portability and Accountability Act (HIPAA) regulations, and report some potential policies to address those concerns that were discussed at a recent workshop attended by industry, consumer, academic and research stakeholders. TARGET AUDIENCE The target audience includes researchers, industry stakeholders, policy makers and consumer advocates concerned about preserving the ability to use HIPAA de-identified data for a range of important secondary uses. SCOPE HIPAA sets forth methodologies for de-identifying health data; once such data are de-identified, they are no longer subject to HIPAA regulations and can be used for any purpose. Concerns have been raised about the sufficiency of HIPAA de-identification methodologies, the lack of legal accountability for unauthorized re-identification of de-identified data, and insufficient public transparency about de-identified data uses. Although there is little published evidence of the re-identification of properly de-identified datasets, such concerns appear to be increasing. This article discusses policy proposals intended to address de-identification concerns while maintaining de-identification as an effective tool for protecting privacy and preserving the ability to leverage health data for secondary purposes.
Affiliation(s)
- Deven McGraw
- Center for Democracy & Technology, 1634 I Street, NW Suite 1100, Washington, DC 20006
49
Cimino JJ. The false security of blind dates: chrononymization's lack of impact on data privacy of laboratory data. Appl Clin Inform 2012; 3:392-403. [PMID: 23646086 DOI: 10.4338/aci-2012-07-ra-0028] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Received: 07/13/2012] [Accepted: 10/01/2012] [Indexed: 11/23/2022]
Abstract
BACKGROUND The reuse of clinical data for research purposes requires methods for the protection of personal privacy. One general approach is the removal of personal identifiers from the data. A frequent part of this anonymization process is the removal of times and dates, which we refer to as "chrononymization." While this step can make the association with identified data (such as public information or a small sample of patient information) more difficult, it comes at a cost to the usefulness of the data for research. OBJECTIVES We sought to determine whether removal of dates from common laboratory test panels offers any advantage in protecting such data from re-identification. METHODS We obtained a set of results for 5.9 million laboratory panels from the National Institutes of Health's (NIH) Biomedical Translational Research Information System (BTRIS), selected a random set of 20,000 panels from the larger source sets, and then identified all matches between the sets. RESULTS We found that while removal of dates could hinder the re-identification of a single test result, such removal had almost no effect when entire panels were used. CONCLUSIONS Our results suggest that reliance on chrononymization provides a false sense of security for the protection of laboratory test results. As a result of this study, the NIH has chosen to rely on policy solutions, such as strong data use agreements, rather than removal of dates when reusing clinical data for research purposes.
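The contrast between single test results and whole panels can be illustrated with a toy matching experiment. This is only a sketch of the general idea, not the study's matching procedure or data; the panel structure, function name, and all values are invented.

```python
from collections import defaultdict


def unique_match_rate(sample, source, use_date):
    """Fraction of sample panels whose key maps to exactly one patient in source."""
    def key(panel):
        values = tuple(sorted(panel["results"].items()))
        return (panel["date"], values) if use_date else values

    index = defaultdict(set)
    for panel in source:
        index[key(panel)].add(panel["patient"])
    unique = sum(1 for panel in sample if len(index.get(key(panel), ())) == 1)
    return unique / len(sample)


# Hypothetical electrolyte panels for three patients.
panels = [
    {"patient": "p1", "date": "2012-01-03", "results": {"Na": 140, "K": 4.1, "Cl": 101}},
    {"patient": "p2", "date": "2012-01-03", "results": {"Na": 138, "K": 4.4, "Cl": 99}},
    {"patient": "p3", "date": "2012-02-10", "results": {"Na": 140, "K": 3.9, "Cl": 103}},
]
# The same data reduced to a single test result per record.
sodium_only = [{**p, "results": {"Na": p["results"]["Na"]}} for p in panels]

single_with_date = unique_match_rate(sodium_only, sodium_only, use_date=True)
single_no_date = unique_match_rate(sodium_only, sodium_only, use_date=False)
panel_no_date = unique_match_rate(panels, panels, use_date=False)
```

In this toy data, dropping dates makes two of the three single sodium results ambiguous, while every full panel still matches exactly one patient without its date. That mirrors the pattern the abstract reports: date removal hinders matching on single results but has little effect on whole panels.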
Affiliation(s)
- J J Cimino
- Department of Pediatrics, Hospital for Special Surgery
50
Chervenak AL, van Erp TGM, Kesselman C, D'Arcy M, Sobell J, Keator D, Dahm L, Murry J, Law M, Hasso A, Ames J, Macciardi F, Potkin SG. A system architecture for sharing de-identified, research-ready brain scans and health information across clinical imaging centers. Stud Health Technol Inform 2012; 175:19-28. [PMID: 22941984 PMCID: PMC4478050] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/01/2023]
Abstract
Progress in our understanding of brain disorders increasingly relies on the costly collection of large standardized brain magnetic resonance imaging (MRI) data sets. Moreover, the clinical interpretation of brain scans benefits from compare and contrast analyses of scans from patients with similar, and sometimes rare, demographic, diagnostic, and treatment status. A solution to both needs is to acquire standardized, research-ready clinical brain scans and to build the information technology infrastructure to share such scans, along with other pertinent information, across hospitals. This paper describes the design, deployment, and operation of a federated imaging system that captures and shares standardized, de-identified clinical brain images in a federation across multiple institutions. In addition to describing innovative aspects of the system architecture and our initial testing of the deployed infrastructure, we also describe the Standardized Imaging Protocol (SIP) developed for the project and our interactions with the Institutional Review Board (IRB) regarding handling patient data in the federated environment.