1. Sepas A, Bangash AH, Alraoui O, El Emam K, El-Hussuna A. Algorithms to anonymize structured medical and healthcare data: A systematic review. Frontiers in Bioinformatics 2022; 2:984807. [PMID: 36619476 PMCID: PMC9815524 DOI: 10.3389/fbinf.2022.984807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022] [Open Access]
Abstract
Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird's-eye view of algorithms for SMHD anonymization.
Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes.
Results: Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy-five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm, e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes.
Conclusion: This study reviewed and categorized different anonymization approaches for MHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches.
Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].
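The abstract above names k-anonymity and l-diversity as commonly combined approaches. As a minimal sketch of what those two properties check (not the implementation of any reviewed study), the Python below groups records by their quasi-identifier values and reports the smallest group size and the smallest number of distinct sensitive values per group. The field names `age`, `zip`, and `diagnosis`, the sample records, and all function names are hypothetical, and only the simple "distinct" variant of l-diversity is shown.

```python
from collections import Counter


def smallest_class_size(records, quasi_identifiers):
    """Size of the smallest equivalence class over the quasi-identifiers."""
    classes = Counter(tuple(r[q] for q in quasi_identifiers) for r in records)
    return min(classes.values()) if classes else 0


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every quasi-identifier combination occurs at least k times."""
    return smallest_class_size(records, quasi_identifiers) >= k


def min_distinct_sensitive(records, quasi_identifiers, sensitive):
    """Smallest count of distinct sensitive values in any equivalence class
    (the 'l' of distinct l-diversity)."""
    groups = {}
    for r in records:
        key = tuple(r[q] for q in quasi_identifiers)
        groups.setdefault(key, set()).add(r[sensitive])
    return min(len(values) for values in groups.values()) if groups else 0


# Hypothetical, already-generalized records used only for illustration.
records = [
    {"age": "30-39", "zip": "921**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "921**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "900**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "900**", "diagnosis": "asthma"},
]

if __name__ == "__main__":
    qi = ["age", "zip"]
    print("k-anonymous for k=2:", is_k_anonymous(records, qi, 2))
    print("distinct l:", min_distinct_sensitive(records, qi, "diagnosis"))
```

In this toy table every quasi-identifier combination appears twice, yet one group holds a single diagnosis value, which is the kind of gap that motivates pairing k-anonymity with l-diversity.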
Affiliation(s)
- Ali Sepas
  - Open Source Research Collaboration, Aalborg, Denmark
  - Department of Materials and Production, Aalborg University, Aalborg, Denmark
- Ali Haider Bangash
  - Open Source Research Collaboration, Aalborg, Denmark
  - STMU Shifa College of Medicine, Islamabad, Pakistan
- Omar Alraoui
  - Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- Khaled El Emam
  - Canada Research Chair in Medical AI, University of Ottawa, Ottawa, ON, Canada
2. Desmet C, Cook DJ. Recent Developments in Privacy-Preserving Mining of Clinical Data. ACM/IMS Transactions on Data Science 2021; 2:28. [PMID: 35018368 PMCID: PMC8746818 DOI: 10.1145/3447774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2020] [Accepted: 01/01/2021] [Indexed: 06/14/2023]
Abstract
With the dramatic increases in both the capability to collect personal data and the capability to analyze large amounts of data, increasingly sophisticated and personal insights are being drawn. These insights are valuable for clinical applications but also open up possibilities for identification and abuse of personal information. In this paper, we survey recent research on classical methods of privacy-preserving data mining. Looking at dominant techniques and recent innovations to them, we examine the applicability of these methods to the privacy-preserving analysis of clinical data. We also discuss promising directions for future research in this area.
3. Wang JT, Lin WY. Privacy-Preserving Anonymity for Periodical Releases of Spontaneous Adverse Drug Event Reporting Data: Algorithm Development and Validation. JMIR Med Inform 2021; 9:e28752. [PMID: 34709197 PMCID: PMC8587328 DOI: 10.2196/28752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2021] [Revised: 07/30/2021] [Accepted: 08/02/2021] [Indexed: 11/20/2022] [Open Access]
Abstract
Background: Spontaneous reporting systems (SRSs) have been increasingly established to collect adverse drug events for fostering adverse drug reaction (ADR) detection and analysis research. SRS data contain personal information, and so their publication requires data anonymization to prevent the disclosure of individuals’ privacy. We have previously proposed a privacy model called MS(k, θ*)-bounding and the associated MS-Anonymization algorithm to fulfill the anonymization of SRS data. In the real world, the SRS data usually are released periodically (eg, FDA Adverse Event Reporting System [FAERS]) to accommodate newly collected adverse drug events. Different anonymized releases of SRS data available to the attacker may thwart our single-release-focus method, that is, MS(k, θ*)-bounding.
Objective: We investigate the privacy threat caused by periodical releases of SRS data and propose anonymization methods to prevent the disclosure of personal privacy information while maintaining the utility of published data.
Methods: We identify potential attacks on periodical releases of SRS data, namely, BFL-attacks, mainly caused by follow-up cases. We present a new privacy model called PPMS(k, θ*)-bounding, and propose the associated PPMS-Anonymization algorithm and 2 improvements: PPMS+-Anonymization and PPMS++-Anonymization. Empirical evaluations were performed using 32 selected FAERS quarter data sets from 2004Q1 to 2011Q4. The performance of the proposed versions of PPMS-Anonymization was inspected against MS-Anonymization from some aspects, including data distortion, measured by normalized information loss; privacy risk of anonymized data, measured by dangerous identity ratio and dangerous sensitivity ratio; and data utility, measured by the bias of signal counting and strength (proportional reporting ratio).
Results: The best version of PPMS-Anonymization, PPMS++-Anonymization, achieves nearly the same quality as MS-Anonymization in both privacy protection and data utility. Overall, PPMS++-Anonymization ensures zero privacy risk on record and attribute linkage, and exhibits 51%-78% and 59%-82% improvements on information loss over PPMS+-Anonymization and PPMS-Anonymization, respectively, and significantly reduces the bias of ADR signal.
Conclusions: The proposed PPMS(k, θ*)-bounding model and PPMS-Anonymization algorithm are effective in anonymizing SRS data sets in the periodical data publishing scenario, preventing the series of releases from disclosing personal sensitive information caused by BFL-attacks while maintaining the data utility for ADR signal detection.
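The abstract above measures data utility through the proportional reporting ratio (PRR). The sketch below shows the standard 2x2 contingency-table form of PRR and one plausible way to express how far an anonymized release shifts it. The counts are invented, and the `relative_bias` helper is an assumption for illustration; the abstract does not state the exact bias formula the authors used.

```python
import math


def prr(a, b, c, d):
    """Proportional reporting ratio from a 2x2 table of report counts.

    a: reports with the drug and the event
    b: reports with the drug, without the event
    c: reports with other drugs and the event
    d: reports with other drugs, without the event
    """
    if a + b == 0 or c + d == 0 or c == 0:
        raise ValueError("PRR is undefined for these counts")
    return (a / (a + b)) / (c / (c + d))


def relative_bias(original, anonymized):
    """Signed relative change of a signal value after anonymization
    (hypothetical utility measure, not taken from the paper)."""
    return (anonymized - original) / original


if __name__ == "__main__":
    # Invented counts: the same drug-event pair before and after anonymization.
    before = prr(20, 80, 100, 900)
    after = prr(18, 82, 100, 900)
    print("PRR before:", round(before, 3))
    print("PRR after:", round(after, 3))
    print("relative bias:", round(relative_bias(before, after), 3))
    print("finite:", math.isfinite(after))
```

A bias near zero would indicate that the anonymized release preserves the signal strength an analyst would have computed from the original data.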
Affiliation(s)
- Jie-Teng Wang
  - Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
- Wen-Yang Lin
  - Department of Computer Science and Information Engineering, National University of Kaohsiung, Kaohsiung, Taiwan
4. A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020; 103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]
Abstract
Background: The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, which might not provide sufficient statistical power for models. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though they are commonly used and achieve good performances for prognostic prediction, usually suffer worse performance under multicenter privacy-preserving data mining scenarios compared to a centrally trained version.
Methods and Materials: In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performances on par with, or even better than, centrally trained RF but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information.
Result: The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach.
Conclusion: The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.
5. Hsiao MH, Lin WY, Hsu KY, Shen ZX. On Anonymizing Medical Microdata with Large-Scale Missing Values - A Case Study with the FAERS Dataset. Annual International Conference of the IEEE Engineering in Medicine and Biology Society 2020; 2019:6505-6508. [PMID: 31947331 DOI: 10.1109/embc.2019.8857025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]
Abstract
As big data analysis becomes one of the main driving forces for productivity and economic growth, the concern of individual privacy disclosure increases as well, especially for applications accessing medical or health data that contain personal information. Most contemporary techniques for privacy-preserving data publishing follow a simple assumption: the data of concern is complete, i.e., contains no missing values, which however is not the case in the real world. This paper presents our endeavors on inspecting the effect of missing values upon medical data privacy. In particular, we inspected the US FAERS dataset, a public dataset containing adverse drug events released by the US FDA. Following the presumption of the current anonymization paradigm (the data should contain no missing values), we investigated three intuitive strategies, namely including missing values, excluding missing values, or executing imputation, to anonymize the FAERS dataset. Our results demonstrate the awkwardness of these intuitive strategies in handling data with a massive amount of missing values. Accordingly, we propose a new strategy, consolidation, and the corresponding privacy protection model and anonymization algorithm. Experimental results show that our method can prevent privacy disclosure and sustain the data utility for ADR signal detection.
6. Chevrier R, Foufi V, Gaudet-Blavignac C, Robert A, Lovis C. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. J Med Internet Res 2019; 21:e13484. [PMID: 31152528 PMCID: PMC6658290 DOI: 10.2196/13484] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 01/24/2019] [Revised: 03/29/2019] [Accepted: 04/26/2019] [Indexed: 01/19/2023] [Open Access]
Abstract
Background: The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients’ privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects’ privacy on one side, and the benefit of scientific advances on the other.
Objective: This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers.
Methods: Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently.
Results: After searching 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32%, each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data.
Conclusions: Interest is growing for privacy-enhancing techniques in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine. The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several legislations, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. Using the definitions they provide could help address the variable use of these two concepts in the research community.
Affiliation(s)
- Raphaël Chevrier
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Vasiliki Foufi
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christophe Gaudet-Blavignac
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Arnaud Robert
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
- Christian Lovis
  - Division of Medical Information Sciences, University Hospitals of Geneva, Geneva, Switzerland
  - Faculty of Medicine, University of Geneva, Geneva, Switzerland
7. Trippe ZA, Brendani B, Meier C, Lewis D. Identification of Substandard Medicines via Disproportionality Analysis of Individual Case Safety Reports. Drug Saf 2017; 40:293-303. [PMID: 28130773 DOI: 10.1007/s40264-016-0499-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 10/20/2022]
Abstract
Introduction: The distribution and use of substandard medicines (SSMs) is a public health concern worldwide. The detection of SSMs is currently limited to expensive large-scale assay techniques such as high-performance liquid chromatography (HPLC). Since 2013, the Pharmacovigilance Department at Novartis Pharma AG has been analyzing drug-associated adverse events related to 'product quality issues' with the aim of detecting defective medicines using spontaneous reporting. The method of identifying SSMs with spontaneous reporting was pioneered by the Monitoring Medicines project in 2011.
Methods: This retrospective review was based on data from the World Health Organization (WHO) Global individual case safety report (ICSR) database VigiBase® collected from January 2001 to December 2014. We conducted three different stratification analyses using the Multi-item Gamma Poisson Shrinker (MGPS) algorithm through the Oracle Empirica data-mining software. In total, 24 preferred terms (PTs) from the Medical Dictionary for Regulatory Activities (MedDRA®) were used to identify poor-quality medicines. To identify potential SSMs for further evaluation, a cutoff of 2.0 for EB05, the lower 95% interval of the empirical Bayes geometric mean (EBGM), was applied. We carried out a literature search for advisory letters related to defective medicinal products to validate our findings. Furthermore, we aimed to assess whether we could confirm two SSMs first identified by the Uppsala Monitoring Centre (UMC) with our stratification method.
Results: The analysis of ICSRs based on the specified selection criteria and threshold yielded 2506 hits including medicinal products with an excess of reports of product quality defects relative to other medicines in the database. Further investigations and a pilot study in five authorized medicinal products (proprietary and generic) licensed by a single marketing authorization holder, containing valsartan, methylphenidate, rivastigmine, clozapine, or carbamazepine, were performed. This resulted in an output of 23 potential SSMs. The literature search identified two communications issued to health professionals concerning a substandard rivastigmine patch, which validated our initial findings. Furthermore, we identified excess reporting of product quality issues with an ethinyl estradiol/norgestrel combination and with salbutamol. These were categorized as confirmed clusters of substandard/spurious/falsely labelled/falsified/counterfeit (SSFFC) medical products by the UMC in 2014.
Conclusion: This study illustrates the value of data mining of spontaneous adverse event reports and the applicability of disproportionality analysis to identify potential SSMs.
Affiliation(s)
- Zahra Anita Trippe
  - Patient Safety, Novartis Pharma AG, Basel, Switzerland
  - Division of Clinical Pharmacy and Epidemiology, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- Christoph Meier
  - Division of Clinical Pharmacy and Epidemiology, Department of Pharmaceutical Sciences, University of Basel, Basel, Switzerland
- David Lewis
  - Patient Safety, Novartis Pharma AG, Basel, Switzerland
  - School of Life and Medical Sciences, University of Hertfordshire, Hatfield, England, UK
8. Lee H, Kim S, Kim JW, Chung YD. Utility-preserving anonymization for health data publishing. BMC Med Inform Decis Mak 2017; 17:104. [PMID: 28693480 PMCID: PMC5504813 DOI: 10.1186/s12911-017-0499-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 04/06/2017] [Accepted: 06/28/2017] [Indexed: 11/23/2022] [Open Access]
Abstract
Background: Publishing raw electronic health records (EHRs) may be considered a breach of the privacy of individuals because they usually contain sensitive information. A common practice for privacy-preserving data publishing is to anonymize the data before publishing, and thus satisfy privacy models such as k-anonymity. Among various anonymization techniques, generalization is the most commonly used in medical/health data processing. Generalization inevitably causes information loss, and thus, various methods have been proposed to reduce information loss. However, existing generalization-based data anonymization methods cannot avoid excessive information loss and preserve data utility.
Methods: We propose a utility-preserving anonymization for privacy-preserving data publishing (PPDP). To preserve data utility, the proposed method comprises three parts: (1) utility-preserving model, (2) counterfeit record insertion, (3) catalog of the counterfeit records. We also propose an anonymization algorithm using the proposed method. Our anonymization algorithm applies a full-domain generalization algorithm. We evaluate our method in comparison with an existing method on two aspects: information loss measured through various quality metrics, and error rate of the analysis result.
Results: With all different types of quality metrics, our proposed method shows lower information loss than the existing method. In the real-world EHRs analysis, analysis results show a small portion of error between the data anonymized through the proposed method and the original data.
Conclusions: We propose a new utility-preserving anonymization method and an anonymization algorithm using the proposed method. Through experiments on various datasets, we show that the utility of EHRs anonymized by the proposed method is significantly better than that of EHRs anonymized by previous approaches.
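The abstract above builds on full-domain generalization, in which every value of an attribute is recoded to the same level of a generalization hierarchy. The following is a simplified sketch of that baseline idea only, for a single attribute, and it does not reproduce the authors' utility-preserving model or counterfeit record insertion. The three-level age hierarchy, the sample ages, and the function names are hypothetical.

```python
from collections import Counter


def generalize_age(age, level):
    """Recode an age at a hierarchy level: 0 = exact, 1 = 10-year band, 2 = suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"


def minimal_level(ages, k, max_level=2):
    """Lowest hierarchy level at which every generalized value occurs at least k times.

    The same level is applied to all records (full-domain generalization).
    Returns None if no level up to max_level satisfies k, or if ages is empty.
    """
    if not ages:
        return None
    for level in range(max_level + 1):
        counts = Counter(generalize_age(a, level) for a in ages)
        if min(counts.values()) >= k:
            return level
    return None


if __name__ == "__main__":
    ages = [31, 34, 38, 42, 45, 47]  # invented sample
    level = minimal_level(ages, k=3)
    print("level:", level)
    print([generalize_age(a, level) for a in ages])
```

Each step up the hierarchy discards detail for every record at once, which illustrates the information loss that the abstract says generalization inevitably causes.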
Affiliation(s)
- Hyukki Lee
  - Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Republic of Korea
- Soohyung Kim
  - Department of IT Convergence, Korea University, Seoul, 145 Anam-ro, Seongbuk-gu, 02841 Republic of Korea
- Jong Wook Kim
  - Department of Media Software, Seoul, 20-Gil, Hongji-dong, Seongbuk-gu, 03016 Republic of Korea
- Yon Dohn Chung
  - Department of Computer Science and Engineering, Korea University, 145 Anam-ro, Seongbuk-gu, Seoul, 02841 Republic of Korea