Reference Citation Analysis: Find an Article, Find a Category, Find a Journal, Find a Scholar

For: Lin WY, Yang DC, Wang JT. Privacy preserving data anonymization of spontaneous ADE reporting system dataset. BMC Med Inform Decis Mak 2016;16 Suppl 1:58. [PMID: 27454754 DOI: 10.1186/s12911-016-0293-4] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.9] [Indexed: 12/02/2022] Open

Cited by Other Article(s)

Sepas A, Bangash AH, Alraoui O, El Emam K, El-Hussuna A. Algorithms to anonymize structured medical and healthcare data: A systematic review. FRONTIERS IN BIOINFORMATICS 2022;2:984807. [PMID: 36619476 PMCID: PMC9815524 DOI: 10.3389/fbinf.2022.984807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022] Open

Abstract

Introduction

With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird's eye view of algorithms for SMHD anonymization.

Methods

This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes.

Results

Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy-five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm, e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes.

Conclusion

This study reviewed and categorized different anonymization approaches for MHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches.

Systematic Review Registration

http://www.crd.york.ac.uk/prospero, identifier CRD42021228200.
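To make the two privacy models named in this abstract concrete, the following is a minimal Python sketch of how k-anonymity and the simplest ("distinct") form of l-diversity could be checked on an already generalized table. It is not taken from the reviewed article: the function names, the column names (`age`, `zip`, `diagnosis`), and the sample records are all hypothetical, and real tools use richer definitions of l-diversity than the distinct-value count assumed here.

```python
from collections import defaultdict


def group_by_quasi_identifiers(records, quasi_identifiers):
    """Group records into equivalence classes keyed by quasi-identifier values."""
    groups = defaultdict(list)
    for rec in records:
        key = tuple(rec[q] for q in quasi_identifiers)
        groups[key].append(rec)
    return groups


def is_k_anonymous(records, quasi_identifiers, k):
    """True if every equivalence class contains at least k records."""
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(len(group) >= k for group in groups.values())


def is_l_diverse(records, quasi_identifiers, sensitive, l):
    """True if every equivalence class holds at least l distinct sensitive values.

    This is the 'distinct l-diversity' variant, chosen here for brevity.
    """
    groups = group_by_quasi_identifiers(records, quasi_identifiers)
    return all(
        len({rec[sensitive] for rec in group}) >= l for group in groups.values()
    )


# Hypothetical, already generalized records (illustrative only).
records = [
    {"age": "30-39", "zip": "123**", "diagnosis": "asthma"},
    {"age": "30-39", "zip": "123**", "diagnosis": "diabetes"},
    {"age": "40-49", "zip": "456**", "diagnosis": "asthma"},
    {"age": "40-49", "zip": "456**", "diagnosis": "asthma"},
]
qi = ["age", "zip"]

print(is_k_anonymous(records, qi, 2))             # each class has 2 records
print(is_l_diverse(records, qi, "diagnosis", 2))  # second class has one diagnosis
```

The sample table is 2-anonymous but not 2-diverse, because the second equivalence class contains a single diagnosis. That gap is one reason k-anonymity is often paired with another model such as l-diversity.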

Desmet C, Cook DJ. Recent Developments in Privacy-Preserving Mining of Clinical Data. ACM/IMS TRANSACTIONS ON DATA SCIENCE 2021;2:28. [PMID: 35018368 PMCID: PMC8746818 DOI: 10.1145/3447774] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/01/2020] [Accepted: 01/01/2021] [Indexed: 06/14/2023]

Wang JT, Lin WY. Privacy-Preserving Anonymity for Periodical Releases of Spontaneous Adverse Drug Event Reporting Data: Algorithm Development and Validation. JMIR Med Inform 2021;9:e28752. [PMID: 34709197 PMCID: PMC8587328 DOI: 10.2196/28752] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/13/2021] [Revised: 07/30/2021] [Accepted: 08/02/2021] [Indexed: 11/20/2022] Open

Abstract

Background

Spontaneous reporting systems (SRSs) have been increasingly established to collect adverse drug events for fostering adverse drug reaction (ADR) detection and analysis research. SRS data contain personal information, and so their publication requires data anonymization to prevent the disclosure of individuals’ privacy. We have previously proposed a privacy model called MS(k, θ*)-bounding and the associated MS-Anonymization algorithm to fulfill the anonymization of SRS data. In the real world, the SRS data usually are released periodically (eg, FDA Adverse Event Reporting System [FAERS]) to accommodate newly collected adverse drug events. Different anonymized releases of SRS data available to the attacker may thwart our single-release-focus method, that is, MS(k, θ*)-bounding.

Objective

We investigate the privacy threat caused by periodical releases of SRS data and propose anonymization methods to prevent the disclosure of personal privacy information while maintaining the utility of published data.

Methods

We identify potential attacks on periodical releases of SRS data, namely, BFL-attacks, mainly caused by follow-up cases. We present a new privacy model called PPMS(k, θ*)-bounding, and propose the associated PPMS-Anonymization algorithm and 2 improvements: PPMS+-Anonymization and PPMS++-Anonymization. Empirical evaluations were performed using 32 selected FAERS quarter data sets from 2004Q1 to 2011Q4. The performance of the proposed versions of PPMS-Anonymization was inspected against MS-Anonymization from some aspects, including data distortion, measured by normalized information loss; privacy risk of anonymized data, measured by dangerous identity ratio and dangerous sensitivity ratio; and data utility, measured by the bias of signal counting and strength (proportional reporting ratio).
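The signal strength measure named above, the proportional reporting ratio (PRR), is conventionally computed from a 2x2 contingency table of report counts. The Python sketch below illustrates that standard formula. The `relative_bias` helper is only an assumed, simplified way to compare a PRR before and after anonymization and is not the bias measure defined by the authors. All counts and names are hypothetical.

```python
def proportional_reporting_ratio(a, b, c, d):
    """PRR from a 2x2 table of spontaneous reports.

    a: reports with the drug of interest and the event of interest
    b: reports with the drug of interest and any other event
    c: reports with any other drug and the event of interest
    d: reports with any other drug and any other event
    """
    if a + b == 0 or c + d == 0:
        raise ValueError("each drug group needs at least one report")
    if c == 0:
        raise ValueError("PRR is undefined when no other drug reports the event")
    return (a / (a + b)) / (c / (c + d))


def relative_bias(original_value, anonymized_value):
    """Assumed illustration: relative change of a signal after anonymization."""
    return abs(anonymized_value - original_value) / original_value


# Hypothetical counts before and after anonymization (illustrative only).
prr_original = proportional_reporting_ratio(a=20, b=80, c=10, d=190)
prr_anonymized = proportional_reporting_ratio(a=18, b=82, c=10, d=190)

print(prr_original)
print(prr_anonymized)
print(relative_bias(prr_original, prr_anonymized))
```

A PRR well above 1 suggests that the event is reported disproportionately often with the drug. An anonymization method that suppresses or generalizes records shifts these counts, which is why the bias of the resulting signals serves as a utility measure.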

Results

The best version of PPMS-Anonymization, PPMS++-Anonymization, achieves nearly the same quality as MS-Anonymization in both privacy protection and data utility. Overall, PPMS++-Anonymization ensures zero privacy risk on record and attribute linkage, and exhibits 51%-78% and 59%-82% improvements on information loss over PPMS+-Anonymization and PPMS-Anonymization, respectively, and significantly reduces the bias of ADR signal.

Conclusions

The proposed PPMS(k, θ*)-bounding model and PPMS-Anonymization algorithm are effective in anonymizing SRS data sets in the periodical data publishing scenario, preventing the series of releases from disclosing personal sensitive information caused by BFL-attacks while maintaining the data utility for ADR signal detection.

A multicenter random forest model for effective prognosis prediction in collaborative clinical research network. Artif Intell Med 2020;103:101814. [PMID: 32143809 DOI: 10.1016/j.artmed.2020.101814] [Citation(s) in RCA: 20] [Impact Index Per Article: 5.0] [Received: 09/16/2019] [Revised: 02/04/2020] [Accepted: 02/04/2020] [Indexed: 12/17/2022]

Abstract

BACKGROUND

The accuracy of a prognostic prediction model has become an essential aspect of the quality and reliability of the health-related decisions made by clinicians in modern medicine. Unfortunately, individual institutions often lack sufficient samples, which might not provide sufficient statistical power for models. One mitigation is to expand data collection from a single institution to multiple centers to collectively increase the sample size. However, sharing sensitive biomedical data for research involves complicated issues. Machine learning models such as random forests (RF), though they are commonly used and achieve good performances for prognostic prediction, usually suffer worse performance under multicenter privacy-preserving data mining scenarios compared to a centrally trained version.

METHODS AND MATERIALS

In this study, a multicenter random forest prognosis prediction model is proposed that enables federated clinical data mining from horizontally partitioned datasets. By using a novel data enhancement approach based on a differentially private generative adversarial network customized to clinical prognosis data, the proposed model is able to provide a multicenter RF model with performances on par with, or even better than, centrally trained RF but without the need to aggregate the raw data. Moreover, our model also incorporates an importance ranking step designed for feature selection without sharing patient-level information.

RESULT

The proposed model was evaluated on colorectal cancer datasets from the US and China. Two groups of datasets with different levels of heterogeneity within the collaborative research network were selected. First, we compare the performance of the distributed random forest model under different privacy parameters with different percentages of enhancement datasets and validate the effectiveness and plausibility of our approach. Then, we compare the discrimination and calibration ability of the proposed multicenter random forest with a centrally trained random forest model and other tree-based classifiers as well as some commonly used machine learning methods. The results show that the proposed model can provide better prediction performance in terms of discrimination and calibration ability than the centrally trained RF model or the other candidate models while following the privacy-preserving rules in both groups. Additionally, good discrimination and calibration ability are shown on the simplified model based on the feature importance ranking in the proposed approach.

CONCLUSION

The proposed random forest model exhibits ideal prediction capability using multicenter clinical data and overcomes the performance limitation arising from privacy guarantees. It can also provide feature importance ranking across institutions without pooling the data at a central site. This study offers a practical solution for building a prognosis prediction model in the collaborative clinical research network and solves practical issues in real-world applications of medical artificial intelligence.

Hsiao MH, Lin WY, Hsu KY, Shen ZX. On Anonymizing Medical Microdata with Large-Scale Missing Values - A Case Study with the FAERS Dataset. ANNUAL INTERNATIONAL CONFERENCE OF THE IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. IEEE ENGINEERING IN MEDICINE AND BIOLOGY SOCIETY. ANNUAL INTERNATIONAL CONFERENCE 2020;2019:6505-6508. [PMID: 31947331 DOI: 10.1109/embc.2019.8857025] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Indexed: 11/06/2022]

Chevrier R, Foufi V, Gaudet-Blavignac C, Robert A, Lovis C. Use and Understanding of Anonymization and De-Identification in the Biomedical Literature: Scoping Review. J Med Internet Res 2019;21:e13484. [PMID: 31152528 PMCID: PMC6658290 DOI: 10.2196/13484] [Citation(s) in RCA: 42] [Impact Index Per Article: 8.4] [Received: 01/24/2019] [Revised: 03/29/2019] [Accepted: 04/26/2019] [Indexed: 01/19/2023] Open

Abstract

Background

The secondary use of health data is central to biomedical research in the era of data science and precision medicine. National and international initiatives, such as the Global Open Findable, Accessible, Interoperable, and Reusable (GO FAIR) initiative, are supporting this approach in different ways (eg, making the sharing of research data mandatory or improving the legal and ethical frameworks). Preserving patients’ privacy is crucial in this context. De-identification and anonymization are the two most common terms used to refer to the technical approaches that protect privacy and facilitate the secondary use of health data. However, it is difficult to find a consensus on the definitions of the concepts or on the reliability of the techniques used to apply them. A comprehensive review is needed to better understand the domain, its capabilities, its challenges, and the ratio of risk between the data subjects’ privacy on one side, and the benefit of scientific advances on the other.

Objective

This work aims at better understanding how the research community comprehends and defines the concepts of de-identification and anonymization. A rich overview should also provide insights into the use and reliability of the methods. Six aspects will be studied: (1) terminology and definitions, (2) backgrounds and places of work of the researchers, (3) reasons for anonymizing or de-identifying health data, (4) limitations of the techniques, (5) legal and ethical aspects, and (6) recommendations of the researchers.

Methods

Based on a scoping review protocol designed a priori, MEDLINE was searched for publications discussing de-identification or anonymization and published between 2007 and 2017. The search was restricted to MEDLINE to focus on the life sciences community. The screening process was performed by two reviewers independently.

Results

After searching 7972 records that matched at least one search term, 135 publications were screened and 60 full-text articles were included. (1) Terminology: Definitions of the terms de-identification and anonymization were provided in less than half of the articles (29/60, 48%). When both terms were used (41/60, 68%), their meanings divided the authors into two equal groups (19/60, 32%, each) with opposed views. The remaining articles (3/60, 5%) were equivocal. (2) Backgrounds and locations: Research groups were based predominantly in North America (31/60, 52%) and in the European Union (22/60, 37%). The authors came from 19 different domains; computer science (91/248, 36.7%), biomedical informatics (47/248, 19.0%), and medicine (38/248, 15.3%) were the most prevalent ones. (3) Purpose: The main reason declared for applying these techniques is to facilitate biomedical research. (4) Limitations: Progress is made on specific techniques but, overall, limitations remain numerous. (5) Legal and ethical aspects: Differences exist between nations in the definitions, approaches, and legal practices. (6) Recommendations: The combination of organizational, legal, ethical, and technical approaches is necessary to protect health data.

Conclusions

Interest is growing for privacy-enhancing techniques in the life sciences community. This interest crosses scientific boundaries, involving primarily computer science, biomedical informatics, and medicine. The variability observed in the use of the terms de-identification and anonymization emphasizes the need for clearer definitions as well as for better education and dissemination of information on the subject. The same observation applies to the methods. Several legislations, such as the American Health Insurance Portability and Accountability Act (HIPAA) and the European General Data Protection Regulation (GDPR), regulate the domain. Using the definitions they provide could help address the variable use of these two concepts in the research community.

Trippe ZA, Brendani B, Meier C, Lewis D. Identification of Substandard Medicines via Disproportionality Analysis of Individual Case Safety Reports. Drug Saf 2017;40:293-303. [PMID: 28130773 DOI: 10.1007/s40264-016-0499-5] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.9] [Indexed: 10/20/2022]

Abstract

INTRODUCTION

The distribution and use of substandard medicines (SSMs) is a public health concern worldwide. The detection of SSMs is currently limited to expensive large-scale assay techniques such as high-performance liquid chromatography (HPLC). Since 2013, the Pharmacovigilance Department at Novartis Pharma AG has been analyzing drug-associated adverse events related to 'product quality issues' with the aim of detecting defective medicines using spontaneous reporting. The method of identifying SSMs with spontaneous reporting was pioneered by the Monitoring Medicines project in 2011.

METHODS

This retrospective review was based on data from the World Health Organization (WHO) Global individual case safety report (ICSR) database VigiBase® collected from January 2001 to December 2014. We conducted three different stratification analyses using the Multi-item Gamma Poisson Shrinker (MGPS) algorithm through the Oracle Empirica data-mining software. In total, 24 preferred terms (PTs) from the Medical Dictionary for Regulatory Activities (MedDRA®) were used to identify poor-quality medicines. To identify potential SSMs for further evaluation, a cutoff of 2.0 for EB05, the lower 95% interval of the empirical Bayes geometric mean (EBGM), was applied. We carried out a literature search for advisory letters related to defective medicinal products to validate our findings. Furthermore, we aimed to assess whether we could confirm two SSMs first identified by the Uppsala Monitoring Centre (UMC) with our stratification method.

RESULTS

The analysis of ICSRs based on the specified selection criteria and threshold yielded 2506 hits including medicinal products with an excess of reports of product quality defects relative to other medicines in the database. Further investigations and a pilot study in five authorized medicinal products (proprietary and generic) licensed by a single marketing authorization holder, containing valsartan, methylphenidate, rivastigmine, clozapine, or carbamazepine, were performed. This resulted in an output of 23 potential SSMs. The literature search identified two communications issued to health professionals concerning a substandard rivastigmine patch, which validated our initial findings. Furthermore, we identified excess reporting of product quality issues with an ethinyl estradiol/norgestrel combination and with salbutamol. These were categorized as confirmed clusters of substandard/spurious/falsely labelled/falsified/counterfeit (SSFFC) medical products by the UMC in 2014.

CONCLUSION

This study illustrates the value of data mining of spontaneous adverse event reports and the applicability of disproportionality analysis to identify potential SSMs.

Lee H, Kim S, Kim JW, Chung YD. Utility-preserving anonymization for health data publishing. BMC Med Inform Decis Mak 2017;17:104. [PMID: 28693480 PMCID: PMC5504813 DOI: 10.1186/s12911-017-0499-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 2.1] [Received: 04/06/2017] [Accepted: 06/28/2017] [Indexed: 11/23/2022] Open

Abstract

BACKGROUND

Publishing raw electronic health records (EHRs) may be considered a breach of the privacy of individuals because they usually contain sensitive information. A common practice in privacy-preserving data publishing is to anonymize the data before publishing so that they satisfy privacy models such as k-anonymity. Among various anonymization techniques, generalization is the most commonly used in medical/health data processing. Generalization inevitably causes information loss, and thus various methods have been proposed to reduce information loss. However, existing generalization-based data anonymization methods cannot avoid excessive information loss and preserve data utility.

METHODS

We propose a utility-preserving anonymization for privacy preserving data publishing (PPDP). To preserve data utility, the proposed method comprises three parts: (1) utility-preserving model, (2) counterfeit record insertion, and (3) catalog of the counterfeit records. We also propose an anonymization algorithm using the proposed method. Our anonymization algorithm applies a full-domain generalization algorithm. We evaluate our method in comparison with the existing method on two aspects: information loss measured through various quality metrics, and error rate of the analysis results.
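As background for the full-domain generalization step mentioned in this abstract, here is a minimal Python sketch of the general technique: every value of an attribute is generalized to the same hierarchy level, and the search returns the lowest-height level combination that yields k-anonymity. It is not the authors' algorithm and omits their counterfeit record insertion and catalog. The hierarchies, attribute names, and records are hypothetical, and the exhaustive search is only practical for tiny examples.

```python
from collections import Counter
from itertools import product


def generalize_age(age, level):
    """Level 0: exact age, level 1: ten-year band, level 2: fully suppressed."""
    if level == 0:
        return str(age)
    if level == 1:
        low = (age // 10) * 10
        return f"{low}-{low + 9}"
    return "*"


def generalize_zip(zip_code, level):
    """Mask the last `level` characters of the ZIP code with '*'."""
    if level == 0:
        return zip_code
    return zip_code[:-level] + "*" * level


# Hypothetical hierarchies: attribute -> (generalization function, max level).
HIERARCHIES = {"age": (generalize_age, 2), "zip": (generalize_zip, 5)}


def apply_levels(records, levels):
    """Generalize every record with the same level per attribute (full-domain)."""
    return [
        tuple(HIERARCHIES[attr][0](rec[attr], lvl) for attr, lvl in levels.items())
        for rec in records
    ]


def is_k_anonymous(rows, k):
    """True if every distinct generalized row occurs at least k times."""
    return all(count >= k for count in Counter(rows).values())


def minimal_full_domain_generalization(records, k):
    """Try level combinations in order of total height; return the first that works."""
    attrs = list(HIERARCHIES)
    level_ranges = [range(HIERARCHIES[a][1] + 1) for a in attrs]
    for combo in sorted(product(*level_ranges), key=sum):
        levels = dict(zip(attrs, combo))
        rows = apply_levels(records, levels)
        if is_k_anonymous(rows, k):
            return levels, rows
    return None  # even full suppression cannot reach k (fewer than k records)


# Hypothetical raw records (illustrative only).
records = [
    {"age": 34, "zip": "12345"},
    {"age": 37, "zip": "12349"},
    {"age": 42, "zip": "12377"},
    {"age": 48, "zip": "12371"},
]

result = minimal_full_domain_generalization(records, k=2)
print(result)
```

For these four records the search stops at ten-year age bands and one masked ZIP digit. Each further level discards more detail, which is the information loss that utility-preserving methods aim to limit.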

RESULTS

With all different types of quality metrics, our proposed method shows lower information loss than the existing method. In the real-world EHR analysis, the results show a small portion of error between the data anonymized through the proposed method and the original data.

CONCLUSIONS

We propose a new utility-preserving anonymization method and an anonymization algorithm using the proposed method. Through experiments on various datasets, we show that the utility of EHRs anonymized by the proposed method is significantly better than that of EHRs anonymized by previous approaches.
