1
Koll CEM, Hopff SM, Meurers T, Lee CH, Kohls M, Stellbrink C, Thibeault C, Reinke L, Steinbrecher S, Schreiber S, Mitrov L, Frank S, Miljukov O, Erber J, Hellmuth JC, Reese JP, Steinbeis F, Bahmer T, Hagen M, Meybohm P, Hansch S, Vadász I, Krist L, Jiru-Hillmann S, Prasser F, Vehreschild JJ. Statistical biases due to anonymization evaluated in an open clinical dataset from COVID-19 patients. Sci Data 2022; 9:776. [PMID: 36543828 PMCID: PMC9769467 DOI: 10.1038/s41597-022-01669-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/27/2022] [Accepted: 08/30/2022] [Indexed: 12/24/2022] Open
Abstract
Anonymization has the potential to foster the sharing of medical data. State-of-the-art methods use mathematical models to modify data to reduce privacy risks. However, the degree of protection must be balanced against the impact on statistical properties. We studied an extreme case of this trade-off: the statistical validity of an open medical dataset based on the German National Pandemic Cohort Network (NAPKON), which was prepared for publication using a strong anonymization procedure. Descriptive statistics and results of regression analyses were compared before and after anonymization of multiple variants of the original dataset. Despite significant differences in value distributions, the statistical bias was found to be small in all cases. In the regression analyses, the median absolute deviations of the estimated adjusted odds ratios for different sample sizes ranged from 0.01 [minimum = 0, maximum = 0.58] to 0.52 [minimum = 0.25, maximum = 0.91]. Disproportionate impact on the statistical properties of data is a common argument against the use of anonymization. Our analysis demonstrates that anonymization can actually preserve validity of statistical results in relatively low-dimensional data.
Affiliation(s)
- Carolin E M Koll
  - University of Cologne, Faculty of Medicine and University Hospital Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, Cologne, Germany
- Sina M Hopff
  - University of Cologne, Faculty of Medicine and University Hospital Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, Cologne, Germany
- Thierry Meurers
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Chin Huang Lee
  - University of Cologne, Faculty of Medicine and University Hospital Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, Cologne, Germany
- Mirjam Kohls
  - University of Wuerzburg, Faculty of Medicine, Institute for Clinical Epidemiology and Biometry, Wuerzburg, Germany
- Christoph Stellbrink
  - Department of Cardiology and Intensive Care Medicine, Bielefeld Medical Centre, Medical Faculty OWL, University of Bielefeld, Bielefeld, Germany
- Charlotte Thibeault
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Lennart Reinke
  - Internal Medicine Department I, University Medical Center Schleswig-Holstein Campus Kiel, Kiel, Germany
- Sarah Steinbrecher
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Stefan Schreiber
  - Internal Medicine Department I, University Medical Center Schleswig-Holstein Campus Kiel, Kiel, Germany
- Lazar Mitrov
  - University of Cologne, Faculty of Medicine and University Hospital Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, Cologne, Germany
- Sandra Frank
  - Department of Anesthesiology, University Hospital of Ludwig-Maximilians-University (LMU), Munich, Germany
  - Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
- Olga Miljukov
  - University of Wuerzburg, Faculty of Medicine, Institute for Clinical Epidemiology and Biometry, Wuerzburg, Germany
- Johanna Erber
  - Technical University of Munich, School of Medicine, University Hospital rechts der Isar, Department of Internal Medicine II, Munich, Germany
- Johannes C Hellmuth
  - Department of Medicine III, University Hospital, LMU Munich, Munich, Germany
  - COVID-19 Registry of the LMU Munich (CORKUM), University Hospital, LMU Munich, Munich, Germany
- Jens-Peter Reese
  - University of Wuerzburg, Faculty of Medicine, Institute for Clinical Epidemiology and Biometry, Wuerzburg, Germany
- Fridolin Steinbeis
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin and Humboldt Universität zu Berlin, Berlin, Germany
- Thomas Bahmer
  - Internal Medicine Department I, University Medical Center Schleswig-Holstein Campus Kiel, Kiel, Germany
  - Airway Research Center North (ARCN), German Center for Lung Research (DZL), Großhansdorf, Germany
- Marina Hagen
  - Department II for Internal Medicine, Hematology/Oncology, University Hospital Frankfurt, Frankfurt am Main, Germany
- Patrick Meybohm
  - Department of Anaesthesiology, Intensive Care, Emergency and Pain Medicine, University Hospital Wuerzburg, Wuerzburg, Germany
- Stefan Hansch
  - Department of Infection Prevention and Infectious Diseases, University Hospital Regensburg, Regensburg, Germany
- István Vadász
  - Department of Internal Medicine, Justus Liebig University, Universities of Giessen and Marburg Lung Center (UGMLC), Member of the German Center for Lung Research (DZL), Giessen, Germany
  - The Cardio-Pulmonary Institute (CPI), Giessen, Germany
- Lilian Krist
  - Institute of Social Medicine, Epidemiology and Health Economics, Charité-Universitätsmedizin Berlin, Berlin, Germany
- Steffi Jiru-Hillmann
  - University of Wuerzburg, Faculty of Medicine, Institute for Clinical Epidemiology and Biometry, Wuerzburg, Germany
- Fabian Prasser
  - Berlin Institute of Health at Charité - Universitätsmedizin Berlin, Charitéplatz 1, 10117, Berlin, Germany
- Jörg Janne Vehreschild
  - University of Cologne, Faculty of Medicine and University Hospital Cologne, Department I of Internal Medicine, Center for Integrated Oncology Aachen Bonn Cologne Duesseldorf, Cologne, Germany
  - Department II for Internal Medicine, Hematology/Oncology, University Hospital Frankfurt, Frankfurt am Main, Germany
  - German Centre for Infection Research (DZIF), partner site Bonn-Cologne, Cologne, Germany
2
Jakob CEM, Kohlmayer F, Meurers T, Vehreschild JJ, Prasser F. Design and evaluation of a data anonymization pipeline to promote Open Science on COVID-19. Sci Data 2020; 7:435. [PMID: 33303746 PMCID: PMC7729909 DOI: 10.1038/s41597-020-00773-y] [Citation(s) in RCA: 31] [Impact Index Per Article: 6.2] [Received: 07/28/2020] [Accepted: 11/13/2020] [Indexed: 11/24/2022] Open
Abstract
The Lean European Open Survey on SARS-CoV-2 Infected Patients (LEOSS) is a European registry for studying the epidemiology and clinical course of COVID-19. To support evidence-generation at the rapid pace required in a pandemic, LEOSS follows an Open Science approach, making data available to the public in real-time. To protect patient privacy, quantitative anonymization procedures are used to protect the continuously published data stream consisting of 16 variables on the course and therapy of COVID-19 from singling out, inference and linkage attacks. We investigated the bias introduced by this process and found that it has very little impact on the quality of output data. Current laws do not specify requirements for the application of formal anonymization methods, there is a lack of guidelines with clear recommendations, and few real-world applications of quantitative anonymization procedures have been described in the literature. We therefore believe that our work can help others with developing urgently needed anonymization pipelines for their projects.
Affiliation(s)
- Thierry Meurers
  - Berlin Institute of Health (BIH), Berlin, Germany
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany
- Jörg Janne Vehreschild
  - University Hospital Cologne, Cologne, Germany
  - German Center for Infection Research (DZIF), partner site Bonn-Cologne, Cologne, Germany
  - Department of Internal Medicine, Hematology and Oncology, Goethe University Frankfurt, Frankfurt am Main, Germany
- Fabian Prasser
  - Berlin Institute of Health (BIH), Berlin, Germany
  - Charité - Universitätsmedizin Berlin, corporate member of Freie Universität Berlin, Humboldt-Universität zu Berlin, and Berlin Institute of Health, Berlin, Germany
3
Liu Y, Wan Z, Xia W, Kantarcioglu M, Vorobeychik Y, Clayton EW, Kho A, Carrell D, Malin BA. Detecting the Presence of an Individual in Phenotypic Summary Data. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2018; 2018:760-769. [PMID: 30815118 PMCID: PMC6371366] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/09/2023]
Abstract
As the quantity and detail of association studies between clinical phenotypes and genotypes grows, there is a push to make summary statistics widely available. Genome wide summary statistics have been shown to be vulnerable to the inference of a targeted individual's presence. In this paper, we show that presence attacks are feasible with phenome wide summary statistics as well. We use data from three healthcare organizations and an online resource that publishes summary statistics. We introduce a novel attack that achieves over 80% recall and precision within a population of 16,346, where 8,173 individuals are targets. However, the feasibility of the attack is dependent on the attacker's knowledge about 1) the targeted individual and 2) the reference dataset. Within a population of over 2 million, where 8,173 individuals are targets, our attack achieves 31% recall and 17% precision. As a result, it is plausible that sharing of phenomic summary statistics may be accomplished with an acceptable level of privacy risk.
Affiliation(s)
- Yongtai Liu
  - Vanderbilt University, Nashville, Tennessee, USA
- Zhiyu Wan
  - Vanderbilt University, Nashville, Tennessee, USA
- Weiyi Xia
  - Vanderbilt University, Nashville, Tennessee, USA
- Abel Kho
  - Northwestern University, Chicago, Illinois, USA
- David Carrell
  - Kaiser Permanente Washington Health Research Institute, Seattle, Washington, USA
4
Holub P, Kohlmayer F, Prasser F, Mayrhofer MT, Schlünder I, Martin GM, Casati S, Koumakis L, Wutte A, Kozera Ł, Strapagiel D, Anton G, Zanetti G, Sezerman OU, Mendy M, Valík D, Lavitrano M, Dagher G, Zatloukal K, van Ommen GB, Litton JE. Enhancing Reuse of Data and Biological Material in Medical Research: From FAIR to FAIR-Health. Biopreserv Biobank 2018; 16:97-105. [PMID: 29359962 PMCID: PMC5906729 DOI: 10.1089/bio.2017.0110] [Citation(s) in RCA: 50] [Impact Index Per Article: 7.1] [Indexed: 01/16/2023] Open
Abstract
The known challenge of underutilization of data and biological material from biorepositories as potential resources for medical research has been the focus of discussion for over a decade. Recently developed guidelines for improved data availability and reusability, entitled FAIR Principles (Findability, Accessibility, Interoperability, and Reusability), are likely to address only parts of the problem. In this article, we argue that biological material and data should be viewed as a unified resource. This approach would facilitate access to complete provenance information, which is a prerequisite for reproducibility and meaningful integration of the data. A unified view also allows for optimization of long-term storage strategies, as demonstrated in the case of biobanks. We propose an extension of the FAIR Principles to include the following additional components: (1) quality aspects related to research reproducibility and meaningful reuse of the data, (2) incentives to stimulate effective enrichment of data sets and biological material collections and its reuse on all levels, and (3) privacy-respecting approaches for working with the human material and data. These FAIR-Health principles should then be applied to both the biological material and data. We also propose the development of common guidelines for cloud architectures, due to the unprecedented growth of volume and breadth of medical data generation, as well as the associated need to process the data efficiently.
Affiliation(s)
- Sara Casati
  - BBMRI.it and Universita degli Studi di Milano-Bicocca, Milano, Italy
- Lefteris Koumakis
  - BBMRI.gr and Foundation for Research and Technology-Hellas, Heraklion, Greece
- Łukasz Kozera
  - BBMRI.pl and Wroclaw Research Centre EIT+, Wroclaw, Poland
- Maimuna Mendy
  - BBMRI.IARC and International Agency for Research on Cancer, Lyon, France
- Dalibor Valík
  - BBMRI.cz and Masaryk Memorial Cancer Institute, Brno, Czech Republic
5
Heatherly R, Rasmussen LV, Peissig PL, Pacheco JA, Harris P, Denny JC, Malin BA. A multi-institution evaluation of clinical profile anonymization. J Am Med Inform Assoc 2016; 23:e131-7. [PMID: 26567325 PMCID: PMC4954623 DOI: 10.1093/jamia/ocv154] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Received: 01/29/2015] [Revised: 08/17/2015] [Accepted: 09/09/2015] [Indexed: 11/13/2022] Open
Abstract
BACKGROUND AND OBJECTIVE: There is an increasing desire to share de-identified electronic health records (EHRs) for secondary uses, but there are concerns that clinical terms can be exploited to compromise patient identities. Anonymization algorithms mitigate such threats while enabling novel discoveries, but their evaluation has been limited to single institutions. Here, we study how an existing clinical profile anonymization fares at multiple medical centers. METHODS: We apply a state-of-the-art k-anonymization algorithm, with k set to the standard value 5, to the International Classification of Disease, ninth edition codes for patients in a hypothyroidism association study at three medical centers: Marshfield Clinic, Northwestern University, and Vanderbilt University. We assess utility when anonymizing at three population levels: all patients in 1) the EHR system; 2) the biorepository; and 3) a hypothyroidism study. We evaluate utility using 1) changes to the number included in the dataset, 2) number of codes included, and 3) regions generalization and suppression were required. RESULTS: Our findings yield several notable results. First, we show that anonymizing in the context of the entire EHR yields a significantly greater quantity of data by reducing the amount of generalized regions from ∼15% to ∼0.5%. Second, ∼70% of codes that needed generalization only generalized two or three codes in the largest anonymization. CONCLUSIONS: Sharing large volumes of clinical data in support of phenome-wide association studies is possible while safeguarding privacy to the underlying individuals.
Affiliation(s)
- Raymond Heatherly
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
- Luke V Rasmussen
  - Feinberg School of Medicine, Northwestern University, Chicago, IL, USA
- Peggy L Peissig
  - Biomedical Informatics Research Center, Marshfield Clinic Research Foundation, Marshfield, WI, USA
- Paul Harris
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
  - Department of Biomedical Engineering, Vanderbilt University, Nashville, TN, USA
- Joshua C Denny
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
  - Department of Medicine, Vanderbilt University, Nashville, TN, USA
- Bradley A Malin
  - Department of Biomedical Informatics, Vanderbilt University, Nashville, TN, USA
  - Department of Electrical Engineering & Computer Science, Vanderbilt University, Nashville, TN, USA