1. Vranopoulos G, Clarke N, Atkinson S. Big Data Confidentiality: An Approach Toward Corporate Compliance Using a Rule-Based System. Big Data 2023. [PMID: 37906117 DOI: 10.1089/big.2022.0201] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/02/2023]
Abstract
Organizations have been investing in analytics that rely on internal and external data to gain a competitive advantage. However, the legal and regulatory acts imposed nationally and internationally have become a challenge, especially for highly regulated sectors such as health or finance/banking. Data handlers such as Facebook and Amazon have already sustained considerable fines or are under investigation due to violations of data governance. The era of big data has further intensified the challenge of minimizing the risk of data loss by introducing the dimensions of Volume, Velocity, and Variety into confidentiality. Although Volume and Velocity have been extensively researched, Variety, "the ugly duckling" of big data, is often neglected and difficult to solve, thus increasing the risk of data exposure and data loss. To mitigate this risk, this article proposes a framework that uses algorithmic classification and workflow capabilities to provide a consistent approach toward data evaluations across the organization. A rule-based system, implementing the corporate data classification policy, will minimize the risk of exposure by helping users identify the approved guidelines and enforce them quickly. The framework includes an exception handling process with appropriate approval for extenuating circumstances. The system was implemented as a proof-of-concept working prototype to showcase the capabilities and provide hands-on experience. The information system was evaluated and accredited by a diverse audience of academics and senior business executives in the fields of security and data management. The audience had an average experience of ∼25 years and a combined experience of almost three centuries (294 years). The results confirmed that the 3Vs are of concern and that Variety, named by 90% of the commentators, is the most troubling. In addition, with an approximate average of 60%, it was confirmed that appropriate policies, procedures, and prerequisites for classification are in place while implementation tools are lagging.
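As a rough illustration of the rule-based idea described in this abstract, the sketch below checks a value against an ordered list of rules and returns the sensitivity label of the first match, falling back to a review flag when nothing matches. The rule names, regular expressions, labels, and the `classify` function are all hypothetical; none of them is taken from the paper's prototype.

```python
import re

# Hypothetical policy rules, ordered from most to least restrictive.
# Each rule is (name, compiled pattern, label to assign on match).
RULES = [
    ("payment_card", re.compile(r"\b(?:\d[ -]?){13,16}\b"), "restricted"),
    ("email", re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"), "confidential"),
]


def classify(value: str) -> str:
    """Return the label of the first matching rule, or flag for review."""
    for _name, pattern, label in RULES:
        if pattern.search(value):
            return label
    # Stand-in for an exception-handling step: nothing matched,
    # so a person would have to decide.
    return "needs_review"


for sample in ["4111 1111 1111 1111", "alice@example.com", "meeting notes"]:
    print(sample, "->", classify(sample))
```

Keeping the rules in a plain ordered list makes the policy easy to audit, which is one plausible reason to prefer explicit rules over a learned classifier in a compliance setting.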
Affiliation(s)
- Georgios Vranopoulos
  - School of Engineering, Computing and Mathematics, University of Plymouth, Plymouth, United Kingdom
- Nathan Clarke
  - School of Engineering, Computing and Mathematics, University of Plymouth, Plymouth, United Kingdom
- Shirley Atkinson
  - School of Engineering, Computing and Mathematics, University of Plymouth, Plymouth, United Kingdom
2. Sepas A, Bangash AH, Alraoui O, El Emam K, El-Hussuna A. Algorithms to anonymize structured medical and healthcare data: A systematic review. Front Bioinform 2022; 2:984807. [PMID: 36619476 PMCID: PMC9815524 DOI: 10.3389/fbinf.2022.984807] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/02/2022] [Accepted: 11/28/2022] [Indexed: 12/24/2022] [Open Access]
Abstract
Introduction: With many anonymization algorithms developed for structured medical health data (SMHD) in the last decade, our systematic review provides a comprehensive bird's-eye view of algorithms for SMHD anonymization.
Methods: This systematic review was conducted according to the recommendations in the Cochrane Handbook for Reviews of Interventions and reported according to the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA). Eligible articles from the PubMed, ACM digital library, Medline, IEEE, Embase, Web of Science Collection, Scopus, ProQuest Dissertation, and Theses Global databases were identified through systematic searches. The following parameters were extracted from the eligible studies: author, year of publication, sample size, and relevant algorithms and/or software applied to anonymize SMHD, along with the summary of outcomes.
Results: Among 1,804 initial hits, the present study considered 63 records including research articles, reviews, and books. Seventy-five evaluated the anonymization of demographic data, 18 assessed diagnosis codes, and 3 assessed genomic data. One of the most common approaches was k-anonymity, which was utilized mainly for demographic data, often in combination with another algorithm, e.g., l-diversity. No approaches have yet been developed for protection against membership disclosure attacks on diagnosis codes.
Conclusion: This study reviewed and categorized different anonymization approaches for MHD according to the anonymized data types (demographics, diagnosis codes, and genomic data). Further research is needed to develop more efficient algorithms for the anonymization of diagnosis codes and genomic data. The risk of reidentification can be minimized with adequate application of the addressed anonymization approaches.
Systematic Review Registration: [http://www.crd.york.ac.uk/prospero], identifier [CRD42021228200].
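For readers unfamiliar with k-anonymity, which this abstract names as one of the most common approaches, the following minimal sketch measures it: a table is k-anonymous when every combination of quasi-identifier values is shared by at least k records. The field names and sample records are made up for illustration and do not come from any of the reviewed studies.

```python
from collections import Counter


def k_anonymity(records, quasi_identifiers):
    """Return the size of the smallest group of records that share
    the same values on all quasi-identifier fields."""
    groups = Counter(
        tuple(record[field] for field in quasi_identifiers)
        for record in records
    )
    return min(groups.values())


# Made-up records with generalized demographics and a diagnosis code.
records = [
    {"age_band": "30-39", "zip3": "021", "diagnosis": "J45"},
    {"age_band": "30-39", "zip3": "021", "diagnosis": "E11"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "I10"},
    {"age_band": "40-49", "zip3": "021", "diagnosis": "J45"},
]

print(k_anonymity(records, ["age_band", "zip3"]))
```

Treating the diagnosis code as an additional quasi-identifier makes every record in this sample unique, which hints at why diagnosis codes are harder to protect than coarse demographics.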
Affiliation(s)
- Ali Sepas
  - Open Source Research Collaboration, Aalborg, Denmark
  - Department of Materials and Production, Aalborg University, Aalborg, Denmark
- Ali Haider Bangash
  - Open Source Research Collaboration, Aalborg, Denmark
  - STMU Shifa College of Medicine, Islamabad, Pakistan
- Omar Alraoui
  - Department of Health Science and Technology, Aalborg University, Aalborg, Denmark
- Khaled El Emam
  - Canada Research Chair in Medical AI, University of Ottawa, Ottawa, ON, Canada
3. Alexander J, Beatty A. Nonspecific deidentification of date-like text in deidentified clinical notes enables reidentification of dates. J Am Med Inform Assoc 2022; 29:1967-1971. [PMID: 36217861 PMCID: PMC9552287 DOI: 10.1093/jamia/ocac147] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/13/2022] [Revised: 07/19/2022] [Accepted: 08/14/2022] [Indexed: 06/16/2023] [Open Access]
Abstract
To facilitate the secondary usage of electronic health record data for research, the University of California, San Francisco (UCSF) recently implemented a clinical data warehouse including, among other data, deidentified clinical notes and reports, which are available to UCSF researchers without Institutional Review Board approval. For deidentification of these notes, most of the Health Insurance Portability and Accountability Act identifiers are redacted, but dates are transformed by shifting all dates for a patient back by the same random number of days. We describe an issue in which nonspecific (ie, excess) transformation of nondate, date-like text by this deidentification process enables reidentification of all dates, including birthdates, for certain patients. This issue undercuts the common assumption that excess deidentification is a safe tradeoff to protect patient privacy. We present this issue as a caution to other institutions that may also be considering releasing deidentified notes for research.
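The weakness of a single per-patient shift can be sketched in a few lines. The abstract does not spell out the exact mechanism, so the example below assumes the simplest case: a reader can infer the original value behind one shifted token, which is enough to recover the offset and therefore every other date for that patient. The dates, the offset, and the function names are hypothetical.

```python
from datetime import date, timedelta


def shift_dates(dates, offset_days):
    """Shift every date back by the same number of days."""
    return [d - timedelta(days=offset_days) for d in dates]


def recover_offset(known_original, shifted_value):
    """Infer the offset from one value whose original form is known."""
    return (known_original - shifted_value).days


# Hypothetical patient: a birthdate and a visit date, shifted by 37 days.
original = [date(1960, 5, 17), date(2021, 3, 2)]
shifted = shift_dates(original, offset_days=37)

# Assumption: the reader can tell what the second token originally was.
offset = recover_offset(original[1], shifted[1])
restored = [d + timedelta(days=offset) for d in shifted]

print(offset, restored[0])
```

Because the offset is constant per patient, one leaked pair of original and shifted values undoes the protection for the whole record.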
Affiliation(s)
- Jes Alexander
  - Department of Radiation Oncology, University of California, San Francisco, San Francisco, California, USA
- Alexis Beatty
  - Department of Epidemiology and Biostatistics and Department of Medicine, Division of Cardiology, University of California, San Francisco, San Francisco, California, USA
4. Sun S, Ma S, Song JH, Yue WH, Lin XL, Ma T. Experiments and Analyses of Anonymization Mechanisms for Trajectory Data Publishing. J Comput Sci Technol 2022; 37:1026-1048. [PMID: 36281257 PMCID: PMC9581755 DOI: 10.1007/s11390-022-2409-x] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/13/2022] [Accepted: 09/21/2022] [Indexed: 06/16/2023]
Abstract
With the advance of location-detection technologies and the increasing popularity of mobile phones and other location-aware devices, trajectory data is continuously growing. While large-scale trajectories provide opportunities for various applications, the locations in trajectories pose a threat to individual privacy. Recently, there has been an interesting debate in Science magazine on the reidentifiability of individuals. The main finding of Sánchez et al. is exactly opposite to that of De Montjoye et al., which raises the first question: "what is the true situation of the privacy preservation for trajectories in terms of reidentification?" Furthermore, it is known that anonymization typically causes a decline of data utility, and anonymization mechanisms need to consider the trade-off between privacy and utility. This raises the second question: "what is the true situation of the utility of anonymized trajectories?" To answer these two questions, we conduct a systematic experimental study, using three real-life trajectory datasets, five existing anonymization mechanisms (i.e., identifier anonymization, grid-based anonymization, dummy trajectories, k-anonymity and ε-differential privacy), and two practical applications (i.e., travel time estimation and window range queries). Our findings reveal the true situation of the privacy preservation for trajectories in terms of reidentification and the true situation of the utility of anonymized trajectories, and essentially close the debate between De Montjoye et al. and Sánchez et al. To the best of our knowledge, this study is among the first systematic evaluations and analyses of anonymized trajectories on individual privacy in terms of unicity and on utility in terms of practical applications.
Supplementary information: The online version contains supplementary material available at 10.1007/s11390-022-2409-x.
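Unicity, the privacy measure this abstract refers to, is commonly understood as the fraction of individuals whose trajectory is the only one containing a small set of points drawn from it. The sketch below assumes that reading and represents each trajectory as a set of hypothetical (cell, hour) pairs; it is not the evaluation code used in the study.

```python
import random


def unicity(trajectories, p, seed=0):
    """Fraction of trajectories uniquely pinned down by p of their own
    points, sampled at random with a fixed seed for repeatability."""
    rng = random.Random(seed)
    unique = 0
    for trajectory in trajectories:
        # Sort first so that sampling does not depend on set ordering.
        points = rng.sample(sorted(trajectory), min(p, len(trajectory)))
        matches = sum(1 for other in trajectories if set(points) <= other)
        if matches == 1:
            unique += 1
    return unique / len(trajectories)


# Hypothetical trajectories as sets of (cell, hour) observations.
trajectories = [
    {("A", 8), ("B", 9), ("C", 18)},
    {("A", 8), ("B", 9), ("D", 18)},
    {("E", 7), ("F", 12), ("G", 20)},
]

print(unicity(trajectories, p=3))
```

With smaller values of `p` the result depends on which points are sampled, since the first two trajectories share two of their three observations.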
Affiliation(s)
- She Sun
  - State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, 100191 China
- Shuai Ma
  - State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, 100191 China
- Jing-He Song
  - State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, 100191 China
- Wen-Hai Yue
  - State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, 100191 China
- Xue-Lian Lin
  - State Key Laboratory of Software Development Environment, School of Computer Science and Engineering, Beihang University, Beijing, 100191 China
- Tiejun Ma
  - Department of Decision Analytics and Risk, Southampton Business School, University of Southampton, Southampton, SO17 1BJ UK
5. Parobek CM, Thorsen MM, Has P, Lorenzi P, Clark MA, Russo ML, Lewkowitz AK. Video education about genetic privacy and patient perspectives about sharing prenatal genetic data: a randomized trial. Am J Obstet Gynecol 2022; 227:87.e1-87.e13. [PMID: 35351406 DOI: 10.1016/j.ajog.2022.03.047] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/10/2021] [Revised: 03/18/2022] [Accepted: 03/24/2022] [Indexed: 11/01/2022]
Abstract
BACKGROUND: Laboratories offering cell-free DNA often reserve the right to share prenatal genetic data for research or even commercial purposes, and obtain this permission on the patient consent form. Although it is known that nonpregnant patients are often reluctant to share their genetic data for research, pregnant patients' knowledge of, and opinions about, genetic data privacy are unknown.
OBJECTIVE: We investigated whether pregnant patients who had already undergone cell-free DNA screening were aware that genetic data derived from cell-free DNA may be shared for research. Furthermore, we examined whether pregnant patients exposed to video education about the Genetic Information Nondiscrimination Act (a federal law that mandates workplace and health insurance protections against genetic discrimination) were more willing to share cell-free DNA-related genetic data for research than pregnant patients who were unexposed.
STUDY DESIGN: In this randomized controlled trial (ClinicalTrials.gov Identifier: NCT04420858), English-speaking patients with singleton pregnancies who underwent cell-free DNA and subsequently presented at 17 0/7 to 23 6/7 weeks of gestation for a detailed anatomy scan were randomized 1:1 to a control or intervention group. Both groups viewed an infographic about cell-free DNA. In addition, the intervention group viewed an educational video about the Genetic Information Nondiscrimination Act. The primary outcomes were knowledge about, and willingness to share, prenatal genetic data from cell-free DNA by commercial laboratories for nonclinical purposes, such as research. The secondary outcomes included knowledge about existing genetic privacy laws, knowledge about the potential for reidentification of anonymized genetic data, and acceptability of various use and sharing scenarios for prenatal genetic data. Eighty-one participants per group were required for 80% power to detect an increase in willingness to share data from 60% to 80% (α=0.05).
RESULTS: A total of 747 pregnant patients were screened, and 213 patients were deemed eligible and approached for potential study participation. Of these patients, 163 (76.5%) consented and were randomized; one participant discontinued the intervention, and two participants were excluded from analysis after the intervention when it was discovered that they did not fulfill all eligibility criteria. Overall, 160 (75.1%) of those approached were included in the final analysis. Most patients in the control (72 [90.0%]) and intervention (76 [97.4%]) groups were either unsure about or incorrectly thought that cell-free DNA companies could not share prenatal genetic data for research. Participants in the intervention group were more likely to incorrectly believe that their prenatal genetic data would not be shared for nonclinical purposes than participants in the control group (28.8% in the control group vs 46.2% in the intervention; P=.03). However, video education did not increase participant willingness to share genetic data in multiple scenarios. Non-White participants were less willing than White participants to allow sharing of genetic data specifically for academic research (P<.001).
CONCLUSION: Most participants were unaware that their prenatal genetic data may be used for nonclinical purposes. Pregnant patients who were educated about the Genetic Information Nondiscrimination Act were not more willing to share genetic data than those who did not receive this education. Surprisingly, video education about the Genetic Information Nondiscrimination Act led patients to falsely believe that their data would not be shared for research, and participants who identified as racial minorities were less willing to share genetic data. New strategies are needed to improve pregnant patients' understanding of genetic privacy.
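The sample-size statement in the study design (81 per group for 80% power to detect a rise from 60% to 80% at α=0.05) can be sanity-checked with a standard two-proportion calculation. The abstract does not say which formula the authors used; the two-sided, pooled-variance normal approximation below is an assumption, and `n_per_group` is a name introduced here for illustration.

```python
from math import sqrt
from statistics import NormalDist


def n_per_group(p1, p2, alpha=0.05, power=0.80):
    """Approximate per-group sample size for comparing two proportions
    (two-sided test, normal approximation, pooled variance under H0)."""
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_beta = NormalDist().inv_cdf(power)
    p_bar = (p1 + p2) / 2
    numerator = (
        z_alpha * sqrt(2 * p_bar * (1 - p_bar))
        + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))
    ) ** 2
    return numerator / (p1 - p2) ** 2


n = n_per_group(0.60, 0.80)
print(round(n, 1))
```

Under these assumptions the result is a little over 81 per group, in line with the figure quoted in the abstract; other common variants of the formula give slightly different numbers.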
6. Lippert C, Sabatini R, Maher MC, Kang EY, Lee S, Arikan O, Harley A, Bernal A, Garst P, Lavrenko V, Yocum K, Wong T, Zhu M, Yang WY, Chang C, Lu T, Lee CWH, Hicks B, Ramakrishnan S, Tang H, Xie C, Piper J, Brewerton S, Turpaz Y, Telenti A, Roby RK, Och FJ, Venter JC. Identification of individuals by trait prediction using whole-genome sequencing data. Proc Natl Acad Sci U S A 2017; 114:10166-10171. [PMID: 28874526 PMCID: PMC5617305 DOI: 10.1073/pnas.1711125114] [Citation(s) in RCA: 96] [Impact Index Per Article: 13.7] [Indexed: 11/18/2022] [Open Access]
Abstract
Prediction of human physical traits and demographic information from genomic data challenges privacy and data deidentification in personalized medicine. To explore the current capabilities of phenotype-based genomic identification, we applied whole-genome sequencing, detailed phenotyping, and statistical modeling to predict biometric traits in a cohort of 1,061 participants of diverse ancestry. Individually, for a large fraction of the traits, their predictive accuracy beyond ancestry and demographic information is limited. However, we have developed a maximum entropy algorithm that integrates multiple predictions to determine which genomic samples and phenotype measurements originate from the same person. Using this algorithm, we have reidentified an average of >8 of 10 held-out individuals in an ethnically mixed cohort and an average of 5 of either 10 African Americans or 10 Europeans. This work challenges current conceptions of personal privacy and may have far-reaching ethical and legal implications.
Affiliation(s)
- Okan Arikan
  - Human Longevity, Inc., Mountain View, CA 94303
- Axel Bernal
  - Human Longevity, Inc., Mountain View, CA 94303
- Peter Garst
  - Human Longevity, Inc., Mountain View, CA 94303
- Ken Yocum
  - Human Longevity, Inc., Mountain View, CA 94303
- Mingfu Zhu
  - Human Longevity, Inc., Mountain View, CA 94303
- Chris Chang
  - Human Longevity, Inc., Mountain View, CA 94303
- Tim Lu
  - Human Longevity, Inc., San Diego, CA 92121
- Barry Hicks
  - Human Longevity, Inc., Mountain View, CA 94303
- Haibao Tang
  - Human Longevity, Inc., Mountain View, CA 94303
- Chao Xie
  - Human Longevity Singapore, Pte. Ltd., Singapore 138542
- Jason Piper
  - Human Longevity Singapore, Pte. Ltd., Singapore 138542
- Yaron Turpaz
  - Human Longevity, Inc., San Diego, CA 92121
  - Human Longevity Singapore, Pte. Ltd., Singapore 138542
- Rhonda K Roby
  - Human Longevity, Inc., San Diego, CA 92121
  - J. Craig Venter Institute, La Jolla, CA 92037
- Franz J Och
  - Human Longevity, Inc., Mountain View, CA 94303
- J Craig Venter
  - Human Longevity, Inc., San Diego, CA 92121
  - J. Craig Venter Institute, La Jolla, CA 92037
7. El Emam K, Hu J, Mercer J, Peyton L, Kantarcioglu M, Malin B, Buckeridge D, Samet S, Earle C. A secure protocol for protecting the identity of providers when disclosing data for disease surveillance. J Am Med Inform Assoc 2011; 18:212-7. [PMID: 21486880 PMCID: PMC3078664 DOI: 10.1136/amiajnl-2011-000100] [Citation(s) in RCA: 30] [Impact Index Per Article: 2.3] [Received: 01/16/2011] [Accepted: 02/03/2011] [Indexed: 11/13/2022] [Open Access]
Abstract
BACKGROUND: Providers have been reluctant to disclose patient data for public-health purposes. Even if patient privacy is ensured, the desire to protect provider confidentiality has been an important driver of this reluctance.
METHODS: Six requirements for a surveillance protocol were defined that satisfy the confidentiality needs of providers and ensure utility to public health. The authors developed a secure multi-party computation protocol using the Paillier cryptosystem to allow the disclosure of stratified case counts and denominators to meet these requirements. The authors evaluated the protocol in a simulated environment on its computation performance and ability to detect disease outbreak clusters.
RESULTS: Theoretical and empirical assessments demonstrate that all requirements are met by the protocol. A system implementing the protocol scales linearly in terms of computation time as the number of providers is increased. The absolute time to perform the computations was 12.5 s for data from 3000 practices. This is acceptable performance, given that the reporting would normally be done at 24 h intervals. The accuracy of detecting disease outbreak clusters was unchanged compared with a non-secure distributed surveillance protocol, with an F-score higher than 0.92 for outbreaks involving 500 or more cases.
CONCLUSION: The protocol and associated software provide a practical method for providers to disclose patient data for sentinel, syndromic or other indicator-based surveillance while protecting patient privacy and the identity of individual providers.
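The property of the Paillier cryptosystem that makes it suitable for pooling case counts is that multiplying two ciphertexts yields an encryption of the sum of their plaintexts. The toy sketch below demonstrates only that property; it is not the authors' multi-party protocol, and the tiny primes, the choice of generator g = n + 1, the sample counts, and all function names are assumptions made for illustration. It is not secure for real use.

```python
import math
import secrets


def keygen(p, q):
    """Build a toy Paillier key pair from two distinct primes."""
    n = p * q
    lam = (p - 1) * (q - 1) // math.gcd(p - 1, q - 1)  # lcm(p-1, q-1)
    mu = pow(lam, -1, n)  # valid because the generator is g = n + 1
    return n, (lam, mu)


def encrypt(n, m):
    """Encrypt integer m (0 <= m < n) under public modulus n."""
    n_sq = n * n
    while True:
        r = secrets.randbelow(n - 1) + 1  # random r in [1, n-1]
        if math.gcd(r, n) == 1:
            break
    return (pow(n + 1, m, n_sq) * pow(r, n, n_sq)) % n_sq


def add_encrypted(n, c1, c2):
    """Multiplying ciphertexts adds the underlying plaintexts."""
    return (c1 * c2) % (n * n)


def decrypt(n, private_key, c):
    lam, mu = private_key
    x = pow(c, lam, n * n)
    return ((x - 1) // n * mu) % n


# Hypothetical case counts reported by three providers.
n, private_key = keygen(1009, 1013)
counts = [12, 7, 30]
total = encrypt(n, 0)
for count in counts:
    total = add_encrypted(n, total, encrypt(n, count))

print(decrypt(n, private_key, total))
```

Only the aggregate is ever decrypted, so whoever holds the private key learns the total count without seeing any individual provider's contribution.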
Affiliation(s)
- Khaled El Emam
  - Children's Hospital of Eastern Ontario Research Institute, Ottawa, Ontario, Canada