1
Oliva A, Kaphle A, Reguant R, Sng LMF, Twine NA, Malakar Y, Wickramarachchi A, Keller M, Ranbaduge T, Chan EKF, Breen J, Buckberry S, Guennewig B, Haas M, Brown A, Cowley MJ, Thorne N, Jain Y, Bauer DC. Future-proofing genomic data and consent management: a comprehensive review of technology innovations. Gigascience 2024; 13:giae021. [PMID: 38837943 DOI: 10.1093/gigascience/giae021] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/14/2023] [Revised: 01/15/2024] [Accepted: 04/09/2024] [Indexed: 06/07/2024] [Open access]
Abstract
Genomic information is increasingly used to inform medical treatments and manage future disease risks. However, any personal and societal gains must be carefully balanced against the risk to individuals contributing their genomic data. Expanding our understanding of actionable genomic insights requires researchers to access large global datasets to capture the complexity of genomic contribution to diseases. Similarly, clinicians need efficient access to a patient's genome as well as population-representative historical records for evidence-based decisions. Both researchers and clinicians hence rely on participants to consent to the use of their genomic data, which in turn requires trust in the professional and ethical handling of this information. Here, we review existing and emerging solutions for secure and effective genomic information management, including storage, encryption, consent, and authorization that are needed to build participant trust. We discuss recent innovations in cloud computing, quantum-computing-proof encryption, and self-sovereign identity. These innovations can augment key developments from within the genomics community, notably GA4GH Passports and the Crypt4GH file container standard. We also explore how decentralized storage as well as the digital consenting process can offer culturally acceptable processes to encourage data contributions from ethnic minorities. We conclude that the individual and their right to self-determination need to be put at the center of any genomics framework, because only on an individual level can the received benefits be accurately balanced against the risk of exposing private information.
Affiliation(s)
- Adrien Oliva
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Anubhav Kaphle
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Roc Reguant
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Letitia M F Sng
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Natalie A Twine
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Yuwan Malakar
- Responsible Innovation Future Science Platform, Commonwealth Scientific and Industrial Research Organisation, Brisbane, 41 Boggo Rd, Dutton Park QLD 4102, Australia
- Anuradha Wickramarachchi
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Marcel Keller
- Data61, Commonwealth Scientific and Industrial Research Organisation, Level 5/13 Garden St, Eveleigh NSW 2015, Australia
- Thilina Ranbaduge
- Data61, Commonwealth Scientific and Industrial Research Organisation, Building 101, Clunies Ross St, Black Mountain, Canberra, ACT 2601, Australia
- Eva K F Chan
- NSW Health Pathology, Sydney, 1 Reserve Road, St Leonards NSW 2065, Australia
- James Breen
- Telethon Kids Institute, Perth, WA 6009, Australia
- National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Sam Buckberry
- Telethon Kids Institute, Perth, WA 6009, Australia
- National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Boris Guennewig
- Sydney Medical School, Brain and Mind Centre, The University of Sydney, Sydney, 94 Mallett St, Camperdown NSW 2050, Australia
- Matilda Haas
- Australian Genomics, Parkville, VIC 3052, Australia
- Murdoch Children's Research Institute, Parkville, Victoria 3052, Australia
- Alex Brown
- Telethon Kids Institute, Perth, WA 6009, Australia
- National Centre for Indigenous Genomics, The John Curtin School of Medical Research, Australian National University, Canberra, ACT 2601, Australia
- Mark J Cowley
- Children's Cancer Institute, Lowy Cancer Research Centre, Level 4, Lowy Cancer Research Centre Corner Botany & High Streets UNSW Kensington Campus UNSW Sydney, Kensington NSW 2052, Australia
- School of Clinical Medicine, UNSW Medicine & Health, Wallace Wurth Building (C27), Cnr High St & Botany St, UNSW Sydney, Kensington NSW 2052, Australia
- Natalie Thorne
- University of Melbourne, Melbourne, Parkville VIC 3052, Australia
- Melbourne Genomics Health Alliance, Melbourne 1G, Walter and Eliza Hall Institute/1G Royal Parade, Parkville VIC 3052, Australia
- Walter and Eliza Hall Institute, Melbourne, 1G, Walter and Eliza Hall Institute/1G Royal Parade, Parkville VIC 3052, Australia
- Yatish Jain
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Level 3/160 Hawkesbury Rd, Westmead NSW 2145, Australia
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Applied BioSciences 205B Culloden Rd Macquarie University, NSW 2109, Australia
- Denis C Bauer
- Applied BioSciences, Faculty of Science and Engineering, Macquarie University, Applied BioSciences 205B Culloden Rd Macquarie University, NSW 2109, Australia
- Department of Biomedical Sciences, MQ Health General Practice - Macquarie University, Suite 305, Level 3/2 Technology Pl, Macquarie Park NSW 2109, Australia
- Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation, Gate 13, Kintore Avenue University of Adelaide, Adelaide SA 5000, Australia
2
Emani PS, Geradi MN, Gürsoy G, Grasty MR, Miranker A, Gerstein MB. Assessing and mitigating privacy risks of sparse, noisy genotypes by local alignment to haplotype databases. Genome Res 2023; 33:gr.278322.123. [PMID: 38097386 PMCID: PMC10760520 DOI: 10.1101/gr.278322.123] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/24/2023] [Accepted: 11/18/2023] [Indexed: 01/04/2024]
Abstract
Single nucleotide polymorphisms (SNPs) from omics data create a reidentification risk for individuals and their relatives. Although the ability of thousands of SNPs (especially rare ones) to identify individuals has been repeatedly shown, the availability of small sets of noisy genotypes, from environmental DNA samples or functional genomics data, motivated us to quantify their informativeness. We present a computational tool suite, termed Privacy Leakage by Inference across Genotypic HMM Trajectories (PLIGHT), using population-genetics-based hidden Markov models (HMMs) of recombination and mutation to find piecewise alignment of small, noisy SNP sets to reference haplotype databases. We explore cases in which query individuals are either known to be in the database, or not, and consider several genotype queries, including those from environmental sample swabs from known individuals and from simulated "mosaics" (two-individual composites). Using PLIGHT on a database with ∼5000 haplotypes, we find for common, noise-free SNPs that only ten are sufficient to identify individuals, ∼20 can identify both components in two-individual mosaics, and 20-30 can identify first-order relatives. Using noisy environmental-sample-derived SNPs, PLIGHT identifies individuals in a database using ∼30 SNPs. Even when the individuals are not in the database, local genotype matches allow for some phenotypic information leakage based on coarse-grained SNP imputation. Finally, by quantifying privacy leakage from sparse SNP sets, PLIGHT helps determine the value of selectively sanitizing released SNPs without explicit assumptions about population membership or allele frequency. To make this practical, we provide a sanitization tool to remove the most identifying SNPs from genomic data.
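The piecewise alignment described in this abstract can be illustrated with a small Viterbi search over a haplotype-copying HMM. This is only a sketch under strong simplifying assumptions: the query is treated as a single haploid allele sequence, `switch_prob` stands in for recombination and `error_prob` for mutation or genotyping noise, and all names and default values are hypothetical. It is not the PLIGHT implementation.

```python
import math


def viterbi_haplotype_path(query, reference, switch_prob=0.01, error_prob=0.01):
    """Return, per SNP, the index of the reference haplotype the query most likely copies."""
    n_hap = len(reference)

    def emit(hap, site):
        # A match is likely; a mismatch is explained by mutation or noise.
        match = reference[hap][site] == query[site]
        return math.log(1 - error_prob if match else error_prob)

    stay = math.log(1 - switch_prob + switch_prob / n_hap)  # keep copying the same haplotype
    jump = math.log(switch_prob / n_hap)                    # recombine onto another haplotype

    score = [math.log(1 / n_hap) + emit(h, 0) for h in range(n_hap)]
    back = []
    for site in range(1, len(query)):
        best_prev = max(range(n_hap), key=score.__getitem__)
        new_score, pointers = [], []
        for h in range(n_hap):
            via_stay = score[h] + stay
            via_jump = score[best_prev] + jump
            if via_stay >= via_jump:
                pointers.append(h)
                new_score.append(via_stay + emit(h, site))
            else:
                pointers.append(best_prev)
                new_score.append(via_jump + emit(h, site))
        score = new_score
        back.append(pointers)

    # Trace the best path backwards from the best final state.
    state = max(range(n_hap), key=score.__getitem__)
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical toy reference panel of three haplotypes over four SNPs.
panel = [[0, 0, 0, 0], [1, 1, 1, 1], [0, 1, 0, 1]]
print(viterbi_haplotype_path([0, 1, 0, 1], panel))
```

A query that is a composite of two reference haplotypes would yield a path that switches index partway along, which is the sense in which the alignment is piecewise.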
Affiliation(s)
- Prashant S Emani
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Maya N Geradi
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Monica R Grasty
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Andrew Miranker
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Mark B Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, Connecticut 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, New Haven, Connecticut 06520, USA
- Department of Computer Science, Yale University, New Haven, Connecticut 06520, USA
- Department of Statistics and Data Science, Yale University, New Haven, Connecticut 06520, USA
3
Ayday E, Vaidya J, Jiang X, Telenti A. Ensuring Trust in Genomics Research. ... IEEE International Conference on Trust, Privacy and Security in Intelligent Systems and Applications (TPS-ISA ...) 2023; 2023:1-12. [PMID: 38562180 PMCID: PMC10981793 DOI: 10.1109/tps-isa58951.2023.00011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Reproducibility, transparency, representation, and privacy underpin trust in genomics research in general and genome-wide association studies (GWAS) in particular. Concerns about these issues can be mitigated by technologies that address privacy protection, quality control, and verifiability of GWAS. However, many of the existing technological solutions have been developed in isolation and may address one aspect of reproducibility, transparency, representation, and privacy of GWAS while unknowingly impacting other aspects. As a consequence, the current patchwork of technological tools addresses issues with GWAS only partially and in an overlapping manner, sometimes even creating more problems. This paper addresses the progress in a field that creates technological solutions that augment the acceptance and security of population genetic analyses. The text identifies areas that are falling behind in technical implementation or where there is insufficient research. We make the case that a full understanding of the different GWAS settings, technological tools and new research directions can holistically address the requirements for the acceptance of GWAS.
Affiliation(s)
- Erman Ayday
- Department of Computer and Data Sciences Case Western Reserve University Cleveland, OH
- Jaideep Vaidya
- Management Science and Information Systems Department Rutgers University Newark, NJ
- Xiaoqian Jiang
- Department of Data Science and Artificial Intelligence University of Texas - Health Houston, TX
- Amalio Telenti
- Dept. of Integrative Structural and Computational Biology Scripps Institute La Jolla, CA
4
Li W, Kim M, Zhang K, Chen H, Jiang X, Harmanci A. COLLAGENE enables privacy-aware federated and collaborative genomic data analysis. Genome Biol 2023; 24:204. [PMID: 37697426 PMCID: PMC10496350 DOI: 10.1186/s13059-023-03039-z] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/26/2022] [Accepted: 08/16/2023] [Indexed: 09/13/2023] [Open access]
Abstract
Growing regulatory requirements set barriers around genetic data sharing and collaborations. Moreover, existing privacy-aware paradigms are challenging to deploy in collaborative settings. We present COLLAGENE, a tool base for building secure collaborative genomic data analysis methods. COLLAGENE protects data using shared-key homomorphic encryption and combines encryption with multiparty strategies for efficient privacy-aware collaborative method development. COLLAGENE provides ready-to-run tools for encryption/decryption, matrix processing, and network transfers, which can be immediately integrated into existing pipelines. We demonstrate the usage of COLLAGENE by building a practical federated GWAS protocol for binary phenotypes and a secure meta-analysis protocol. COLLAGENE is available at https://zenodo.org/record/8125935 .
Affiliation(s)
- Wentao Li
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Miran Kim
- Department of Mathematics, Department of Computer Science, Hanyang University, Seoul, 04763, Republic of Korea
- Research Institute for Convergence of Basic Science, Hanyang University, Seoul, 04763, Republic of Korea
- Bio-BigData Center, Hanyang Institute of Bioscience and Biotechnology, Hanyang University, Seoul, 04763, Republic of Korea
- Kai Zhang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Han Chen
- Human Genetics Center, Department of Epidemiology, Human Genetics and Environmental Sciences, School of Public Health, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Center for Precision Health, D. Bradley McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
- Xiaoqian Jiang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Arif Harmanci
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
- Center for Precision Health, D. Bradley McWilliams School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, TX, 77030, USA
5
Remes Lenicov F, Fink NE. Ethical issues in the use of leftover samples and associated personal data obtained from diagnostic laboratories. Clin Chim Acta 2023; 548:117442. [PMID: 37308048 PMCID: PMC10257511 DOI: 10.1016/j.cca.2023.117442] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/24/2023] [Revised: 05/30/2023] [Accepted: 06/09/2023] [Indexed: 06/14/2023]
Abstract
Diagnostic laboratories are an integral part of the research ecosystem in biomedical sciences. Among other roles, laboratories are a source of clinically characterized samples for research or diagnostic validation studies. Particularly during the COVID-19 pandemic, laboratories with varying experience in the ethical management of human samples took on this role. The objective of this document is to present the current ethical framework regarding the use of leftover samples in clinical laboratories. Leftover samples are defined as the residue of a sample that has been obtained and used for clinical purposes, and would otherwise be discarded. Secondary use of samples typically demands institutional ethical oversight and informed consent by the participants, although the latter requirement could be exempted when the risk of harm is sufficiently small. However, ongoing discussions have proposed that minimal risk is an insufficient argument to allow the use of samples without consent. In this article, we discuss both positions and then suggest that laboratories anticipating the secondary use of samples should consider the adoption of broad informed consent, or even the implementation of organized biobanking, in order to achieve higher standards of ethical compliance, which would enhance their capacity to fulfill their role in the production of knowledge.
Affiliation(s)
- Federico Remes Lenicov
- Instituto de Investigaciones Biomédicas en Retrovirus y SIDA (INBIRS), Universidad de Buenos Aires / CONICET, Buenos Aires, Argentina.
- Nilda E Fink
- Fundación Bioquímica Argentina, La Plata, Argentina.
6
Abstract
Biobanks and health data repositories provide rich reservoirs of information for use in biomedical research. These repositories depend on participants donating identifiable health data and biospecimens that may be used in perpetuity by unlimited numbers of researchers for unnamed research topics. Since 1991, U.S. federal regulatory provisions, collectively known as the Common Rule, have required informed consent of participants in federally funded human subjects research, but recent changes to the Common Rule now sanction "broad consent" in the repository research context. Broad consent is not defined in the revised Common Rule; thus, researchers and their institutions are left to determine ad hoc what broad consent means and requires. Without leadership and guidance from the U.S. Department of Health and Human Services, stakeholders with potential conflicts of interest will reach their own conclusions and craft new and varied standards for consent. The result will be uneven protections for participants.
Affiliation(s)
- Lisa E Smilan
- Visiting scholar at the Institute of Law, Psychiatry, and Public Policy at the University of Virginia and a member of the National Institutes of Health Intramural Institutional Review Board
7
Ye F, Cho H, Rouayheb SE. Mechanisms for Hiding Sensitive Genotypes with Information-Theoretic Privacy. IEEE Transactions on Information Theory 2022; 68:4090-4105. [PMID: 37283781 PMCID: PMC10243750 DOI: 10.1109/tit.2022.3156276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/08/2023]
Abstract
Motivated by the growing availability of personal genomics services, we study an information-theoretic privacy problem that arises when sharing genomic data: a user wants to share his or her genome sequence while keeping the genotypes at certain positions hidden, which could otherwise reveal critical health-related information. A straightforward solution of erasing (masking) the chosen genotypes does not ensure privacy, because the correlation between nearby positions can leak the masked genotypes. We introduce an erasure-based privacy mechanism with perfect information-theoretic privacy, whereby the released sequence is statistically independent of the sensitive genotypes. Our mechanism can be interpreted as a locally-optimal greedy algorithm for a given processing order of sequence positions, where utility is measured by the number of positions released without erasure. We show that finding an optimal order is NP-hard in general and provide an upper bound on the optimal utility. For sequences from hidden Markov models, a standard modeling approach in genetics, we propose an efficient algorithmic implementation of our mechanism with complexity polynomial in sequence length. Moreover, we illustrate the robustness of the mechanism by bounding the privacy leakage from erroneous prior distributions. Our work is a step towards more rigorous control of privacy in genomic data sharing.
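The abstract's point that simple masking leaks through correlated neighbours can be made concrete with a two-position example. The sketch below assumes a first-order Markov model over two biallelic positions; the prior and transition numbers are invented for illustration, and `posterior_of_masked` is a hypothetical helper, not part of the authors' mechanism.

```python
def posterior_of_masked(prior, transition, observed_neighbor):
    """Bayes update for a masked genotype given the released genotype next to it.

    prior[s]          : P(sensitive = s)
    transition[s][o]  : P(neighbor = o | sensitive = s)
    """
    joint = [prior[s] * transition[s][observed_neighbor] for s in range(len(prior))]
    total = sum(joint)
    return [j / total for j in joint]


# Invented numbers: the risk allele (state 1) is rare a priori,
# but the neighbouring position is strongly linked to it.
prior = [0.9, 0.1]
transition = [
    [0.95, 0.05],  # sensitive = 0 -> neighbor mostly 0
    [0.10, 0.90],  # sensitive = 1 -> neighbor mostly 1
]

posterior = posterior_of_masked(prior, transition, observed_neighbor=1)
# The attacker's belief in the risk allele rises from 0.1 to about 0.67,
# even though the sensitive position itself was erased.
print(posterior)
```

Under the definition given in the abstract, perfect privacy would require the posterior to equal the prior for every possible release, which is why correlated positions may have to be erased as well.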
Affiliation(s)
- Fangwei Ye
- Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
- Salim El Rouayheb
- Department of Electrical and Computer Engineering, Rutgers, The State University of New Jersey, Piscataway, NJ 08854, USA
8
Wan Z, Hazel JW, Clayton EW, Vorobeychik Y, Kantarcioglu M, Malin BA. Sociotechnical safeguards for genomic data privacy. Nat Rev Genet 2022; 23:429-445. [PMID: 35246669 PMCID: PMC8896074 DOI: 10.1038/s41576-022-00455-y] [Citation(s) in RCA: 20] [Impact Index Per Article: 10.0] [Accepted: 01/24/2022] [Indexed: 12/21/2022]
Abstract
Recent developments in a variety of sectors, including health care, research and the direct-to-consumer industry, have led to a dramatic increase in the amount of genomic data that are collected, used and shared. This state of affairs raises new and challenging concerns for personal privacy, both legally and technically. This Review appraises existing and emerging threats to genomic data privacy and discusses how well current legal frameworks and technical safeguards mitigate these concerns. It concludes with a discussion of remaining and emerging challenges and illustrates possible solutions that can balance protecting privacy and realizing the benefits that result from the sharing of genetic information. In this Review, the authors describe technical and legal protection mechanisms for mitigating vulnerabilities in genomic data privacy. They also discuss how these protections are dependent on the context of data use such as in research, health care, direct-to-consumer testing or forensic investigations.
Affiliation(s)
- Zhiyu Wan
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- James W Hazel
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Ellen Wright Clayton
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Center for Biomedical Ethics and Society, Vanderbilt University, Nashville, TN, USA
- Vanderbilt University Law School, Nashville, TN, USA
- Yevgeniy Vorobeychik
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO, USA
- Murat Kantarcioglu
- Department of Computer Science, University of Texas at Dallas, Richardson, TX, USA
- Bradley A Malin
- Center for Genetic Privacy and Identity in Community Settings, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Computer Science, Vanderbilt University, Nashville, TN, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN, USA
9
Akgün M, Pfeifer N, Kohlbacher O. Efficient privacy-preserving whole-genome variant queries. Bioinformatics 2022; 38:2202-2210. [PMID: 35150254 PMCID: PMC9004657 DOI: 10.1093/bioinformatics/btac070] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 10/06/2021] [Revised: 01/13/2022] [Accepted: 02/03/2022] [Indexed: 02/03/2023] [Open access]
Abstract
Motivation: Diagnosis and treatment decisions based on genomic data have become widespread as the cost of genome sequencing gradually decreases. In this context, disease-gene association studies are of great importance. However, genomic data are very sensitive compared with other data types and contain information about individuals and their relatives. Many studies have shown that this information can be obtained from the query-response pairs on genomic databases. In this work, we propose a method that uses secure multi-party computation to query genomic databases in a privacy-protected manner. The proposed solution privately outsources genomic data from arbitrarily many sources to two non-colluding proxies and allows genomic databases to be safely stored in semi-honest cloud environments. It provides data privacy, query privacy and output privacy by using XOR-based sharing and, unlike previous solutions, allows queries to run efficiently on hundreds of thousands of genomes.
Results: We measure the performance of our solution with parameters similar to real-world applications. It is possible to query a genomic database with 3 000 000 variants with five genomic query predicates in under 400 ms. Querying 1 048 576 genomes, each containing 1 000 000 variants, for the presence of five different query variants can be achieved in approximately 6 min with a small amount of dedicated hardware and connectivity. These execution times are in the right range to enable real-world applications in medical research and healthcare. Unlike previous studies, it is possible to query multiple databases with response times fast enough for practical application. To the best of our knowledge, this is the first solution that provides this performance for querying large-scale genomic data.
Availability and implementation: https://gitlab.com/DIFUTURE/privacy-preserving-variant-queries.
Supplementary information: Supplementary data are available at Bioinformatics online.
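The XOR-based sharing between two non-colluding proxies mentioned in this abstract can be sketched as follows. This is a generic illustration of two-party XOR secret sharing, not the authors' protocol: the helper names and the eight-variant bit vectors are hypothetical, and only the locally computable XOR operation is shown.

```python
import secrets


def xor_share(data: bytes) -> tuple[bytes, bytes]:
    """Split data into two XOR shares; either share alone is uniformly random."""
    mask = secrets.token_bytes(len(data))
    masked = bytes(d ^ m for d, m in zip(data, mask))
    return mask, masked


def xor_combine(left: bytes, right: bytes) -> bytes:
    """Byte-wise XOR; reconstructs the secret when given both shares."""
    return bytes(a ^ b for a, b in zip(left, right))


# Hypothetical presence/absence bit vectors for eight variants in two genomes.
genome_a = bytes([0b10110010])
genome_b = bytes([0b01100110])

a1, a2 = xor_share(genome_a)  # a1 goes to proxy 1, a2 to proxy 2
b1, b2 = xor_share(genome_b)

# XOR is linear, so each proxy can combine its own shares without communication.
proxy1_result = xor_combine(a1, b1)
proxy2_result = xor_combine(a2, b2)

# Only recombining both proxies' results reveals genome_a XOR genome_b.
assert xor_combine(proxy1_result, proxy2_result) == xor_combine(genome_a, genome_b)
assert xor_combine(a1, a2) == genome_a
```

Non-linear operations such as AND, which a real variant-matching query would likely need, generally require interaction between the two parties and are omitted from this sketch.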
Affiliation(s)
- Mete Akgün
- To whom correspondence should be addressed.
- Nico Pfeifer
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
- Methods in Medical Informatics, Department of Computer Science, University of Tübingen, Tübingen, Germany
- Statistical Learning in Computational Biology, Max Planck Institute for Informatics, Saarbrücken, Germany
- Oliver Kohlbacher
- Institute for Bioinformatics and Medical Informatics, University of Tübingen, Tübingen, Germany
- Translational Bioinformatics, University Hospital Tübingen, Tübingen, Germany
- Applied Bioinformatics, Department of Computer Science, University of Tübingen, Tübingen, Germany
10
Alsaffar MM, Hasan M, McStay GP, Sedky M. Digital DNA lifecycle security and privacy: an overview. Brief Bioinform 2022; 23:6518049. [PMID: 35106557 DOI: 10.1093/bib/bbab607] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 08/13/2021] [Revised: 12/29/2021] [Accepted: 12/30/2021] [Indexed: 11/14/2022] [Open access]
Abstract
DNA sequencing technologies have advanced significantly in the last few years, leading to advancements in biomedical research that have improved personalised medicine and the discovery of new treatments for diseases. Sequencing technology advancement has also reduced the cost of DNA sequencing, which has led to the rise of direct-to-consumer (DTC) sequencing, e.g. 23andme.com, ancestry.co.uk, etc. In the meantime, concerns have emerged over privacy and security in collecting, handling, analysing and sharing DNA and genomic data. DNA data are unique and can be used to identify individuals. Moreover, those data provide information on people's current disease status and disposition, e.g. mental health or susceptibility for developing cancer. A DNA privacy violation affects not only the owner but also their close relatives, because DNA is hereditary. This article introduces and defines the term 'digital DNA life cycle' and presents an overview of privacy and security threats and their mitigation techniques for predigital DNA and throughout the digital DNA life cycle. It covers DNA sequencing hardware, software and DNA sequence pipeline in addition to common privacy attacks and their countermeasures when DNA digital data are stored, queried or shared. Likewise, the article examines DTC genomic sequencing privacy and security.
Affiliation(s)
- Muhalb M Alsaffar
- Department of Computing, AI and Robotics, School of Digital, Technologies and Arts, Staffordshire University, College Road, ST4 2DE, Staffordshire, United Kingdom
- Gavin P McStay
- Department of Biological Sciences, School of Health, Science and Wellbeing, Staffordshire University, College Road, Stoke-on-Trent, Staffordshire, ST4 2DE, United Kingdom
- Mohamed Sedky
- Department of Computing, AI and Robotics, School of Digital, Technologies and Arts, Staffordshire University, College Road, ST4 2DE, Staffordshire, United Kingdom
11
Wan Z, Vorobeychik Y, Xia W, Liu Y, Wooders M, Guo J, Yin Z, Clayton EW, Kantarcioglu M, Malin BA. Using game theory to thwart multistage privacy intrusions when sharing data. Science Advances 2021; 7:eabe9986. [PMID: 34890225 PMCID: PMC8664254 DOI: 10.1126/sciadv.abe9986] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/14/2020] [Accepted: 10/25/2021] [Indexed: 06/13/2023]
Abstract
Person-specific biomedical data are now widely collected, but their sharing raises privacy concerns, specifically about the re-identification of seemingly anonymous records. Formal re-identification risk assessment frameworks can inform decisions about whether and how to share data; current techniques, however, focus on scenarios where the data recipients use only one resource for re-identification purposes. This is a concern because recent attacks show that adversaries can access multiple resources, combining them in a stage-wise manner, to enhance the chance of an attack's success. In this work, we represent a re-identification game using a two-player Stackelberg game of perfect information, which can be applied to assess risk, and suggest an optimal data sharing strategy based on a privacy-utility tradeoff. We report on experiments with large-scale genomic datasets to show that, using game theoretic models accounting for adversarial capabilities to launch multistage attacks, most data can be effectively shared with low re-identification risk.
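The leader-follower reasoning behind a two-player Stackelberg game can be sketched with backward induction: the data sharer (leader) commits to a sharing strategy, the adversary (follower) attacks only if that is profitable, and the sharer picks the strategy with the best payoff given that response. The example below is a single-stage simplification with invented payoffs and hypothetical names; the multistage attack model of the paper is not reproduced.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Strategy:
    name: str
    utility: float    # benefit to the sharer of releasing data this way
    reid_prob: float  # adversary's re-identification success probability


def adversary_attacks(s: Strategy, gain: float, cost: float) -> bool:
    """Follower's best response: attack only if the expected gain exceeds the cost."""
    return s.reid_prob * gain - cost > 0


def sharer_payoff(s: Strategy, gain: float, cost: float, loss: float) -> float:
    """Leader's payoff after anticipating the follower's best response."""
    expected_loss = s.reid_prob * loss if adversary_attacks(s, gain, cost) else 0.0
    return s.utility - expected_loss


def best_strategy(strategies, gain: float, cost: float, loss: float) -> Strategy:
    return max(strategies, key=lambda s: sharer_payoff(s, gain, cost, loss))


# Invented numbers purely for illustration.
options = [
    Strategy("share everything", utility=10.0, reid_prob=0.50),
    Strategy("mask identifying SNPs", utility=8.0, reid_prob=0.05),
    Strategy("share nothing", utility=0.0, reid_prob=0.0),
]
choice = best_strategy(options, gain=100.0, cost=10.0, loss=50.0)
print(choice.name)
```

With these numbers, partial masking makes an attack unprofitable while retaining most of the utility, which mirrors the privacy-utility tradeoff the abstract describes.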
Affiliation(s)
- Zhiyu Wan
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Yevgeniy Vorobeychik
- Department of Computer Science and Engineering, Washington University in St. Louis, St. Louis, MO 63130, USA
| | - Weiyi Xia
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
| | - Yongtai Liu
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
| | - Myrna Wooders
- Department of Economics, Vanderbilt University, Nashville, TN 37235, USA
| | - Jia Guo
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
| | - Zhijun Yin
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
| | - Ellen Wright Clayton
- Center for Biomedical Ethics and Society, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- School of Law, Vanderbilt University, Nashville, TN 37203, USA
- Department of Pediatrics, Vanderbilt University Medical Center, Nashville, TN 37232, USA
| | - Murat Kantarcioglu
- Department of Computer Science, University of Texas at Dallas, Richardson, TX 75080, USA
- Institute for Quantitative Social Science, Harvard University, Cambridge, MA 02138, USA
- Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, Berkeley, CA 94720, USA
| | - Bradley A. Malin
- Department of Electrical Engineering and Computer Science, Vanderbilt University, Nashville, TN 37212, USA
- Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
12
Bu D, Wang X, Tang H. Haplotype-based membership inference from summary genomic data. Bioinformatics 2021; 37:i161-i168. [PMID: 34252973 PMCID: PMC8275351 DOI: 10.1093/bioinformatics/btab305] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 11/22/2022] Open
Abstract
Motivation: The availability of human genomic data, together with the enhanced capacity to process them, is leading to transformative technological advances in biomedical science and engineering. However, the public dissemination of such data has been difficult due to privacy concerns. Specifically, it has been shown that the presence of a human subject in a case group can be inferred from the shared summary statistics of the group, e.g. the allele frequencies, or even the presence/absence of genetic variants (e.g. shared by the Beacon project) in the group. These methods rely on the availability of the target’s genome, i.e. the DNA profile of a target human subject, and thus are often referred to as membership inference methods. Results: In this article, we demonstrate that haplotypes, i.e. the sequences of single nucleotide variations (SNVs) showing strong genetic linkage in human genome databases, may be inferred from the summary of genomic data without using a target’s genome. Furthermore, novel haplotypes that did not appear in the database may be reconstructed solely from the allele frequencies from genomic datasets. These reconstructed haplotypes can be used for a haplotype-based membership inference algorithm to identify target subjects in a case group with greater power than existing methods based on SNVs. Availability and implementation: The implementation of the membership inference algorithms is available at https://github.com/diybu/Haplotype-based-membership-inferences.
Affiliation(s)
- Diyue Bu
  - Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
- Xiaofeng Wang
  - Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
- Haixu Tang
  - Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Indiana University, Bloomington, IN 47408, USA
13
Lu D, Zhang Y, Zhang L, Wang H, Weng W, Li L, Cai H. Methods of privacy-preserving genomic sequencing data alignments. Brief Bioinform 2021; 22:6279828. [PMID: 34021302 DOI: 10.1093/bib/bbab151] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 11/05/2020] [Revised: 03/10/2021] [Accepted: 03/30/2021] [Indexed: 11/14/2022] Open
Abstract
Genomic data alignment, a fundamental operation in sequencing, can be utilized to map reads into a reference sequence, query on a genomic database and perform genetic tests. However, with the reduction of sequencing cost and the accumulation of genome data, privacy-preserving genomic sequencing data alignment is becoming unprecedentedly important. In this paper, we present a comprehensive review of secure genomic data comparison schemes. We discuss the privacy threats, including adversaries and privacy attacks. The attacks can be categorized into inference, membership, identity tracing and completion attacks and have been applied to obtaining the genomic privacy information. We classify the state-of-the-art genomic privacy-preserving alignment methods into three different scenarios: large-scale reads mapping, encrypted genomic datasets querying and genetic testing to ease privacy threats. A comprehensive analysis of these approaches has been carried out to evaluate the computation and communication complexity as well as the privacy requirements. The survey provides the researchers with the current trends and the insights on the significance and challenges of privacy issues in genomic data alignment.
Affiliation(s)
- Dandan Lu
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Yue Zhang
  - School of Computer Science, Guangdong Polytechnic Normal University, Guangzhou, 510006, China
- Ling Zhang
  - Department of Radiology, Sun Yat-sen University Cancer Center; State Key Laboratory of Oncology in South China; Collaborative Innovation Center for Cancer Medicine, 651 Dongfeng East Road, Guangzhou, 510060, P. R. China
- Haiyan Wang
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Wanlin Weng
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Li Li
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
- Hongmin Cai
  - School of Computer Science and Engineering, South China University of Technology, Guangzhou, 510006, China
14
Ayoz K, Ayday E, Cicek AE. Genome Reconstruction Attacks Against Genomic Data-Sharing Beacons. Proceedings on Privacy Enhancing Technologies. Privacy Enhancing Technologies Symposium 2021; 2021:28-48. [PMID: 34746296 PMCID: PMC8570374 DOI: 10.2478/popets-2021-0036] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Indexed: 11/21/2022]
Abstract
Sharing genome data in a privacy-preserving way stands as a major bottleneck in front of the scientific progress promised by the big data era in genomics. A community-driven protocol named genomic data-sharing beacon protocol has been widely adopted for sharing genomic data. The system aims to provide a secure, easy to implement, and standardized interface for data sharing by only allowing yes/no queries on the presence of specific alleles in the dataset. However, beacon protocol was recently shown to be vulnerable against membership inference attacks. In this paper, we show that privacy threats against genomic data sharing beacons are not limited to membership inference. We identify and analyze a novel vulnerability of genomic data-sharing beacons: genome reconstruction. We show that it is possible to successfully reconstruct a substantial part of the genome of a victim when the attacker knows the victim has been added to the beacon in a recent update. In particular, we show how an attacker can use the inherent correlations in the genome and clustering techniques to run such an attack in an efficient and accurate way. We also show that even if multiple individuals are added to the beacon during the same update, it is possible to identify the victim's genome with high confidence using traits that are easily accessible by the attacker (e.g., eye color or hair type). Moreover, we show how a reconstructed genome using a beacon that is not associated with a sensitive phenotype can be used for membership inference attacks to beacons with sensitive phenotypes (e.g., HIV+). The outcome of this work will guide beacon operators on when and how to update the content of the beacon and help them (along with the beacon participants) make informed decisions.
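The yes/no interface described in this abstract, and the reason an update can leak information, can be shown with a toy model. The sketch below is an illustration under stated assumptions, not the Beacon API and not the authors' attack: the `Beacon` class, the `newly_present` helper, and all variant tuples are hypothetical, and the published attack additionally relies on genomic correlations and clustering, which are omitted here.

```python
class Beacon:
    """Toy beacon: answers only whether an allele is present in the dataset."""

    def __init__(self, genomes):
        # genomes: mapping of participant id -> set of (chromosome, position, alt allele)
        self._alleles = set()
        for variants in genomes.values():
            self._alleles |= set(variants)

    def query(self, chrom, pos, alt):
        """Return True if any participant carries the allele, else False."""
        return (chrom, pos, alt) in self._alleles


def newly_present(before, after, candidates):
    """Alleles whose answer flips from 'no' to 'yes' between two beacon snapshots.

    If exactly one participant was added in the update, every flipped
    allele must belong to that participant.
    """
    return [v for v in candidates if after.query(*v) and not before.query(*v)]


# Hypothetical data: one existing participant, then a victim is added.
existing = {("1", 1000, "A"), ("2", 2000, "T")}
victim = {("1", 1000, "A"), ("3", 3000, "G")}

before = Beacon({"p1": existing})
after = Beacon({"p1": existing, "victim": victim})

candidates = [("1", 1000, "A"), ("2", 2000, "T"), ("3", 3000, "G"), ("4", 4000, "C")]

if __name__ == "__main__":
    print(newly_present(before, after, candidates))
```

Only the victim's alleles that were absent before the update are exposed by this simple difference; alleles shared with existing participants stay hidden, which is where the correlation-based inference described in the abstract would come in.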
15
Paige B, Bell J, Bellet A, Gascón A, Ezer D. Reconstructing Genotypes in Private Genomic Databases from Genetic Risk Scores. J Comput Biol 2021; 28:435-451. [PMID: 33400590 PMCID: PMC8165474 DOI: 10.1089/cmb.2020.0445] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 11/13/2022] Open
Abstract
Some organizations such as 23andMe and the UK Biobank have large genomic databases that they re-use for multiple different genome-wide association studies. Even research studies that compile smaller genomic databases often utilize these databases to investigate many related traits. It is common for the study to report a genetic risk score (GRS) model for each trait within the publication. Here, we show that under some circumstances, these GRS models can be used to recover the genetic variants of individuals in these genomic databases, which constitutes a reconstruction attack. In particular, if two GRS models are trained by using a largely overlapping set of participants, it is often possible to determine the genotype for each of the individuals who were used to train one GRS model, but not the other. We demonstrate this theoretically and experimentally by analyzing the Cornell Dog Genome database. The accuracy of our reconstruction attack depends on how accurately we can estimate the rate of co-occurrence of pairs of single nucleotide polymorphisms within the private database, so if this aggregate information is ever released, it would drastically reduce the security of a private genomic database. Caution should be applied when using the same database for multiple analyses, especially when a small number of individuals are included or excluded from one part of the study.
Affiliation(s)
- Brooks Paige
  - The Alan Turing Institute, London, United Kingdom
  - Department of Computer Science, University College London, London, United Kingdom
- James Bell
  - The Alan Turing Institute, London, United Kingdom
- Aurélien Bellet
  - Inria, Parc Scientifique de la Haute Borne Park Plaza, Villeneuve d'Ascq, France
- Adrià Gascón
  - The Alan Turing Institute, London, United Kingdom
  - University of Warwick, Coventry, United Kingdom
- Daphne Ezer
  - The Alan Turing Institute, London, United Kingdom
  - University of Warwick, Coventry, United Kingdom
  - Department of Biology, University of York, York, United Kingdom
16
Ayoz K, Aysen M, Ayday E, Cicek AE. The effect of kinship in re-identification attacks against genomic data sharing beacons. Bioinformatics 2020; 36:i903-i910. [PMID: 33381836 PMCID: PMC7773481 DOI: 10.1093/bioinformatics/btaa821] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Accepted: 09/08/2020] [Indexed: 12/31/2022] Open
Abstract
MOTIVATION: The big data era in genomics promises a breakthrough in medicine, but the need to share data in a private manner limits the pace of the field. The widely accepted 'genomic data sharing beacon' protocol provides a standardized and secure interface for querying genomic datasets. The data are only shared if the desired information (e.g. a certain variant) exists in the dataset. Various studies showed that beacons are vulnerable to re-identification (or membership inference) attacks. As beacons are generally associated with sensitive phenotype information, re-identification creates a significant risk for the participants. Unfortunately, proposed countermeasures against such attacks have failed to be effective, as they do not consider the utility of the beacon protocol. RESULTS: In this study, for the first time, we analyze the mitigation effect of the kinship relationships among beacon participants against re-identification attacks. We argue that having multiple family members in a beacon can garble the information for attacks since a substantial number of variants are shared among kin-related people. Using family genomes from HapMap and synthetically generated datasets, we show that having one of the parents of a victim in the beacon causes (i) a significant decrease in the power of attacks and (ii) a substantial increase in the number of queries needed to confirm an individual's beacon membership. We also show how the protection effect attenuates when more distant relatives, such as grandparents, are included alongside the victim. Furthermore, we quantify the utility loss due to adding relatives and show that it is smaller than with flipping-based techniques.
Affiliation(s)
- Kerem Ayoz
  - Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
- Miray Aysen
  - Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
- Erman Ayday (corresponding author)
  - Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
  - Computer and Data Sciences Department, Case Western Reserve University, Cleveland, OH 44106, USA
- A Ercument Cicek (corresponding author)
  - Computer Engineering Department, Bilkent University, Ankara 06800, Turkey
  - Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA
17
Karimi S, Jiang X, Dolin RH, Kim M, Boxwala A. A secure system for genomics clinical decision support. J Biomed Inform 2020; 112:103602. [PMID: 33080397 PMCID: PMC8577277 DOI: 10.1016/j.jbi.2020.103602] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2020] [Revised: 09/07/2020] [Accepted: 10/12/2020] [Indexed: 11/26/2022]
Abstract
We developed a prototype genomic archiving and communications system to securely store genome data and provide clinical decision support (CDS). This system operates on a client-server model. The client encrypts the data, and the server stores data and performs the computations necessary for CDS. Computations are directly performed on encrypted data, and the client decrypts results. The server cannot decrypt inputs or outputs, which provides strong guarantees of security. We have validated our system with three genomics-based CDS applications. The results demonstrate that it is possible to resolve a long-standing dilemma in genomic data privacy and accessibility, by using a principled cryptographical framework and a mathematical representation of genome data and CDS questions.
Affiliation(s)
- Xiaoqian Jiang
  - UT Health School of Biomedical Informatics, Houston, TX, United States
- Miran Kim
  - UT Health School of Biomedical Informatics, Houston, TX, United States
- Aziz Boxwala
  - Elimu Informatics Inc., Richmond, CA, United States
18
Katsanis SH. Pedigrees and Perpetrators: Uses of DNA and Genealogy in Forensic Investigations. Annu Rev Genomics Hum Genet 2020; 21:535-564. [DOI: 10.1146/annurev-genom-111819-084213] [Citation(s) in RCA: 22] [Impact Index Per Article: 5.5] [Indexed: 11/09/2022]
Abstract
In the past few years, cases with DNA evidence that could not be solved with direct matches in DNA databases have benefited from comparing single-nucleotide polymorphism data with private and public genomic databases. Using a combination of genome comparisons and traditional genealogical research, investigators can triangulate distant relatives to the contributor of DNA data from a crime scene, ultimately identifying perpetrators of violent crimes. This approach has also been successful in identifying unknown deceased persons and perpetrators of lesser crimes. Such advances are bringing into focus ethical questions on how much access to DNA databases should be granted to law enforcement and how best to empower public genome contributors with control over their data. The necessary policies will take time to develop but can be informed by reflection on the familial searching policies developed for searches of the federal DNA database and considerations of the anonymity and privacy interests of civilians.
Affiliation(s)
- Sara H. Katsanis
  - Mary Ann & J. Milburn Smith Child Health Research, Outreach, and Advocacy Center, Ann & Robert H. Lurie Children's Hospital of Chicago, Chicago, Illinois 60611, USA
  - Department of Pediatrics, Northwestern University, Chicago, Illinois 60611, USA
19
Wu X, Zheng H, Dou Z, Chen F, Deng J, Chen X, Xu S, Gao G, Li M, Wang Z, Xiao Y, Xie K, Wang S, Xu H. A novel privacy-preserving federated genome-wide association study framework and its application in identifying potential risk variants in ankylosing spondylitis. Brief Bioinform 2020; 22:5860679. [PMID: 32591779 DOI: 10.1093/bib/bbaa090] [Citation(s) in RCA: 9] [Impact Index Per Article: 2.3] [Received: 01/26/2020] [Revised: 04/05/2020] [Accepted: 04/24/2020] [Indexed: 11/13/2022] Open
Abstract
Genome-wide association studies (GWAS) have been widely used for identifying potential risk variants in various diseases. A statistically meaningful GWAS typically requires a large sample size to detect disease-associated single nucleotide polymorphisms (SNPs). However, a single institution usually only possesses a limited number of samples. Therefore, cross-institutional partnerships are required to increase sample size and statistical power. However, cross-institutional partnerships pose significant challenges, a major one being data privacy. For example, the privacy awareness of people, the impact of data privacy leakages and the privacy-related risks are becoming increasingly important, while there is no de-identification standard available to safeguard genomic data sharing. In this paper, we introduce a novel privacy-preserving federated GWAS framework (iPRIVATES). Equipped with privacy-preserving federated analysis, iPRIVATES enables multiple institutions to jointly perform GWAS analysis without leaking patient-level genotyping data. Only aggregated local statistics are exchanged within the study network. In addition, we evaluate the performance of iPRIVATES through both simulated data and a real-world application for identifying potential risk variants in ankylosing spondylitis (AS). The experimental results showed that the strongest signals of AS-associated SNPs reside mostly around the human leukocyte antigen (HLA) regions. The proposed iPRIVATES framework achieved results equivalent to those of a traditional centralized implementation, demonstrating its great potential in driving collaborative genomic research for different diseases while preserving data privacy.
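The idea that only aggregated local statistics leave each institution can be sketched for a single SNP. The code below is an assumption-laden illustration, not the iPRIVATES protocol, whose details are not given in this abstract: each site reduces its genotypes to four allele counts, and a coordinator pools the counts and runs a plain allelic chi-square test. The function names and the example genotypes are hypothetical.

```python
def local_counts(genotypes, is_case):
    """Site-local step: summarise one SNP as allele counts per group.

    genotypes: list of minor-allele counts (0, 1 or 2), one per participant
    is_case:   parallel list of booleans (True = case, False = control)
    Only the four returned integers would leave the site.
    """
    counts = {"case_minor": 0, "case_total": 0, "ctrl_minor": 0, "ctrl_total": 0}
    for g, case in zip(genotypes, is_case):
        prefix = "case" if case else "ctrl"
        counts[prefix + "_minor"] += g
        counts[prefix + "_total"] += 2  # two alleles per diploid participant
    return counts


def pooled_chi_square(site_counts):
    """Coordinator step: sum the site summaries and run a 2x2 allelic chi-square test."""
    total = {k: sum(c[k] for c in site_counts) for k in site_counts[0]}
    a = total["case_minor"]
    b = total["case_total"] - a
    c = total["ctrl_minor"]
    d = total["ctrl_total"] - c
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom


# Hypothetical genotypes held at two institutions.
site1 = ([2, 1, 0, 0], [True, True, False, False])
site2 = ([1, 2, 0, 1], [True, True, False, False])

federated = pooled_chi_square([local_counts(*site1), local_counts(*site2)])

# Centralized reference: the same test on the concatenated patient-level data.
centralized = pooled_chi_square([local_counts(site1[0] + site2[0], site1[1] + site2[1])])

if __name__ == "__main__":
    print(federated, centralized)
```

Because allele counts are additive, the pooled statistic matches the centralized one in this simple case. Note that aggregated counts can still leak information in small cohorts, so a real deployment would need further safeguards.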
Affiliation(s)
- Xin Wu
  - Department of Rheumatology and Immunology, Shanghai Changzheng Hospital, Second Military Medical University, Shanghai, China
- Zuochao Dou
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co. Ltd, Hangzhou, China
- Feng Chen
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
- Jieren Deng
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
- Xiang Chen
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
- Zhen Wang
  - Department of Rheumatology and Immunology, Shanghai Changzheng Hospital, Second Military Medical University, China
- Yuhui Xiao
  - Department of Bioinformatics, Hangzhou Nuowei Information Technology, Hangzhou, China
- Kang Xie
  - Key Lab of Information Network Security of the Ministry of Public Security
- Shuang Wang
  - Hangzhou Nuowei Information Technology Co., Ltd, Hangzhou, China
- Huji Xu
  - Department of Rheumatology and Immunology, Shanghai Changzheng Hospital
20
May T. Are Public Repository Requirements Exacerbating Lack of Diversity? Trends Genet 2020; 36:390-394. [PMID: 32396832 DOI: 10.1016/j.tig.2020.03.004] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 12/31/2019] [Revised: 03/13/2020] [Accepted: 03/16/2020] [Indexed: 10/24/2022]
Abstract
Although public repository requirements are aimed at researchers and designed to ensure that the utility of the limited data we have is optimized, these policies also have ramifications for research participants. In this opinion article, I discuss how the nature of such repositories can subject participants whose data are 'banked' to unwitting participation in scientific projects they might find objectionable. In addition, concerns about the privacy of banked genomic data are exacerbated by recent projects that demonstrate the ability to re-identify genomic data, raising the specter of discriminatory or oppressive use of this information. These concerns are most likely to discourage participation in research that requires data sharing among those who have experienced these phenomena and are less likely to discount their likelihood.
Affiliation(s)
- Thomas May
  - Elson S. Floyd College of Medicine, Washington State University, Vancouver, WA, USA
  - HudsonAlpha Institute for Biotechnology, Huntsville, AL, USA
21
Abstract
The Canadian Genomics Partnership for Rare Diseases, spearheaded by Genome Canada, will integrate genome-wide sequencing to rare disease clinical care in Canada. Centralized and tiered models of data stewardship are proposed to ensure that the data generated can be shared for secondary clinical, research, and quality assurance purposes in compliance with ethics and law. The principal ethico-legal obligations of clinicians, researchers, and institutions are synthesized. Governance infrastructures such as registered access platforms, data access compliance offices, and Beacon systems are proposed as potential organizational and technical foundations of responsible rare disease data sharing. The appropriate delegation of responsibilities, the transparent communication of rights and duties, and the integration of data privacy safeguards into infrastructure design are proposed as the cornerstones of rare disease data stewardship.
Affiliation(s)
- Alexander Bernier
  - Centre of Genomics and Policy, Faculty of Medicine, McGill University, Montreal, QC H3A 0G1, Canada
22
Boronow KE, Perovich LJ, Sweeney L, Yoo JS, Rudel RA, Brown P, Brody JG. Privacy Risks of Sharing Data from Environmental Health Studies. Environmental Health Perspectives 2020; 128:17008. [PMID: 31922426 PMCID: PMC7015543 DOI: 10.1289/ehp4817] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Received: 11/30/2018] [Revised: 12/04/2019] [Accepted: 12/05/2019] [Indexed: 06/10/2023]
Abstract
BACKGROUND: Sharing research data uses resources effectively; enables large, diverse data sets; and supports rigor and reproducibility. However, sharing such data increases privacy risks for participants who may be re-identified by linking study data to outside data sets. These risks have been investigated for genetic and medical records but rarely for environmental data. OBJECTIVES: We evaluated how data in environmental health (EH) studies may be vulnerable to linkage and we investigated, in a case study, whether environmental measurements could contribute to inferring latent categories (e.g., geographic location), which increases privacy risks. METHODS: We identified 12 prominent EH studies, reviewed the data types collected, and evaluated the availability of outside data sets that overlap with study data. With data from the Household Exposure Study in California and Massachusetts and the Green Housing Study in Boston, Massachusetts, and Cincinnati, Ohio, we used k-means clustering and principal component analysis to investigate whether participants' region of residence could be inferred from measurements of chemicals in household air and dust. RESULTS: All 12 studies included at least two of five data types that overlap with outside data sets: geographic location (9 studies), medical data (9 studies), occupation (10 studies), housing characteristics (10 studies), and genetic data (7 studies). In our cluster analysis, participants' region of residence could be inferred with 80%-98% accuracy using environmental measurements with original laboratory reporting limits. DISCUSSION: EH studies frequently include data that are vulnerable to linkage with voter lists, tax and real estate data, professional licensing lists, and ancestry websites, and exposure measurements may be used to identify subgroup membership, increasing likelihood of linkage. Thus, unsupervised sharing of EH research data potentially raises substantial privacy risks. Empirical research can help characterize risks and evaluate technical solutions. Our findings reinforce the need for legal and policy protections to shield participants from potential harms of re-identification from data sharing. https://doi.org/10.1289/EHP4817.
Affiliation(s)
- Laura J. Perovich
  - Silent Spring Institute, Newton, Massachusetts, USA
  - MIT Media Lab, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Latanya Sweeney
  - Department of Government, Harvard University, Cambridge, Massachusetts, USA
- Ji Su Yoo
  - Department of Government, Harvard University, Cambridge, Massachusetts, USA
- Phil Brown
  - Department of Sociology and Anthropology and Department of Health Sciences, Northeastern University, Boston, Massachusetts, USA
23
Beyond Genes: Re-Identifiability of Proteomic Data and Its Implications for Personalized Medicine. Genes (Basel) 2019; 10:genes10090682. [PMID: 31492022 PMCID: PMC6770961 DOI: 10.3390/genes10090682] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Received: 07/19/2019] [Revised: 08/30/2019] [Accepted: 09/01/2019] [Indexed: 02/07/2023] Open
Abstract
The increasing availability of high throughput proteomics data provides us with opportunities as well as posing new ethical challenges regarding data privacy and re-identifiability of participants. Moreover, the fact that proteomics represents a level between the genotype and the phenotype further exacerbates the situation, introducing dilemmas related to publicly available data, anonymization, ownership of information and incidental findings. In this paper, we try to differentiate proteomics from genomics data and cover the ethical challenges related to proteomics data sharing. Finally, we give an overview of the proposed solutions and the outlook for future studies.