1
Wang X, Dervishi L, Li W, Ayday E, Jiang X, Vaidya J. Privacy-preserving federated genome-wide association studies via dynamic sampling. Bioinformatics 2023; 39:btad639. [PMID: 37856329 PMCID: PMC10612407 DOI: 10.1093/bioinformatics/btad639] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/09/2023] [Revised: 09/15/2023] [Accepted: 10/18/2023] [Indexed: 10/21/2023] Open
Abstract
MOTIVATION Genome-wide association studies (GWAS) benefit from the increasing availability of genomic data and cross-institution collaborations. However, sharing data across institutional boundaries jeopardizes medical data confidentiality and patient privacy. While modern cryptographic techniques provide formal security guarantees, their substantial communication and computational overheads hinder the practical application of large-scale collaborative GWAS. RESULTS This work introduces an efficient framework for conducting collaborative GWAS on distributed datasets, maintaining data privacy without compromising the accuracy of the results. We propose a novel two-step strategy aimed at reducing communication and computational overheads, and we employ iterative and sampling techniques to ensure accurate results. We instantiate our approach using logistic regression, a commonly used statistical method for identifying associations between genetic markers and the phenotype of interest. We evaluate our proposed methods using two real genomic datasets and demonstrate their robustness in the presence of between-study heterogeneity and skewed phenotype distributions using a variety of experimental settings. The empirical results show the efficiency and applicability of the proposed method and its promise for large-scale collaborative GWAS. AVAILABILITY AND IMPLEMENTATION The source code and data are available at https://github.com/amioamo/TDS.
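To make the federated setting concrete, the sketch below fits a logistic regression by summing per-site gradients, so that raw genotypes never leave a site. It is a generic illustration only: it does not implement the paper's two-step strategy or dynamic sampling, it applies no cryptographic protection to the exchanged gradients, and the names `local_gradient`, `federated_fit`, `SITE_A`, and `SITE_B` are hypothetical.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def local_gradient(X, y, beta):
    """Sum of per-sample log-loss gradients, computed inside one site."""
    grad = [0.0] * len(beta)
    for row, label in zip(X, y):
        p = sigmoid(sum(b * x for b, x in zip(beta, row)))
        for j, x in enumerate(row):
            grad[j] += (p - label) * x
    return grad


def federated_fit(sites, n_features, lr=0.5, rounds=500):
    """Gradient descent where only aggregated gradients are shared."""
    beta = [0.0] * n_features
    n_total = sum(len(y) for _, y in sites)
    for _ in range(rounds):
        # Each site evaluates its gradient locally on its own records.
        grads = [local_gradient(X, y, beta) for X, y in sites]
        # The coordinator sees only the per-site gradient vectors.
        total = [sum(g[j] for g in grads) for j in range(n_features)]
        beta = [b - lr * t / n_total for b, t in zip(beta, total)]
    return beta


# Hypothetical toy data: column 0 is the intercept, column 1 a genotype (0/1/2).
SITE_A = ([[1, 0], [1, 1], [1, 2], [1, 0]], [0, 0, 1, 1])
SITE_B = ([[1, 2], [1, 1], [1, 0], [1, 2]], [1, 1, 0, 0])

beta_federated = federated_fit([SITE_A, SITE_B], n_features=2)
```

Because the pooled gradient is the sum of the site gradients, this procedure matches centralized training up to floating-point rounding; a privacy-preserving deployment would additionally protect the gradient exchange itself.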
Affiliation(s)
- Xinyue Wang
  - Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
- Leonard Dervishi
  - Department of Computer and Data Sciences, Cleveland, OH 44106, United States
- Wentao Li
  - Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
- Erman Ayday
  - Department of Computer and Data Sciences, Cleveland, OH 44106, United States
- Xiaoqian Jiang
  - Department of Health Data Science and Artificial Intelligence, Houston, TX 77030, United States
- Jaideep Vaidya
  - Management Science and Information Systems Department, Rutgers University, New Brunswick, NJ 07102, United States
2
Tomietto M, McGill A, Kiernan MD. Implementing an electronic public health record for policy planning in the UK military sector: Validation of a secure hashing algorithm. Heliyon 2023; 9:e16116. [PMID: 37265623 PMCID: PMC10230209 DOI: 10.1016/j.heliyon.2023.e16116] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/11/2023] [Revised: 05/07/2023] [Accepted: 05/07/2023] [Indexed: 06/03/2023] Open
Abstract
The digitalisation of healthcare services is a major resource to inform policy-makers. However, the availability of data and the establishment of a data flow present new issues to address, such as data anonymisation, records' reliability, and data linkage. The veterans' population in the UK presents complex needs and many organisations provide social and healthcare support, but their databases are not linked or aggregated to provide a comprehensive overview of service planning. This study aims to test the sensitivity and specificity of a Secure Hashing Algorithm to generate a unique anonymous identifier for data linkage across different organisations in the veterans' population. A Secure Hashing Algorithm was applied to two input variables drawn from two different datasets. The uniqueness of the identifier was compared against the single personal key adopted as a current standard identifier. Chi-square, sensitivity, and specificity were calculated. The results demonstrated that the unique identifier generated by the Secure Hashing Algorithm detected more unique records when compared to the current gold standard. The identifier demonstrated optimal sensitivity and specificity and it allowed an enhanced data linkage between different datasets. The adoption of a Secure Hashing Algorithm improved the uniqueness of records. Moreover, it ensured data anonymity by transforming personal information into an encrypted identifier. This approach is beneficial for big data management and for creating an aggregated system for linking different organisations and, in this way, for providing a more comprehensive overview of healthcare provision and the foundation for precision public health strategies.
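A minimal sketch of hash-based identifier generation is shown here. The abstract names neither the two input variables nor the SHA variant, so the fields `surname` and `date_of_birth`, the choice of SHA-256, and the normalisation step are all assumptions made for illustration.

```python
import hashlib


def linkage_id(surname: str, date_of_birth: str) -> str:
    """Derive a fixed-length pseudonymous identifier from two fields.

    Both fields are hypothetical; values are trimmed and lower-cased so
    that trivial formatting differences do not break the linkage.
    """
    normalized = "|".join(part.strip().lower() for part in (surname, date_of_birth))
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


# The same person recorded by two organisations yields the same identifier.
id_org_a = linkage_id(" Smith", "1980-01-31")
id_org_b = linkage_id("smith ", "1980-01-31")
```

An unkeyed hash of low-entropy fields can be reversed by enumerating plausible inputs, so a production design would typically use a keyed construction such as HMAC with a secret held by the linking parties.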
3
Wirth FN, Kussel T, Müller A, Hamacher K, Prasser F. EasySMPC: a simple but powerful no-code tool for practical secure multiparty computation. BMC Bioinformatics 2022; 23:531. [PMID: 36494612 PMCID: PMC9733077 DOI: 10.1186/s12859-022-05044-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 05/31/2022] [Accepted: 11/08/2022] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Modern biomedical research is data-driven and relies heavily on the re-use and sharing of data. Biomedical data, however, is subject to strict data protection requirements. Due to the complexity of the data required and the scale of data use, obtaining informed consent is often infeasible. Other methods, such as anonymization or federation, in turn have their own limitations. Secure multi-party computation (SMPC) is a cryptographic technology for distributed calculations, which brings formally provable security and privacy guarantees and can be used to implement a wide range of analytical approaches. As a relatively new technology, SMPC is still rarely used in real-world biomedical data sharing activities due to several barriers, including its technical complexity and lack of usability. RESULTS To overcome these barriers, we have developed the tool EasySMPC, which is implemented in Java as a cross-platform, stand-alone desktop application provided as open-source software. The tool makes use of the SMPC method Arithmetic Secret Sharing, which allows parties to securely sum up pre-defined sets of variables in two rounds of communication (input sharing and output reconstruction), and integrates this method into a graphical user interface. No additional software services need to be set up or configured, as EasySMPC uses the most widespread digital communication channel available: e-mails. No cryptographic keys need to be exchanged between the parties and e-mails are exchanged automatically by the software. To demonstrate the practicability of our solution, we evaluated its performance in a wide range of data sharing scenarios. The results of our evaluation show that our approach is scalable (summing up 10,000 variables between 20 parties takes less than 300 s) and that the number of participants is the essential factor. CONCLUSIONS We have developed an easy-to-use "no-code solution" for performing secure joint calculations on biomedical data using SMPC protocols, which is suitable for use by scientists without IT expertise and which has no special infrastructure requirements. We believe that innovative approaches to data sharing with SMPC are needed to foster the translation of complex protocols into practice.
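The two-round secure sum described in the abstract can be sketched with additive secret sharing over a prime modulus. This is a single-process simulation under stated assumptions, not EasySMPC's Java implementation: the modulus, the function names `share` and `secure_sum`, and the in-memory "message exchange" are illustrative, and inputs are assumed to be non-negative integers smaller than the modulus.

```python
import secrets

MODULUS = 2**61 - 1  # assumed prime modulus; any modulus larger than the true sum works


def share(value: int, n_parties: int) -> list[int]:
    """Split a value into n random-looking shares that sum to it mod MODULUS."""
    shares = [secrets.randbelow(MODULUS) for _ in range(n_parties - 1)]
    shares.append((value - sum(shares)) % MODULUS)
    return shares


def secure_sum(private_inputs: list[int]) -> int:
    n = len(private_inputs)
    # Round 1 (input sharing): party i sends its j-th share to party j.
    distributed = [share(v, n) for v in private_inputs]
    # Each party j locally adds the shares it received; one share alone reveals nothing.
    partial_sums = [sum(distributed[i][j] for i in range(n)) % MODULUS for j in range(n)]
    # Round 2 (output reconstruction): the partial sums are exchanged and combined.
    return sum(partial_sums) % MODULUS


total = secure_sum([12, 7, 30])  # three parties, one private count each
```

Only the final total is reconstructed; each individual input stays hidden as long as a party does not obtain all shares of that input.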
Affiliation(s)
- Felix Nikolaus Wirth
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117 Berlin, Germany (GRID grid.484013.a; ISNI 0000 0004 6879 971X)
- Tobias Kussel
  - Computational Biology and Simulation, TU Darmstadt, Darmstadt, Germany (GRID grid.6546.1; ISNI 0000 0001 0940 1669)
- Armin Müller
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117 Berlin, Germany (GRID grid.484013.a; ISNI 0000 0004 6879 971X)
- Kay Hamacher
  - Computational Biology and Simulation, TU Darmstadt, Darmstadt, Germany (GRID grid.6546.1; ISNI 0000 0001 0940 1669)
- Fabian Prasser
  - Berlin Institute of Health at Charité – Universitätsmedizin Berlin, Medical Informatics Group, Charitéplatz 1, 10117 Berlin, Germany (GRID grid.484013.a; ISNI 0000 0004 6879 971X)
4
Crossfield SSR, Zucker K, Baxter P, Wright P, Fistein J, Markham AF, Birkin M, Glaser AW, Hall G. A data flow process for confidential data and its application in a health research project. PLoS One 2022; 17:e0262609. [PMID: 35061834 PMCID: PMC8782367 DOI: 10.1371/journal.pone.0262609] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 09/13/2021] [Accepted: 12/29/2021] [Indexed: 11/30/2022] Open
Abstract
BACKGROUND The use of linked healthcare data in research has the potential to make major contributions to knowledge generation and service improvement. However, using healthcare data for secondary purposes raises legal and ethical concerns relating to confidentiality, privacy and data protection rights. Using a linkage and anonymisation approach that processes data lawfully and in line with ethical best practice to create an anonymous (non-personal) dataset can address these concerns, yet there is no set approach for defining all of the steps involved in such data flow end-to-end. We aimed to define such an approach with clear steps for dataset creation, and to describe its utilisation in a case study linking healthcare data. METHODS We developed a data flow protocol that generates pseudonymous datasets that can be reversibly linked, or irreversibly linked to form an anonymous research dataset. It was designed and implemented by the Comprehensive Patient Records (CPR) study in Leeds, UK. RESULTS We defined a clear approach that received ethico-legal approval for use in creating an anonymous research dataset. Our approach used individual-level linkage through a mechanism that is not computer-intensive and was rendered irreversible to both data providers and processors. We successfully applied it in the CPR study to hospital and general practice and community electronic health record data from two providers, along with patient reported outcomes, for 365,193 patients. The resultant anonymous research dataset is available via DATA-CAN, the Health Data Research Hub for Cancer in the UK. CONCLUSIONS Through ethical, legal and academic review, we believe that we contribute a defined approach that represents a framework that exceeds current minimum standards for effective pseudonymisation and anonymisation. This paper describes our methods and provides supporting information to facilitate the use of this approach in research.
Affiliation(s)
- Kieran Zucker
  - Leeds Institute of Medical Research at St James’s, University of Leeds, Leeds, United Kingdom
- Paul Baxter
  - Leeds Institute of Cardiovascular and Metabolic Medicine, University of Leeds, Leeds, United Kingdom
- Penny Wright
  - Leeds Institute of Medical Research at St James’s, University of Leeds, Leeds, United Kingdom
- Jon Fistein
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, United Kingdom
- Alex F. Markham
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, United Kingdom
  - Leeds Institute of Medical Research at St James’s, University of Leeds, Leeds, United Kingdom
- Mark Birkin
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, United Kingdom
- Adam W. Glaser
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, United Kingdom
  - Leeds Institute of Medical Research at St James’s, University of Leeds, Leeds, United Kingdom
- Geoff Hall
  - Leeds Institute for Data Analytics, University of Leeds, Leeds, United Kingdom
  - Leeds Institute of Medical Research at St James’s, University of Leeds, Leeds, United Kingdom
5
Dong X, Randolph DA, Weng C, Kho AN, Rogers JM, Wang X. Developing High Performance Secure Multi-Party Computation Protocols in Healthcare: A Case Study of Patient Risk Stratification. AMIA Joint Summits on Translational Science Proceedings 2021; 2021:200-209. [PMID: 34457134 PMCID: PMC8378657] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Abstract
We demonstrate that secure multi-party computation (MPC) using garbled circuits is a viable technology for solving clinical use cases that require cross-institution data exchange and collaboration. We describe two MPC protocols, based on Yao's garbled circuits and tested using large and realistically synthesized datasets. Linking records using private set intersection (PSI), we compute two metrics often used in patient risk stratification: high utilizer identification (PSI-HU) and comorbidity index calculation (PSI-CI). Cuckoo hashing enables our protocols to achieve extremely fast run times, with answers to clinically meaningful questions produced in minutes instead of hours. Also, our protocols are provably secure against any computationally bounded adversary in a semi-honest setting, the de facto mode for cross-institution data analytics. Finally, these protocols eliminate the need for an implicitly trusted third-party "honest broker" to mediate the information linkage and exchange.
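Cuckoo hashing gives every item exactly two candidate buckets, so a membership check touches a constant number of slots; this is what makes it useful for limiting the number of comparisons a PSI protocol has to perform. The sketch below shows only the plaintext data structure, not the authors' garbled-circuit protocols. The class name `CuckooTable`, the SHA-256-derived bucket functions, the table size, and the sample identifiers are assumptions.

```python
import hashlib


class CuckooTable:
    """Two-table cuckoo hash set: each item lives in one of two buckets."""

    def __init__(self, n_buckets: int = 64, max_kicks: int = 32):
        self.n_buckets = n_buckets
        self.max_kicks = max_kicks
        self.tables = [[None] * n_buckets, [None] * n_buckets]

    def _bucket(self, item: str, which: int) -> int:
        # Derive two independent-looking bucket indices from one hash family.
        digest = hashlib.sha256(f"{which}:{item}".encode("utf-8")).digest()
        return int.from_bytes(digest[:8], "big") % self.n_buckets

    def contains(self, item: str) -> bool:
        # Lookup inspects exactly two slots, regardless of table load.
        return any(self.tables[t][self._bucket(item, t)] == item for t in (0, 1))

    def insert(self, item: str) -> bool:
        if self.contains(item):
            return True
        current, which = item, 0
        for _ in range(self.max_kicks):
            idx = self._bucket(current, which)
            # Place `current`, evicting whatever occupied the slot.
            current, self.tables[which][idx] = self.tables[which][idx], current
            if current is None:
                return True
            which = 1 - which  # the evicted item retries in its other table
        # Eviction chain too long: a full implementation would rehash here;
        # in this sketch the last evicted item is left unplaced.
        return False


table = CuckooTable()
record_ids = ["MRN-001", "MRN-002", "MRN-003", "MRN-004", "MRN-005"]
inserted = [table.insert(r) for r in record_ids]
```

In a PSI setting, one party would bucket its identifiers this way so that the other party only needs to compare each of its items against two positions rather than the whole set.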
Affiliation(s)
- Xiao Dong
  - Center for Clinical and Translational Science, University of Illinois College of Medicine, Chicago, Illinois, USA
- David A Randolph
  - Center for Clinical and Translational Science, University of Illinois College of Medicine, Chicago, Illinois, USA
- Chenkai Weng
  - Department of Computer Science, Northwestern University, Evanston, Illinois, USA
- Abel N Kho
  - Feinberg School of Medicine, Northwestern University, Chicago, Illinois, USA
- Jennie M Rogers
  - Department of Computer Science, Northwestern University, Evanston, Illinois, USA
- Xiao Wang
  - Department of Computer Science, Northwestern University, Evanston, Illinois, USA
6
Doctor JN, Vaidya J, Jiang X, Wang S, Schilling LM, Ong T, Matheny ME, Ohno-Machado L, Meeker D. Efficient determination of equivalence for encrypted data. Comput Secur 2020; 97. [PMID: 33223585 DOI: 10.1016/j.cose.2020.101939] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
Secure computation of equivalence has fundamental applications in many different areas, including healthcare. We study this problem in the context of matching an individual's identity to link medical records across systems under the socialist millionaires' problem: Two millionaires wish to determine if their fortunes are equal without disclosing their net worth (Boudot et al. 2001). In Theorem 2, we show that when a "greater than" algorithm is carried out on a totally ordered set it is easy to achieve secure matching without additional rounds of communication. We present this efficient solution to assess equivalence using a set intersection algorithm designed for "greater than" computation and demonstrate its effectiveness on equivalence of arbitrary data values, as well as demonstrate how it meets regulatory criteria for risk of disclosure.
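As background for why a comparison primitive can be enough to decide equality, recall the trichotomy property of a total order: for any two elements exactly one of $x > y$, $y > x$, or $x = y$ holds. The statement below is this standard fact only; it is not a restatement of the paper's Theorem 2, whose exact formulation is not given in the abstract.

```latex
\[
  x = y \iff \neg (x > y) \,\land\, \neg (y > x)
  \qquad \text{for all } x, y \text{ in a totally ordered set.}
\]
```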
Affiliation(s)
- Jason N Doctor
  - Schaeffer Center for Health Policy and Economics, University of Southern California, 635 Downey Way, Los Angeles, CA 90089-3333, United States
- Jaideep Vaidya
  - Management Science & Information Systems Department, Rutgers University, Newark, NJ, United States
- Xiaoqian Jiang
  - UCSD Health Department of Biomedical Informatics, UC San Diego, La Jolla, CA, United States
- Shuang Wang
  - UCSD Health Department of Biomedical Informatics, UC San Diego, La Jolla, CA, United States
- Lisa M Schilling
  - Department of Medicine, University of Colorado, Anschutz Medical Campus, CO, United States
- Toan Ong
  - Department of Medicine, University of Colorado, Anschutz Medical Campus, CO, United States
- Michael E Matheny
  - Geriatric Research Education and Clinical Care Service, Tennessee Valley Healthcare System VA, Nashville, TN, United States
  - Department of Biomedical Informatics, Medicine, and Biostatistics, Vanderbilt University Medical Center, Nashville, TN, United States
- Lucila Ohno-Machado
  - UCSD Health Department of Biomedical Informatics, UC San Diego, La Jolla, CA, United States
- Daniella Meeker
  - Keck School of Medicine, University of Southern California, Los Angeles, CA, United States
7
Abstract
OBJECTIVES Clinical Research Informatics (CRI) declares its scope in its name, but its content, both in terms of the clinical research it supports (and sometimes initiates) and the methods it has developed over time, reaches much further than the name suggests. The goal of this review is to celebrate the extraordinary diversity of activity and of results, not as a prize-giving pageant, but in recognition of the field, the community that both serves and is sustained by it, and of its interdisciplinarity and its international dimension. METHODS Beyond personal awareness of a range of work commensurate with the author's own research, it is clear that, even with a thorough literature search, a comprehensive review is impossible. Moreover, the field has grown and subdivided to an extent that makes it very hard for one individual to be familiar with every branch or with more than a few branches in any depth. A literature survey was conducted that focused on informatics-related terms in the general biomedical and healthcare literature, and specific concerns ("artificial intelligence", "data models", "analytics", etc.) in the biomedical informatics (BMI) literature. In addition to a selection from the results of these searches, suggestive references within them were also considered. RESULTS The substantive sections of the paper (Artificial Intelligence, Machine Learning, and "Big Data" Analytics; Common Data Models, Data Quality, and Standards; Phenotyping and Cohort Discovery; Privacy: Deidentification, Distributed Computation, Blockchain; Causal Inference and Real-World Evidence) provide broad coverage of these active research areas, with, no doubt, a bias towards this reviewer's interests and preferences, landing on a number of papers that stood out in one way or another, or, alternatively, exemplified a particular line of work. CONCLUSIONS CRI is thriving, not only in the familiar major centers of research, but more widely, throughout the world. This is not to pretend that the distribution is uniform, but to highlight the potential for this domain to play a prominent role in supporting progress in medicine, healthcare, and wellbeing everywhere. We conclude with the observation that CRI and its practitioners would make apt stewards of the new medical knowledge that their methods will bring forward.
Affiliation(s)
- Anthony Solomonides
  - Outcomes Research Network, Research Institute, NorthShore University HealthSystem, Evanston, IL, USA
8
Dong X, Randolph DA, Rajanna SK. Enabling Privacy Preserving Record Linkage Systems Using Asymmetric Key Cryptography. AMIA ... Annual Symposium Proceedings. AMIA Symposium 2020; 2019:380-388. [PMID: 32308831 PMCID: PMC7153159] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/11/2023]
Abstract
We present a systemic approach to devise and deploy Privacy Preserving Record Linkage (PPRL) systems using asymmetric key cryptography and illustrate the strengths of such an approach. With our approach, the security implications of sharing a common secret salt across the network may be avoided, allowing the local participating sites to use private keys along with the current cryptographic hashes to maximally secure their own data. In addition, the final cyphertext tokens are compatible with those used by existing record linkage modules, allowing seamless integration with the existing PPRL infrastructures for downstream analysis. Finally, study-specific hash production requires action only by the central party. The main intuition for this work is derived from how asymmetric key approaches have enabled internet-scale applications. We demonstrate that such a design, where the local sites no longer need special-purpose software, affords greater flexibility and scalability for large scale multi-site linkage studies.
Affiliation(s)
- Xiao Dong
  - Center for Clinical and Translational Science, University of Illinois at Chicago
- David A Randolph
  - Center for Clinical and Translational Science, University of Illinois at Chicago
9
Hindorff LA, Bonham VL, Ohno-Machado L. Enhancing diversity to reduce health information disparities and build an evidence base for genomic medicine. Per Med 2018; 15:403-412. [PMID: 30209973 PMCID: PMC6287493 DOI: 10.2217/pme-2018-0037] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Indexed: 12/17/2022]
Abstract
Advances in genomic medicine are arising from efforts to build a national learning healthcare system (LHS) and large-scale precision medicine studies. However, the underlying evidence base lacks sufficient data from populations historically underrepresented in biomedical research. Although the literature on health and healthcare disparities is extensive, disparities in the availability and quality of health information about diverse and underrepresented populations are less well characterized. This Perspective describes scientific and ethical benefits to incorporating health information from diverse and underrepresented populations in the LHS, resulting in a more robust and generalizable LHS. Near-term recommendations for incorporating diversity into the evidence base for genomic medicine are proposed, even as the groundwork for national and international efforts is underway.
Affiliation(s)
- Lucia A Hindorff
  - Division of Genomic Medicine, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Vence L Bonham
  - Division of Intramural Research, Social & Behavioral Research Branch & Office of the Director, National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, 20892, USA
- Lucila Ohno-Machado
  - UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA, 92093, USA