1
Herzig AF, Rubinacci S, Marenne G, Perdry H, Deleuze JF, Dina C, Barc J, Redon R, Delaneau O, Génin E. SURFBAT: a surrogate family based association test building on large imputation reference panels. G3 (Bethesda, Md.) 2025; 15:jkae287. [PMID: 39657733 PMCID: PMC12005154 DOI: 10.1093/g3journal/jkae287] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/06/2024] [Revised: 11/07/2024] [Accepted: 11/29/2024] [Indexed: 12/12/2024]
Abstract
Genotype-phenotype association tests are typically adjusted for population stratification using principal components that are estimated genome-wide. This lacks resolution when analyzing populations with fine structure and/or individuals with fine levels of admixture. This can affect power and precision, and is a particularly relevant consideration when control individuals are recruited using geographic selection criteria. Such is the case in France where we have recently created reference panels of individuals anchored to different geographic regions. To make correct comparisons against case groups, who would likely be gathered from large urban areas, new methods are needed. We present SURFBAT (a surrogate family based association test), which performs an approximation of the transmission-disequilibrium test. Our method hinges on the application of genotype imputation algorithms to match similar haplotypes between the case and control groups. This permits us to approximate local ancestry informed posterior probabilities of un-transmitted parental alleles of each case individual. This is achieved by assuming haplotypes from the imputation panel are well-matched for ancestry with the case individuals. When the first haplotype of an individual from the imputation panel matches that of a case individual, it is assumed that the second haplotype of the same reference individual can be used as a locally ancestry matched control haplotype and to approximately impute un-transmitted parental alleles. SURFBAT provides an association test that is inherently robust to fine-scale population stratification and opens up the possibility of efficiently using large imputation reference panels as control groups for association testing. In contrast to other methods for association testing that incorporate local-ancestry inference, SURFBAT does not require a set of ancestry groups to be defined, nor for local ancestry to be explicitly estimated. 
We demonstrate the interest of our tool on simulated datasets, as well as on a real-data example for a group of case individuals affected by Brugada syndrome.
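For orientation, the classical transmission-disequilibrium test that SURFBAT approximates can be sketched as below. This is the textbook McNemar-type statistic computed from counts of transmitted and un-transmitted alleles among heterozygous parents, not the SURFBAT implementation; the function names are illustrative assumptions, and SURFBAT itself works with imputed posterior probabilities rather than observed parental genotypes.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> float:
    """Classical TDT chi-square (1 degree of freedom).

    `transmitted` / `untransmitted` count how often heterozygous parents
    did or did not pass the tested allele to an affected child.
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no informative (heterozygous) transmissions")
    return (transmitted - untransmitted) ** 2 / informative


def chi2_1df_pvalue(statistic: float) -> float:
    """Upper-tail p-value of a chi-square variable with 1 degree of freedom."""
    # P(chi2_1 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2))
    return math.erfc(math.sqrt(statistic / 2.0))


if __name__ == "__main__":
    stat = tdt_statistic(transmitted=60, untransmitted=40)
    print(stat, chi2_1df_pvalue(stat))
```

Because each case is compared only with alleles its own parents did not transmit, the test is not confounded by population stratification, which is the property SURFBAT aims to retain when surrogate haplotypes from a reference panel stand in for the parents.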
Affiliation(s)
- Anthony F Herzig
  - Inserm, Université de Bretagne-Occidentale, EFS, UMR 1078, GGB, Brest F-29200, France
- Simone Rubinacci
  - Institute for Molecular Medicine Finland, University of Helsinki, Helsinki 00290, Finland
- Gaëlle Marenne
  - Inserm, Université de Bretagne-Occidentale, EFS, UMR 1078, GGB, Brest F-29200, France
- Hervé Perdry
  - CESP Inserm U1018, Université Paris-Saclay, Villejuif F-94807, France
- Jean-François Deleuze
  - Université Paris-Saclay, CEA, Centre National de Recherche en Génomique Humaine (CNRGH), Evry F-91000, France
  - CEPH, Fondation Jean Dausset, Paris F-75010, France
- Christian Dina
  - Nantes Université, CNRS, INSERM UMR 1087, L’Institut du Thorax, Nantes F-44000, France
- Julien Barc
  - Nantes Université, CNRS, INSERM UMR 1087, L’Institut du Thorax, Nantes F-44000, France
- Richard Redon
  - Nantes Université, CNRS, INSERM UMR 1087, L’Institut du Thorax, Nantes F-44000, France
- Emmanuelle Génin
  - Inserm, Université de Bretagne-Occidentale, EFS, UMR 1078, GGB, Brest F-29200, France
  - CHU Brest, Brest F-29200, France
2
Hong S, Walker CR, Choi YA, Gürsoy G. Secure and scalable gene expression quantification with pQuant. Nat Commun 2025; 16:2380. [PMID: 40064866 PMCID: PMC11894182 DOI: 10.1038/s41467-025-57393-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/17/2024] [Accepted: 02/21/2025] [Indexed: 03/14/2025] Open
Abstract
Next generation sequencing reads from RNA-seq studies expose private genotypes of individuals during computation. Here, we introduce pQuant, an algorithm that employs homomorphic encryption to ensure privacy-preserving quantification of gene expression from RNA-seq data across public and cloud servers. pQuant performs computations on encrypted data, allowing researchers to handle sensitive information without exposing it. Our evaluations demonstrate that pQuant achieves accuracy comparable to state-of-the-art non-secure algorithms like STAR and kallisto. pQuant is highly scalable and its runtime and memory do not depend on the number of reads. It also supports parallel processing to enhance efficiency regardless of the number of genes analyzed.
Affiliation(s)
- Seungwan Hong
  - Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
  - New York Genome Center, New York, NY, 10013, USA
- Conor R Walker
  - Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
  - New York Genome Center, New York, NY, 10013, USA
- Yoolim A Choi
  - Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
  - New York Genome Center, New York, NY, 10013, USA
- Gamze Gürsoy
  - Department of Biomedical Informatics, Columbia University, New York, NY, 10032, USA
  - New York Genome Center, New York, NY, 10013, USA
  - Department of Computer Science, Columbia University, New York, NY, 10027, USA
3
Zhi D, Jiang X, Harmanci A. Proxy panels enable privacy-aware outsourcing of genotype imputation. Genome Res 2025; 35:326-339. [PMID: 39794122 PMCID: PMC11874966 DOI: 10.1101/gr.278934.124] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/02/2024] [Accepted: 01/06/2025] [Indexed: 01/13/2025]
Abstract
One of the major challenges in genomic data sharing is protecting participants' privacy in collaborative studies and in cases when genomic data are outsourced to perform analysis tasks, for example, genotype imputation services and federated collaborative genomic analysis. Although numerous cryptographic methods have been developed, these methods may not yet be practical for population-scale tasks in terms of computational requirements, rely on high-level expertise in security, and require each algorithm to be implemented from scratch. In this study, we focus on outsourcing of genotype imputation, a fundamental task that utilizes population-level reference panels, and develop protocols that rely on using "proxy panels" to protect genotype panels while the imputation task is outsourced to servers. The proxy panels are generated through a series of protection mechanisms such as haplotype sampling, allele hashing, and coordinate anonymization to protect the underlying sensitive panel's genetic variant coordinates, genetic maps, and chromosome-wide haplotypes. Although the resulting proxy panels are almost distinct from the sensitive panels, they are valid panels that can be used as input to imputation methods such as Beagle. We demonstrate that proxy-based imputation protects against well-known attacks with a minor decrease in imputation accuracy for variants in a wide range of allele frequencies.
Affiliation(s)
- Degui Zhi
  - Department of Bioinformatics and Systems Medicine, D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
- Xiaoqian Jiang
  - Department of Health Data Science and Artificial Intelligence, D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
- Arif Harmanci
  - Department of Bioinformatics and Systems Medicine, D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
  - Department of Health Data Science and Artificial Intelligence, D. Bradley McWilliams School of Biomedical Informatics, University of Texas Health Science Center, Houston, Texas 77030, USA
4
Cho H, Froelicher D, Dokmai N, Nandi A, Sadhuka S, Hong MM, Berger B. Privacy-Enhancing Technologies in Biomedical Data Science. Annu Rev Biomed Data Sci 2024; 7:317-343. [PMID: 39178425 PMCID: PMC11346580 DOI: 10.1146/annurev-biodatasci-120423-120107] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 08/25/2024]
Abstract
The rapidly growing scale and variety of biomedical data repositories raise important privacy concerns. Conventional frameworks for collecting and sharing human subject data offer limited privacy protection, often necessitating the creation of data silos. Privacy-enhancing technologies (PETs) promise to safeguard these data and broaden their usage by providing means to share and analyze sensitive data while protecting privacy. Here, we review prominent PETs and illustrate their role in advancing biomedicine. We describe key use cases of PETs and their latest technical advances and highlight recent applications of PETs in a range of biomedical domains. We conclude by discussing outstanding challenges and social considerations that need to be addressed to facilitate a broader adoption of PETs in biomedical data science.
Affiliation(s)
- Hyunghoon Cho
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- David Froelicher
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Natnatee Dokmai
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Anupama Nandi
  - Department of Biomedical Informatics and Data Science, Yale School of Medicine, New Haven, Connecticut, USA
- Shuvom Sadhuka
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Matthew M Hong
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
- Bonnie Berger
  - Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
  - Department of Mathematics, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
5
Hong S, Choi YA, Joo DS, Gürsoy G. Privacy-preserving model evaluation for logistic and linear regression using homomorphically encrypted genotype data. J Biomed Inform 2024; 156:104678. [PMID: 38936565 PMCID: PMC11272436 DOI: 10.1016/j.jbi.2024.104678] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/16/2024] [Revised: 05/29/2024] [Accepted: 06/19/2024] [Indexed: 06/29/2024]
Abstract
OBJECTIVE: Linear and logistic regression are widely used statistical techniques in population genetics for analyzing genetic data and uncovering patterns and associations in large genetic datasets, such as identifying genetic variations linked to specific diseases or traits. However, obtaining statistically significant results from these studies requires large amounts of sensitive genotype and phenotype information from thousands of patients, which raises privacy concerns. Although cryptographic techniques such as homomorphic encryption offer a potential solution to these privacy concerns by allowing computations on encrypted data, previous methods leveraging homomorphic encryption have not addressed the confidentiality of shared models, which can leak information about the training data. METHODS: In this work, we present a secure model evaluation method for linear and logistic regression using homomorphic encryption for six prediction tasks, where input genotypes, output phenotypes, and model parameters are all encrypted. RESULTS: Our method ensures no private information leakage during inference and achieves high accuracy (≥93% for all outcomes) with each inference taking less than ten seconds for ∼200 genomes. CONCLUSION: Our study demonstrates that it is possible to perform linear and logistic regression model evaluation while protecting patient confidentiality with theoretical security guarantees. Our implementation and test data are available at https://github.com/G2Lab/privateML/.
Affiliation(s)
- Seungwan Hong
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
- Yoolim A Choi
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA
- Daniel S Joo
  - New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA
- Gamze Gürsoy
  - Department of Biomedical Informatics, Columbia University, New York, NY 10032, USA; New York Genome Center, New York, NY 10013, USA; Department of Computer Science, Columbia University, New York, NY 10032, USA
6
Aherrahrou N, Tairi H, Aherrahrou Z. Genomic privacy preservation in genome-wide association studies: taxonomy, limitations, challenges, and vision. Brief Bioinform 2024; 25:bbae356. [PMID: 39073827 PMCID: PMC11285165 DOI: 10.1093/bib/bbae356] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/24/2024] [Revised: 06/19/2024] [Accepted: 07/12/2024] [Indexed: 07/30/2024] Open
Abstract
Genome-wide association studies (GWAS) serve as a crucial tool for identifying genetic factors associated with specific traits. However, ethical constraints prevent the direct exchange of genetic information, prompting the need for privacy preservation solutions. To address these issues, earlier works are based on cryptographic mechanisms such as homomorphic encryption, secure multi-party computing, and differential privacy. Very recently, federated learning has emerged as a promising solution for enabling secure and collaborative GWAS computations. This work provides an extensive overview of existing methods for GWAS privacy preserving, with the main focus on collaborative and distributed approaches. This survey provides a comprehensive analysis of the challenges faced by existing methods, their limitations, and insights into designing efficient solutions.
Affiliation(s)
- Noura Aherrahrou
  - LISAC, Department of Computer Science, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah, B.P. 1796 – Atlas, 30003, Fez, Morocco
- Hamid Tairi
  - LISAC, Department of Computer Science, Faculty of Sciences Dhar El Mahraz, University Sidi Mohamed Ben Abdellah, B.P. 1796 – Atlas, 30003, Fez, Morocco
- Zouhair Aherrahrou
  - Institute for Cardiogenetics, Universität zu Lübeck, D-23562 Lübeck, Germany
  - DZHK (German Centre for Cardiovascular Research), Partner Site Hamburg/Kiel/Lübeck, Germany
  - University Heart Centre Lübeck, D-23562 Lübeck, Germany
7
Brauneck A, Schmalhorst L, Weiss S, Baumbach L, Völker U, Ellinghaus D, Baumbach J, Buchholtz G. Legal aspects of privacy-enhancing technologies in genome-wide association studies and their impact on performance and feasibility. Genome Biol 2024; 25:154. [PMID: 38872191 PMCID: PMC11170858 DOI: 10.1186/s13059-024-03296-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2023] [Accepted: 06/03/2024] [Indexed: 06/15/2024] Open
Abstract
Genomic data holds huge potential for medical progress but requires strict safety measures due to its sensitive nature to comply with data protection laws. This conflict is especially pronounced in genome-wide association studies (GWAS) which rely on vast amounts of genomic data to improve medical diagnoses. To ensure both their benefits and sufficient data security, we propose a federated approach in combination with privacy-enhancing technologies utilising the findings from a systematic review on federated learning and legal regulations in general and applying these to GWAS.
Affiliation(s)
- Alissa Brauneck
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Louisa Schmalhorst
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
- Stefan Weiss
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- Linda Baumbach
  - Department of Health Economics and Health Services Research, University Medical Center Hamburg-Eppendorf, Hamburg, Germany
- Uwe Völker
  - Interfaculty Institute of Genetics and Functional Genomics, Department of Functional Genomics, University Medicine Greifswald, Greifswald, Germany
- David Ellinghaus
  - Institute of Clinical Molecular Biology (IKMB), Kiel University and University Medical Center Schleswig-Holstein, Kiel, Germany
- Jan Baumbach
  - Institute for Computational Systems Biology, University of Hamburg, Hamburg, Germany
- Gabriele Buchholtz
  - Hamburg University Faculty of Law, University of Hamburg, Hamburg, Germany
8
Cavinato T, Rubinacci S, Malaspinas AS, Delaneau O. A resampling-based approach to share reference panels. Nature Computational Science 2024; 4:360-366. [PMID: 38745108 PMCID: PMC11136649 DOI: 10.1038/s43588-024-00630-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/02/2023] [Accepted: 04/16/2024] [Indexed: 05/16/2024]
Abstract
For many genome-wide association studies, imputing genotypes from a haplotype reference panel is a necessary step. Over the past 15 years, reference panels have become larger and more diverse, leading to improvements in imputation accuracy. However, the latest generation of reference panels is subject to restrictions on data sharing due to concerns about privacy, limiting their usefulness for genotype imputation. In this context, here we propose RESHAPE, a method that employs a recombination Poisson process on a reference panel to simulate the genomes of hypothetical descendants after multiple generations. This data transformation helps to protect against re-identification threats and preserves data attributes, such as linkage disequilibrium patterns and, to some degree, identity-by-descent sharing, allowing for genotype imputation. Our experiments on gold-standard datasets show that simulated descendants up to eight generations can serve as reference panels without substantially reducing genotype imputation accuracy.
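The idea of applying a recombination Poisson process to a reference panel can be illustrated with the rough sketch below. It is not the RESHAPE implementation: it assumes haplotypes are lists of 0/1 alleles, that positions are given on a genetic map in Morgans starting at 0, and that the effect of several generations can be approximated by a single round of pairwise segment swapping with an expected `generations` crossovers per Morgan. All function names are hypothetical.

```python
import random
from bisect import bisect_right


def sample_breakpoints(length_morgans, generations, rng):
    """Poisson-process breakpoints: exponential gaps, rate `generations` per Morgan."""
    breakpoints = []
    position = rng.expovariate(generations)
    while position < length_morgans:
        breakpoints.append(position)
        position += rng.expovariate(generations)
    return breakpoints


def recombine_pair(hap_a, hap_b, positions_morgans, generations, rng):
    """Swap alleles between two haplotypes after every odd-numbered breakpoint."""
    a, b = list(hap_a), list(hap_b)
    breakpoints = sample_breakpoints(positions_morgans[-1], generations, rng)
    for i, pos in enumerate(positions_morgans):
        # An odd number of crossovers to the left means the strands are exchanged here.
        if bisect_right(breakpoints, pos) % 2 == 1:
            a[i], b[i] = b[i], a[i]
    return a, b


def resample_panel(panel, positions_morgans, generations=8, seed=0):
    """Return a new panel in which randomly paired haplotypes exchanged segments."""
    rng = random.Random(seed)
    order = list(range(len(panel)))
    rng.shuffle(order)
    out = [list(h) for h in panel]
    # With an odd number of haplotypes, the last one in `order` is left unchanged.
    for first, second in zip(order[0::2], order[1::2]):
        out[first], out[second] = recombine_pair(
            panel[first], panel[second], positions_morgans, generations, rng
        )
    return out
```

Because alleles are only exchanged between haplotypes at the same site, per-site allele counts are unchanged, while the mosaic structure breaks up the long chromosome-wide haplotypes that could identify a panel member.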
Affiliation(s)
- Théo Cavinato
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
- Simone Rubinacci
  - Division of Genetics, Department of Medicine, Brigham and Women's Hospital and Harvard Medical School, Boston, MA, USA
  - Program in Medical and Population Genetics, Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Anna-Sapfo Malaspinas
  - Department of Computational Biology, University of Lausanne, Lausanne, Switzerland
  - Swiss Institute of Bioinformatics, University of Lausanne, Lausanne, Switzerland
9
Zhou J, Huang C, Gao X. Patient privacy in AI-driven omics methods. Trends Genet 2024; 40:383-386. [PMID: 38637270 DOI: 10.1016/j.tig.2024.03.004] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/27/2024] [Revised: 03/18/2024] [Accepted: 03/19/2024] [Indexed: 04/20/2024]
Abstract
Artificial intelligence (AI) in omics analysis raises privacy threats to patients. Here, we briefly discuss risk factors to patient privacy in data sharing, model training, and release, as well as methods to safeguard and evaluate patient privacy in AI-driven omics methods.
Affiliation(s)
- Juexiao Zhou
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia; Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
- Chao Huang
  - Ningbo Institute of Information Technology Application, Chinese Academy of Sciences (CAS), Ningbo, China
- Xin Gao
  - Computer Science Program, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia; Computational Bioscience Research Center, Computer, Electrical and Mathematical Sciences and Engineering Division, King Abdullah University of Science and Technology (KAUST), Thuwal, 23955-6900, Kingdom of Saudi Arabia
10
Shin H, Ryu K, Kim JY, Lee S. Application of privacy protection technology to healthcare big data. Digit Health 2024; 10:20552076241282242. [PMID: 39502481 PMCID: PMC11536567 DOI: 10.1177/20552076241282242] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/26/2023] [Accepted: 08/23/2024] [Indexed: 11/08/2024] Open
Abstract
With the advent of the big data era, data security issues are becoming more common. Healthcare organizations have more data to use for analysis, but they lose money every year due to their inability to prevent data leakage. To overcome these challenges, research on the use of data protection technologies in healthcare is actively underway, particularly research on state-of-the-art technologies, such as federated learning announced by Google and blockchain technology, which has recently attracted attention. To learn about these research efforts, we explored the research, methods, and limitations of the most widely used privacy technologies. After investigating related papers published between 2017 and 2023 and identifying the latest technology trends, we selected related papers and reviewed related technologies. In the process, four technologies were the focus of this study: blockchain, federated learning, homomorphic encryption, and differential privacy. Overall, our analysis provides researchers with insight into privacy technology research by suggesting the limitations of current privacy technologies and suggesting future research directions.
Affiliation(s)
- Hyunah Shin
  - Department of Healthcare Data Science Center, Konyang University Hospital, Daejeon, Republic of Korea
- Kyeongmin Ryu
  - Department of Healthcare Data Science Center, Konyang University Hospital, Daejeon, Republic of Korea
- Jong-Yeup Kim
  - Department of Healthcare Data Science Center, Konyang University Hospital, Daejeon, Republic of Korea
  - Department of Otorhinolaryngology—Head and Neck Surgery, Konyang University College of Medicine, Daejeon, Republic of Korea
  - Department of Biomedical Informatics, Konyang University College of Medicine, Daejeon, Republic of Korea
- Suehyun Lee
  - College of IT Convergence, Gachon University, Seongnam, Republic of Korea
11
Zhang QX, Liu T, Guo X, Zhen J, Yang MY, Khederzadeh S, Zhou F, Han X, Zheng Q, Jia P, Ding X, He M, Zou X, Liao JK, Zhang H, He J, Zhu X, Lu D, Chen H, Zeng C, Liu F, Zheng HF, Liu S, Xu HM, Chen GB. Searching across-cohort relatives in 54,092 GWAS samples via encrypted genotype regression. PLoS Genet 2024; 20:e1011037. [PMID: 38206971 PMCID: PMC10783776 DOI: 10.1371/journal.pgen.1011037] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 03/08/2023] [Accepted: 12/13/2023] [Indexed: 01/13/2024] Open
Abstract
Explicitly sharing individual-level data in genomics studies has many merits compared to sharing summary statistics, including stricter QC, common statistical analyses, relative identification, and improved statistical power in GWAS, but it is hampered by privacy or ethical constraints. In this study, we developed encG-reg, a regression approach that can detect relatives of various degrees based on encrypted genomic data, which is immune to ethical constraints. The encryption properties of encG-reg are based on random matrix theory: the original genotype matrix is masked without sacrificing the precision of individual-level genotype data. We established a connection between the dimension of the random matrix that masks the genotype matrices and the required precision of a study for encrypted genotype data. encG-reg has false positive and false negative rates equivalent to those obtained by sharing original individual-level data, and is computationally efficient when searching for relatives. We split the UK Biobank into its respective centers and then encrypted the genotype data. We observed that the relatives estimated using encG-reg were as accurate as those estimated by KING, a widely used software package that requires original genotype data. In a more complex application, we launched a finely devised multi-center collaboration across 5 research institutes in China, covering 9 cohorts of 54,092 GWAS samples. encG-reg again identified true relatives across the cohorts, even with different ethnic backgrounds and genotypic qualities. Our study clearly demonstrates that encrypted genomic data can be used for data sharing without loss of information or data sharing barrier.
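The masking idea described in this abstract can be illustrated with a small sketch. It is an assumption-laden toy, not the encG-reg code: genotypes are standardized with known allele frequencies, both cohorts multiply their matrices by the same random Gaussian matrix with entries of variance 1/k (so its expected outer product is the identity), and relatedness is read off the inner products of the masked rows. The function names, the choice of Gaussian entries, and the parameters `m` (markers) and `k` (mask columns) are illustrative.

```python
import random


def standardize(genotypes, freqs):
    """Center and scale 0/1/2 genotypes using allele frequencies."""
    return [
        [(g - 2 * p) / (2 * p * (1 - p)) ** 0.5 for g, p in zip(row, freqs)]
        for row in genotypes
    ]


def random_mask(m, k, rng):
    """m x k matrix with N(0, 1/k) entries, so E[S S^T] is the identity."""
    sd = k ** -0.5
    return [[rng.gauss(0.0, sd) for _ in range(k)] for _ in range(m)]


def encrypt(x, mask):
    """Masked genotypes X S; only this product would be exchanged."""
    k = len(mask[0])
    masked = []
    for row in x:
        enc = [0.0] * k
        for value, mask_row in zip(row, mask):
            for j in range(k):
                enc[j] += value * mask_row[j]
        masked.append(enc)
    return masked


def relatedness(enc_a, enc_b, m):
    """Approximates X_a X_b^T / m from the masked matrices."""
    return [[sum(u * v for u, v in zip(ra, rb)) / m for rb in enc_b] for ra in enc_a]


def demo(m=400, k=500, seed=7):
    rng = random.Random(seed)
    freqs = [rng.uniform(0.1, 0.9) for _ in range(m)]

    def person():
        # Two independent allele draws per marker give a 0/1/2 genotype.
        return [(rng.random() < p) + (rng.random() < p) for p in freqs]

    shared, other = person(), person()
    cohort_a = standardize([shared], freqs)         # cohort A holds `shared`
    cohort_b = standardize([shared, other], freqs)  # cohort B holds both
    mask = random_mask(m, k, rng)
    scores = relatedness(encrypt(cohort_a, mask), encrypt(cohort_b, mask), m)
    return scores[0]  # [score vs. same person, score vs. unrelated person]
```

In this toy setting the score for the duplicated individual concentrates near 1 and the score for the unrelated individual near 0; the noise added by masking shrinks as `k` grows, which mirrors the abstract's link between the dimension of the random matrix and the achievable precision.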
Affiliation(s)
- Qi-Xin Zhang
  - Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
  - Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China
- Tianzi Liu
  - CAS Key Laboratory of Computational Biology, Shanghai Institute of Nutrition and Health, Chinese Academy of Sciences, Shanghai, China
  - CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Xinxin Guo
  - School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China
- Jianxin Zhen
  - Central Laboratory, Shenzhen Baoan Women’s and Children’s Hospital, Shenzhen, Guangdong, China
- Meng-yuan Yang
  - Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Saber Khederzadeh
  - Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Fang Zhou
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Xiaotong Han
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
- Qiwen Zheng
  - CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Peilin Jia
  - CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
- Xiaohu Ding
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
- Mingguang He
  - State Key Laboratory of Ophthalmology, Zhongshan Ophthalmic Center, Sun Yat-sen University, Guangdong Provincial Key Laboratory of Ophthalmology and Visual Science, Guangdong Provincial Clinical Research Center for Ocular Diseases, Guangzhou, Guangdong, China
  - Centre for Eye Research Australia, Royal Victorian Eye and Ear Hospital, Melbourne, Victoria, Australia
  - Ophthalmology, Department of Surgery, University of Melbourne, Melbourne, Victoria, Australia
- Xin Zou
  - State Key Laboratory of CAD & GC, Zhejiang University, Hangzhou, Zhejiang, China
- Jia-Kai Liao
  - School of Mathematics and Statistics and Research Institute of Mathematical Sciences (RIMS), Jiangsu Provincial Key Laboratory of Educational Big Data Science and Engineering, Jiangsu Normal University, Xuzhou, Jiangsu, China
  - Ningbo Institute of Life and Health Industry, University of Chinese Academy of Sciences, Ningbo, Zhejiang, China
- Hongxin Zhang
  - State Key Laboratory of CAD & GC, Zhejiang University, Hangzhou, Zhejiang, China
- Ji He
  - Department of Neurology, Peking University Third Hospital, Beijing, China
- Xiaofeng Zhu
  - Department of Population and Quantitative Health Sciences, Case Western Reserve University, Cleveland, Ohio, United States of America
- Daru Lu
  - State Key Laboratory of Genetic Engineering and MOE Engineering Research Center of Gene Technology, School of Life Sciences and Zhongshan Hospital, Fudan University, Shanghai, China
  - NHC Key Laboratory of Birth Defects and Reproductive Health, Chongqing Population and Family Planning Science and Technology Research Institute, Chongqing, China
- Hongyan Chen
  - State Key Laboratory of Genetic Engineering, School of Life Sciences, Fudan University, Shanghai, China
- Changqing Zeng
  - CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
  - Henan Academy of Sciences, Zhengzhou, Henan, China
- Fan Liu
  - CAS Key Laboratory of Genomic and Precision Medicine, Beijing Institute of Genomics, Chinese Academy of Sciences, Beijing, China
  - Department of Forensic Sciences, College of Criminal Justice, Naif Arab University of Security Sciences, Riyadh, Kingdom of Saudi Arabia
- Hou-Feng Zheng
  - Diseases & Population (DaP) Geninfo Lab, School of Life Sciences, Westlake University, Hangzhou, Zhejiang, China
- Siyang Liu
  - School of Public Health (Shenzhen), Sun Yat-sen University, Shenzhen, Guangdong, China
- Hai-Ming Xu
  - Institute of Bioinformatics, Zhejiang University, Hangzhou, Zhejiang, China
- Guo-Bo Chen
  - Center for Reproductive Medicine, Department of Genetic and Genomic Medicine, and Clinical Research Institute, Zhejiang Provincial People’s Hospital, People’s Hospital of Hangzhou Medical College, Hangzhou, Zhejiang, China
  - Key Laboratory of Endocrine Gland Diseases of Zhejiang Province, Hangzhou, Zhejiang, China
12
Casaletto J, Bernier A, McDougall R, Cline MS. Federated Analysis for Privacy-Preserving Data Sharing: A Technical and Legal Primer. Annu Rev Genomics Hum Genet 2023; 24:347-368. [PMID: 37253596 PMCID: PMC10846631 DOI: 10.1146/annurev-genom-110122-084756] [Citation(s) in RCA: 7] [Impact Index Per Article: 3.5] [Indexed: 06/01/2023]
Abstract
Continued advances in precision medicine rely on the widespread sharing of data that relate human genetic variation to disease. However, data sharing is severely limited by legal, regulatory, and ethical restrictions that safeguard patient privacy. Federated analysis addresses this problem by transferring the code to the data, providing the technical and legal capability to analyze the data within their secure home environment rather than transferring the data to another institution for analysis. This allows researchers to gain new insights from data that cannot be moved, while respecting patient privacy and the data stewards' legal obligations. Because federated analysis is a technical solution to the legal challenges inherent in data sharing, the technology and policy implications must be evaluated together. Here, we summarize the technical approaches to federated analysis and provide a legal analysis of their policy implications.
Affiliation(s)
- James Casaletto
- Genomics Institute, University of California, Santa Cruz, California, USA
| | - Alexander Bernier
- Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Robyn McDougall
- Centre of Genomics and Policy, Faculty of Medicine and Health Sciences, McGill University, Montreal, Quebec, Canada
| | - Melissa S Cline
- Genomics Institute, University of California, Santa Cruz, California, USA
|
13
|
Li W, Chen H, Jiang X, Harmanci A. Federated generalized linear mixed models for collaborative genome-wide association studies. iScience 2023; 26:107227. [PMID: 37529100 PMCID: PMC10387571 DOI: 10.1016/j.isci.2023.107227] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/01/2022] [Revised: 01/28/2023] [Accepted: 06/23/2023] [Indexed: 08/03/2023] Open
Abstract
Federated association testing is a powerful approach to conduct large-scale association studies where sites share intermediate statistics through a central server. There are, however, several standing challenges. Confounding factors like population stratification should be carefully modeled across sites. In addition, it is crucial to consider disease etiology using flexible models to prevent biases. Privacy protections for participants pose another significant challenge. Here, we propose distributed Mixed Effects Genome-wide Association study (dMEGA), a method that enables federated generalized linear mixed model-based association testing across multiple sites without explicitly sharing genotype and phenotype data. dMEGA employs a reference projection to correct for population-stratification and utilizes efficient local-gradient updates among sites, incorporating both fixed and random effects. The accuracy and efficiency of dMEGA are demonstrated through simulated and real datasets. dMEGA is publicly available at https://github.com/Li-Wentao/dMEGA.
Affiliation(s)
- Wentao Li
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
| | - Han Chen
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
- School of Public Health, University of Texas Health Science Center, Houston, TX 77030, USA
| | - Xiaoqian Jiang
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
| | - Arif Harmanci
- School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX 77030, USA
|
14
|
Zhou J, Lei B, Lang H, Panaousis E, Liang K, Xiang J. Secure genotype imputation using homomorphic encryption. JOURNAL OF INFORMATION SECURITY AND APPLICATIONS 2023. [DOI: 10.1016/j.jisa.2022.103386] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/09/2022]
|
15
|
Suo J, Gu L, Yan X, Yang S, Hu X, Wang L. PP-DDP: a privacy-preserving outsourcing framework for solving the double digest problem. BMC Bioinformatics 2023; 24:34. [PMID: 36721089 PMCID: PMC9890771 DOI: 10.1186/s12859-023-05157-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/08/2022] [Accepted: 01/23/2023] [Indexed: 02/01/2023] Open
Abstract
BACKGROUND As one of the fundamental problems in bioinformatics, the double digest problem (DDP) focuses on reordering genetic fragments into a proper sequence. Although many algorithms for the DDP have been proposed over the past decades, solving the DDP is still believed to be very time-consuming because the problem is strongly NP-complete. However, none of these algorithms considers the privacy of the DDP data, which carries critical business interests and is collected through days or even months of gel-electrophoresis experiments. Thus, DDP data owners are reluctant to outsource the task of solving the DDP to the cloud. RESULTS Our main motivation in this paper is to design a secure outsourcing computation framework for solving the DDP. First, we propose a privacy-preserving outsourcing framework that handles the DDP using a cloud server. Then, to enable the cloud server to solve DDP instances over ciphertexts, an order-preserving homomorphic index scheme (OPHI) is tailored from an order-preserving encryption scheme published at CCS 2012. Finally, our previous work on solving the DDP, a quantum inspired genetic algorithm (QIGA), is merged into the outsourcing framework with the support of the proposed OPHI scheme. Moreover, after QIGA has been executed on the cloud server side, the optimal solution, i.e. two mapping sequences, is transferred publicly to the data owner. Security analysis shows that no one can learn any information about the original DDP data from these sequences. Performance analysis shows that the communication cost and the computational workload for both the client side and the server side are reasonable. In particular, our experiments show that PP-DDP can find optimal solutions with a high success rate on typical test DDP instances and random DDP instances, and that PP-DDP takes less running time than DDmap, SK05 and GM12, while keeping the original DDP data private.
CONCLUSION The proposed outsourcing framework, PP-DDP, is secure and effective for solving the DDP problem.
Affiliation(s)
- Jingwen Suo
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
| | - Lize Gu
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
| | - Xingyu Yan
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
| | - Sijia Yang
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
| | - Xiaoya Hu
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China
| | - Licheng Wang
- State Key Laboratory of Networking and Switching Technology, Beijing University of Posts and Telecommunications, Beijing, China; School of Cyberspace Science and Technology, Beijing Institute of Technology, Beijing, China
|
16
|
Sarkar E, Chielle E, Gursoy G, Chen L, Gerstein M, Maniatakos M. Privacy-preserving cancer type prediction with homomorphic encryption. Sci Rep 2023; 13:1661. [PMID: 36717667 PMCID: PMC9886900 DOI: 10.1038/s41598-023-28481-8] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Received: 04/01/2022] [Accepted: 01/19/2023] [Indexed: 01/31/2023] Open
Abstract
Cancer genomics tailors diagnosis and treatment based on an individual's genetic information and is the crux of precision medicine. However, analyzing and maintaining a high volume of genetic mutation data to build a machine learning (ML) model that predicts the cancer type is a computationally expensive task and is often outsourced to powerful cloud servers, raising critical privacy concerns for patients' data. Homomorphic encryption (HE) enables computation on encrypted data, thus providing cryptographic guarantees to protect privacy. However, the restrictive overheads of encrypted computation deter its usage. In this work, we explore the challenges of privacy-preserving cancer type prediction using a dataset consisting of more than 2 million genetic mutations from 2713 patients for several cancer types by building a highly accurate ML model and then implementing its privacy-preserving version in HE. Our solution for cancer type inference encodes somatic mutations based on their impact on the cancer genomes into the feature space and then uses statistical tests for feature selection. We propose a fast matrix multiplication algorithm for the HE-based model. Our final model achieves 0.98 micro-average area under curve, improving accuracy from 70.08 to 83.61%, and is 550 times faster than the standard matrix multiplication-based privacy-preserving models. Our tool can be found at https://github.com/momalab/octal-candet.
Affiliation(s)
- Esha Sarkar
- Tandon School of Engineering, New York University, Brooklyn, NY, 11201, USA.
| | - Eduardo Chielle
- Center for Cyber Security, New York University Abu Dhabi, Abu Dhabi, 129188, UAE
| | - Gamze Gursoy
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
| | - Leo Chen
- Department of Computer Science, Yale University, New Haven, CT, 06520, USA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, 06520, USA
| | - Michail Maniatakos
- Center for Cyber Security, New York University Abu Dhabi, Abu Dhabi, 129188, UAE
|
17
|
Kuo TT, Jiang X, Tang H, Wang X, Harmanci A, Kim M, Post K, Bu D, Bath T, Kim J, Liu W, Chen H, Ohno-Machado L. The evolving privacy and security concerns for genomic data analysis and sharing as observed from the iDASH competition. J Am Med Inform Assoc 2022; 29:2182-2190. [PMID: 36164820 PMCID: PMC9667175 DOI: 10.1093/jamia/ocac165] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 03/18/2022] [Revised: 08/25/2022] [Accepted: 09/13/2022] [Indexed: 01/11/2023] Open
Abstract
Concerns regarding inappropriate leakage of sensitive personal information as well as unauthorized data use are increasing with the growth of genomic data repositories. Therefore, privacy and security of genomic data have become increasingly important and need to be studied. With many protection techniques proposed, their applicability in support of biomedical research should be well understood. For this purpose, we have organized a community effort over the past 8 years through the integrating Data for Analysis, Anonymization, and SHaring (iDASH) consortium to address this practical challenge. In this article, we summarize our experience from these competitions, report lessons learned from the events in 2020/2021 as examples, and discuss potential future research directions in this emerging field.
Affiliation(s)
- Tsung-Ting Kuo
- Corresponding Author: Tsung-Ting Kuo, PhD, UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, CA 92093, USA
| | | | | | | | - Arif Harmanci
- School of Biomedical Informatics, The University of Texas Health Science Center at Houston, Houston, Texas, USA
| | - Miran Kim
- Department of Mathematics, Hanyang University, Seoul, Republic of Korea; Department of Computer Science, Hanyang University, Seoul, Republic of Korea
| | - Kai Post
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
| | - Diyue Bu
- Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
| | - Tyler Bath
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
| | - Jihoon Kim
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA
| | - Weijie Liu
- Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
| | - Hongbo Chen
- Luddy School of Informatics, Computing, and Engineering, Indiana University Bloomington, Bloomington, Indiana, USA
| | - Lucila Ohno-Machado
- UCSD Health Department of Biomedical Informatics, University of California San Diego, La Jolla, California, USA; Division of Health Services Research & Development, Veteran Affairs San Diego Healthcare System, San Diego, California, USA
|
18
|
TrustGWAS: A full-process workflow for encrypted GWAS using multi-key homomorphic encryption and pseudorandom number perturbation. Cell Syst 2022; 13:752-767.e6. [PMID: 36041458 DOI: 10.1016/j.cels.2022.08.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 11/19/2021] [Revised: 04/21/2022] [Accepted: 08/04/2022] [Indexed: 01/26/2023]
Abstract
The statistical power of genome-wide association studies (GWASs) is affected by the effective sample size. However, the privacy and security concerns associated with individual-level genotype data pose great challenges for cross-institutional cooperation. The full-process cryptographic solutions are in demand but have not been covered, especially the essential principal-component analysis (PCA). Here, we present TrustGWAS, a complete solution for secure, large-scale GWAS, recapitulating gold standard results against PLINK without compromising privacy and supporting basic PLINK steps including quality control, linkage disequilibrium pruning, PCA, chi-square test, Cochran-Armitage trend test, covariate-supported logistic regression and linear regression, and their sequential combinations. TrustGWAS leverages pseudorandom number perturbations for PCA and multiparty scheme of multi-key homomorphic encryption for all other modules. TrustGWAS can evaluate 100,000 individuals with 1 million variants and complete QC-LD-PCA-regression workflow within 50 h. We further successfully discover gene loci associated with fasting blood glucose, consistent with the findings of the ChinaMAP project.
|
19
|
Wang S, Kim M, Jiang X, Harmanci AO. Evaluation of vicinity-based hidden Markov models for genotype imputation. BMC Bioinformatics 2022; 23:356. [PMID: 36038834 PMCID: PMC9422108 DOI: 10.1186/s12859-022-04896-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Accepted: 08/08/2022] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND The decreasing cost of DNA sequencing has led to a great increase in our knowledge about genetic variation. While population-scale projects bring important insight into genotype-phenotype relationships, the cost of performing whole-genome sequencing on large samples is still prohibitive. In-silico genotype imputation coupled with genotyping-by-arrays is a cost-effective and accurate alternative for genotyping of common and uncommon variants. Imputation methods compare the genotypes of the typed variants with large population-specific reference panels and estimate the genotypes of untyped variants by making use of the linkage disequilibrium patterns. The most accurate imputation methods are based on the Li-Stephens hidden Markov model (HMM), which treats the sequence of each chromosome as a mosaic of the haplotypes from the reference panel. RESULTS Here we assess the accuracy of vicinity-based HMMs, where each untyped variant is imputed using the typed variants in a small window around itself (as small as 1 centimorgan). Locality-based imputation has recently been used by machine learning-based genotype imputation approaches. We assess how the parameters of the vicinity-based HMMs impact the imputation accuracy in a comprehensive set of benchmarks and show that vicinity-based HMMs can accurately impute common and uncommon variants. CONCLUSIONS Our results indicate that locality-based imputation models can be effectively used for genotype imputation. The parameter settings that we identified can be used in future methods, and vicinity-based HMMs can be used for re-structuring and parallelizing new imputation methods. The source code for the vicinity-based HMM implementations is publicly available at https://github.com/harmancilab/LoHaMMer.
Affiliation(s)
- Su Wang
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Miran Kim
- Department of Mathematics, Hanyang University, Seoul, 04763, Republic of Korea
| | - Xiaoqian Jiang
- Center for Secure Artificial Intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA
| | - Arif Ozgun Harmanci
- Center for Precision Health, School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, 77030, USA.
|
20
|
Kim M, Jiang X, Lauter K, Ismayilzada E, Shams S. Secure human action recognition by encrypted neural network inference. Nat Commun 2022; 13:4799. [PMID: 35970834 PMCID: PMC9378731 DOI: 10.1038/s41467-022-32168-5] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 12/09/2021] [Accepted: 07/12/2022] [Indexed: 11/29/2022] Open
Abstract
Advanced computer vision technology can provide near real-time home monitoring to support "aging in place" by detecting falls and symptoms related to seizures and stroke. Affordable webcams, together with cloud computing services (to run machine learning algorithms), can potentially bring significant social benefits. However, it has not been deployed in practice because of privacy concerns. In this paper, we propose a strategy that uses homomorphic encryption to resolve this dilemma, which guarantees information confidentiality while retaining action detection. Our protocol for secure inference can distinguish falls from activities of daily living with 86.21% sensitivity and 99.14% specificity, with an average inference latency of 1.2 seconds and 2.4 seconds on real-world test datasets using small and large neural nets, respectively. We show that our method enables a 613x speedup over the latency-optimized LoLa and achieves an average of 3.1x throughput increase in secure inference compared to the throughput-optimized nGraph-HE2.
Affiliation(s)
- Miran Kim
- Department of Mathematics, Hanyang University, Seoul, Republic of Korea.
- Department of Computer Science, Hanyang University, Seoul, Republic of Korea.
| | - Xiaoqian Jiang
- Center for Secure Artificial intelligence For hEalthcare (SAFE), School of Biomedical Informatics, University of Texas Health Science Center, Houston, TX, USA
| | | | - Elkhan Ismayilzada
- Department of Computer Science and Engineering, Ulsan National Institute of Science and Technology, Ulsan, Republic of Korea
| | - Shayan Shams
- Department of Applied Data Science, San Jose State University, San Jose, CA, USA.
|
21
|
Abstract
Genomics data are important for advancing biomedical research, improving clinical care, and informing other disciplines such as forensics and genealogy. However, privacy concerns arise when genomic data are shared. In particular, the identifying nature of genetic information, its direct relationship to health status, and the potential financial harm and stigmatization posed to individuals and their blood relatives call for a survey of the privacy issues related to sharing genetic and related data and potential solutions to overcome these issues. In this work, we provide an overview of the importance of genomic privacy, the information gleaned from genomics data, the sources of potential private information leakages in genomics, and ways to preserve privacy while utilizing the genetic information in research. We discuss the relationship between trust in the scientific community and protecting privacy, illuminating a future roadmap for data sharing and study participation.
Affiliation(s)
- Gamze Gürsoy
- Department of Biomedical Informatics, Columbia University, New York, NY, USA; New York Genome Center, New York, NY, USA
|
22
|
Gürsoy G, Brannon CM, Ni E, Wagner S, Khanna A, Gerstein M. Storing and analyzing a genome on a blockchain. Genome Biol 2022; 23:134. [PMID: 35765079 PMCID: PMC9241283 DOI: 10.1186/s13059-022-02699-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 10/30/2021] [Accepted: 06/04/2022] [Indexed: 11/26/2022] Open
Abstract
There are major efforts underway to make genome sequencing a routine part of clinical practice. A critical barrier to these is achieving practical solutions for data ownership and integrity. Blockchain provides solutions to these challenges in other realms, such as finance. However, its use in genomics is stymied due to the difficulty in storing large-scale data on-chain, slow transaction speeds, and limitations on querying. To overcome these roadblocks, we developed a private blockchain network to store genomic variants and reference-aligned reads on-chain. It uses nested database indexing with an accompanying tool suite to rapidly access and analyze the data.
Affiliation(s)
- Gamze Gürsoy
- Program in Computational Biology and Bioinformatics, Yale University, Whitney Avenue, New Haven, CT, 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, Whitney Avenue, New Haven, CT, 06520, USA
- Current Address: Department of Biomedical Informatics, Columbia University, New York, NY, USA
- Current Address: New York Genome Center, New York, NY, USA
| | - Charlotte M Brannon
- Program in Computational Biology and Bioinformatics, Yale University, Whitney Avenue, New Haven, CT, 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, Whitney Avenue, New Haven, CT, 06520, USA
- Current Address: Stanford University, Stanford, CA, USA
| | - Eric Ni
- Program in Computational Biology and Bioinformatics, Yale University, Whitney Avenue, New Haven, CT, 06520, USA
- Department of Molecular Biophysics and Biochemistry, Yale University, Whitney Avenue, New Haven, CT, 06520, USA
| | - Sarah Wagner
- Department of Computer Science, Yale University, Prospect Street, New Haven, CT, 06520, USA
| | - Amol Khanna
- Department of Biomedical Engineering, Johns Hopkins University, Baltimore, MD, USA
- Department of Applied Mathematics, Johns Hopkins University, Baltimore, MD, USA
| | - Mark Gerstein
- Program in Computational Biology and Bioinformatics, Yale University, Whitney Avenue, New Haven, CT, 06520, USA.
- Department of Molecular Biophysics and Biochemistry, Yale University, Whitney Avenue, New Haven, CT, 06520, USA.
- Department of Computer Science, Yale University, Prospect Street, New Haven, CT, 06520, USA.
|
23
|
Hong S, Park JH, Cho W, Choe H, Cheon JH. Secure tumor classification by shallow neural network using homomorphic encryption. BMC Genomics 2022; 23:284. [PMID: 35395714 PMCID: PMC8994372 DOI: 10.1186/s12864-022-08469-w] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 06/03/2021] [Accepted: 03/04/2022] [Indexed: 12/24/2022] Open
Abstract
BACKGROUND Disclosure of patients' genetic information in the process of applying machine learning techniques for tumor classification compromises the privacy of personal information. Homomorphic Encryption (HE), which supports operations on encrypted data, can be used as one of the tools to perform such computation without information leakage, but applying general machine learning algorithms directly is very challenging because of the limited set of operations that HE supports. In particular, non-polynomial activation functions, including softmax functions, are difficult to implement with HE and require a suitable approximation method to minimize the loss of accuracy. One task of the secure genome analysis competition iDASH 2020 was to develop a multi-label tumor classification method that uses HE to predict the class of samples based on genetic information. METHODS We develop a secure multi-label tumor classification method using HE to ensure privacy during all the computations of the model inference process. Our solution is based on a 1-layer neural network with the softmax activation function and uses the approximate HE scheme. We present an approximation method that enables softmax activation in the model using HE and a technique for efficiently encoding data to reduce computational costs. In addition, we propose an HE-friendly data filtering method to reduce the size of large-scale genetic data. RESULTS We analyze a dataset from The Cancer Genome Atlas (TCGA), which consists of 3,622 samples from 11 types of cancer with genetic features from 25,128 genes. Our preprocessing method reduces the number of genes to 4,096 or less and achieves a microAUC value of 0.9882 (85% accuracy) with a 1-layer shallow neural network. Using our model, we successfully compute the tumor classification inference steps on the encrypted test data in 3.75 minutes. As a result of exceptionally high microAUC values, our solution was awarded co-first place in iDASH 2020 Track 1: "Secure multi-label Tumor classification using Homomorphic Encryption". CONCLUSIONS Our solution is the first result of implementing a neural network model with softmax activation using HE. Also, the HE optimization methods presented in this work enable machine learning implementation using HE or other challenging HE applications.
Affiliation(s)
- Seungwan Hong
- Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea.
| | - Jai Hyun Park
- Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
| | - Wonhee Cho
- Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
| | - Hyeongmin Choe
- Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
| | - Jung Hee Cheon
- Department of Mathematical Sciences, Seoul National University, 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea; Cryptolab Inc., 1, Gwanak-ro, Gwanak-gu, Seoul, Republic of Korea
|
24
|
Privacy-preserving genotype imputation with fully homomorphic encryption. Cell Syst 2022; 13:173-182.e3. [PMID: 34758288 PMCID: PMC8857019 DOI: 10.1016/j.cels.2021.10.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 3.0] [Received: 08/21/2020] [Revised: 06/28/2021] [Accepted: 10/15/2021] [Indexed: 12/17/2022]
Abstract
Genotype imputation is the inference of unknown genotypes using known population structure observed in large genomic datasets; it can further our understanding of phenotype-genotype relationships and is useful for QTL mapping and GWASs. However, the compute-intensive nature of genotype imputation can overwhelm local servers for computation and storage. Hence, many researchers are moving toward using cloud services, raising privacy concerns. We address these concerns by developing an efficient, privacy-preserving algorithm called p-Impute. Our method uses homomorphic encryption, allowing calculations on ciphertext, thereby avoiding the decryption of private genotypes in the cloud. It is similar to k-nearest neighbor approaches, inferring missing genotypes in a genomic block based on the SNP genotypes of genetically related individuals in the same block. Our results demonstrate accuracy in agreement with the state-of-the-art plaintext solutions. Moreover, p-Impute is scalable to real-world applications as its memory and time requirements increase linearly with the increasing number of samples. p-Impute is freely available for download here: https://doi.org/10.5281/zenodo.5542001.
|
25
|
Dokmai N, Kockan C, Zhu K, Wang X, Sahinalp SC, Cho H. Privacy-preserving genotype imputation in a trusted execution environment. Cell Syst 2021; 12:983-993.e7. [PMID: 34450045 PMCID: PMC8542641 DOI: 10.1016/j.cels.2021.08.001] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.8] [Received: 04/02/2021] [Revised: 07/14/2021] [Accepted: 08/02/2021] [Indexed: 01/02/2023]
Abstract
Genotype imputation is an essential tool in genomics research, whereby missing genotypes are inferred using reference genomes to enhance downstream analyses. Recently, public imputation servers have allowed researchers to leverage large-scale genomic data resources for imputation. However, privacy concerns about uploading one's genetic data to a server limit the utility of these services. We introduce a secure hardware-based solution for privacy-preserving genotype imputation, which keeps the input genomes private by processing them within Intel SGX's trusted execution environment. Our solution features SMac, an efficient and secure imputation algorithm designed for Intel SGX, which employs a state-of-the-art imputation strategy also utilized by existing imputation servers. SMac achieves imputation accuracy equivalent to existing tools and provides protection against known side-channel attacks on SGX while maintaining scalability. We also show the necessity of our enhanced security by identifying vulnerabilities in existing imputation software. Our work represents a step toward privacy-preserving genomic analysis services.
Affiliation(s)
- Natnatee Dokmai
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA
| | - Can Kockan
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - Kaiyuan Zhu
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA; Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA
| | - XiaoFeng Wang
- Department of Computer Science, Indiana University, Bloomington, IN 47408, USA
| | - S Cenk Sahinalp
- Cancer Data Science Laboratory, National Cancer Institute, National Institutes of Health, Bethesda, MD 20892, USA.
| | - Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA 02142, USA.
|
26
|
Dokmai N, Kockan C, Zhu K, Wang X, Sahinalp SC, Cho H. Privacy-Preserving Genotype Imputation in a Trusted Execution Environment. RESEARCH IN COMPUTATIONAL MOLECULAR BIOLOGY : ... ANNUAL INTERNATIONAL CONFERENCE, RECOMB ... : PROCEEDINGS. RECOMB (CONFERENCE : 2005- ) 2021; 12:983-993.e7. [PMID: 34859247 PMCID: PMC8635452] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 06/13/2023]
Affiliation(s)
- Natnatee Dokmai
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - Can Kockan
- Department of Computer Science, Indiana University, Bloomington, IN, USA
- Cancer Data Science Lab, National Cancer Institute, NIH, Bethesda, MD, USA
| | - Kaiyuan Zhu
- Department of Computer Science, Indiana University, Bloomington, IN, USA
- Cancer Data Science Lab, National Cancer Institute, NIH, Bethesda, MD, USA
| | - XiaoFeng Wang
- Department of Computer Science, Indiana University, Bloomington, IN, USA
| | - S. Cenk Sahinalp
- Cancer Data Science Lab, National Cancer Institute, NIH, Bethesda, MD, USA
| | - Hyunghoon Cho
- Broad Institute of MIT and Harvard, Cambridge, MA, USA
|