1
Yaraş T, Oktay Y, Karakülah G. PGSXplorer: an integrated nextflow pipeline for comprehensive quality control and polygenic score model development. PeerJ 2025; 13:e18973. [PMID: 39959831 PMCID: PMC11829630 DOI: 10.7717/peerj.18973] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/01/2024] [Accepted: 01/21/2025] [Indexed: 02/18/2025] Open
Abstract
The rapid development of next-generation sequencing technologies and genomic data sharing initiatives during the post-Human Genome Project era has catalyzed major advances in individualized medicine research. Genome-wide association studies (GWAS) have become a cornerstone of efforts towards understanding the genetic basis of complex diseases, leading to the development of polygenic scores (PGS). Despite their immense potential, the scarcity of standardized PGS development pipelines limits widespread adoption of PGS. Herein, we introduce PGSXplorer, a comprehensive Nextflow DSL2 pipeline that enables quality control of genomic data and automates the phasing, imputation, and construction of PGS models using reference GWAS data. PGSXplorer integrates various PGS development tools such as PLINK, PRSice-2, LD-Pred2, Lassosum2, MegaPRS, SBayesR-C, PRS-CSx and MUSSEL, improving the generalizability of PGS through multi-origin data integration. Tested with synthetic datasets, our fully Docker-encapsulated tool has demonstrated scalability and effectiveness for both single- and multi-population analyses. Continuously updated as an open-source tool, PGSXplorer is freely available with user tutorials at https://github.com/tutkuyaras/PGSXplorer, making it a valuable resource for advancing precision medicine in genetic research.
Affiliation(s)
- Tutku Yaraş
  - İzmir Biomedicine and Genome Center, İzmir, Turkey
  - İzmir International Biomedicine and Genome Institute, Dokuz Eylül University, İzmir, Turkey
- Yavuz Oktay
  - İzmir Biomedicine and Genome Center, İzmir, Turkey
  - İzmir International Biomedicine and Genome Institute, Dokuz Eylül University, İzmir, Turkey
  - Department of Medical Biology, Faculty of Medicine, Dokuz Eylül University, İzmir, Turkey
- Gökhan Karakülah
  - İzmir Biomedicine and Genome Center, İzmir, Turkey
  - İzmir International Biomedicine and Genome Institute, Dokuz Eylül University, İzmir, Turkey
2
Jin J, Li B, Wang X, Yang X, Li Y, Wang R, Ye C, Shu J, Fan Z, Xue F, Ge T, Ritchie MD, Pasaniuc B, Wojcik G, Zhao B. PennPRS: a centralized cloud computing platform for efficient polygenic risk score training in precision medicine. medRxiv 2025:2025.02.07.25321875. [PMID: 39990574 PMCID: PMC11844566 DOI: 10.1101/2025.02.07.25321875] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 02/25/2025]
Abstract
Polygenic risk scores (PRS) are becoming increasingly vital for risk prediction and stratification in precision medicine. However, PRS model training presents significant challenges for broader adoption of PRS, including limited access to computational resources, difficulties in implementing advanced PRS methods, and availability and privacy concerns over individual-level genetic data. Cloud computing provides a promising solution with centralized computing and data resources. Here we introduce PennPRS (https://pennprs.org), a scalable cloud computing platform for online PRS model training in precision medicine. We developed novel pseudo-training algorithms for multiple PRS methods and ensemble approaches, enabling model training without requiring individual-level data. These methods were rigorously validated through extensive simulations and large-scale real data analyses involving over 6,000 phenotypes across various data sources. PennPRS supports online single- and multi-ancestry PRS training with seven methods, allowing users to upload their own data or query from more than 27,000 datasets in the GWAS Catalog, submit jobs, and download trained PRS models. Additionally, we applied our pseudo-training pipeline to train PRS models for over 8,000 phenotypes and made their PRS weights publicly accessible. In summary, PennPRS provides a novel cloud computing solution to improve the accessibility of PRS applications and reduce disparities in computational resources for the global PRS research community.
Affiliation(s)
- Jin Jin
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Penn Center for Eye-Brain Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Bingxuan Li
  - UCLA Samueli School of Engineering, Los Angeles, CA 90095, USA
- Xiyao Wang
  - Department of Computer Science, Columbia University, New York, NY 10027, USA
- Xiaochen Yang
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
- Yujue Li
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
- Ruofan Wang
  - Department of Biostatistics, Epidemiology and Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Chenglong Ye
  - Department of Statistics, University of Kentucky, Lexington, KY 40536, USA
- Juan Shu
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Zirui Fan
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA
- Fei Xue
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
- Tian Ge
  - Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA 02114, USA
- Marylyn D. Ritchie
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Penn Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Bogdan Pasaniuc
  - Department of Genetics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
- Genevieve Wojcik
  - Department of Epidemiology, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21205, USA
- Bingxin Zhao
  - Penn Center for Eye-Brain Health, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Department of Statistics, Purdue University, West Lafayette, IN 47907, USA
  - Department of Statistics and Data Science, University of Pennsylvania, Philadelphia, PA 19104, USA
  - Penn Institute for Biomedical Informatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA 19104, USA
3
Ruan Y, Bhukar R, Patel A, Koyama S, Hull L, Truong B, Hornsby W, Zhang H, Chatterjee N, Natarajan P. Leveraging genetic ancestry continuum information to interpolate PRS for admixed populations. medRxiv 2025:2024.11.09.24316996. [PMID: 39867390 PMCID: PMC11759244 DOI: 10.1101/2024.11.09.24316996] [Citation(s) in RCA: 1] [Impact Index Per Article: 1.0] [Indexed: 01/28/2025]
Abstract
The relatively low representation of admixed populations in both discovery and fine-tuning individual-level datasets limits polygenic risk score (PRS) development and equitable clinical translation for admixed populations. Under the assumption that the most informative PRS weight for a homogeneous sample varies linearly in an ancestry continuum space, we introduce a Genetic Distance-assisted PRS Combination Pipeline for Diverse Genetic Ancestries (DiscoDivas) to interpolate a harmonized PRS for diverse, especially admixed, ancestries, leveraging multiple PRS weights fine-tuned within single-ancestry samples and genetic distance. DiscoDivas treats ancestry as a continuous variable and does not require shifting between different models when calculating PRS for different ancestries. We generated PRS with DiscoDivas and with the current conventional method, i.e., fine-tuning multiple GWAS PRS using samples of matched or similar ancestry. DiscoDivas generated a harmonized PRS with accuracy comparable to or higher than that of the conventional approach, with the greatest advantage exhibited in admixed individuals.
Affiliation(s)
- Yunfeng Ruan
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
- Rohan Bhukar
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Aniruddh Patel
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Satoshi Koyama
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Laboratory for Cardiovascular Genomics and Informatics, RIKEN Center for Integrative Medical Sciences, Yokohama, Japan
- Leland Hull
  - Division of General Internal Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
- Buu Truong
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Genetic Epidemiology and Statistical Genetics, Harvard T.H. Chan School of Public Health, Cambridge, MA, USA
- Whitney Hornsby
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
- Haoyu Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Nilanjan Chatterjee
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD, USA
  - Department of Oncology, School of Medicine, Johns Hopkins University, Baltimore, MD, USA
- Pradeep Natarajan
  - Program in Medical and Population Genetics and Cardiovascular Disease Initiative, Broad Institute of Harvard and MIT, Cambridge, MA, USA
  - Cardiovascular Research Center, Massachusetts General Hospital, Boston, MA, USA
  - Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
  - Department of Medicine, Harvard Medical School, Boston, MA, USA
4
Zhao Z, Dorn S, Wu Y, Yang X, Jin J, Lu Q. One score to rule them all: regularized ensemble polygenic risk prediction with GWAS summary statistics. bioRxiv 2024:2024.11.27.625748. [PMID: 39677614 PMCID: PMC11642782 DOI: 10.1101/2024.11.27.625748] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 12/17/2024]
Abstract
Ensemble learning has become increasingly popular for boosting the predictive power of polygenic risk scores (PRS), with almost every recent multi-ancestry PRS approach employing ensemble learning as a final step. Existing ensemble approaches rely on individual-level data for model training, which severely limits their real-world applications, especially in non-European populations without sufficient genomic samples. Here, we introduce a statistical framework to construct regularized ensemble PRS, which allows us to combine a large number of candidate PRS models using only summary statistics from genome-wide association studies. We demonstrate its robust and substantial improvement over many existing PRS models in both within- and cross-ancestry applications. We believe this is truly "one score to rule them all" due to its capability to continuously combine newly developed PRS models with existing models to improve prediction performance, which makes it a universal approach that should always be employed in future PRS applications.
Affiliation(s)
- Zijie Zhao
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Stephen Dorn
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Yuchang Wu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Xiaoyu Yang
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
- Jin Jin
  - Department of Biostatistics, Epidemiology and Bioinformatics, Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA
- Qiongshi Lu
  - Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison, Madison, WI
  - Department of Statistics, University of Wisconsin-Madison, Madison, WI
5
Blechter B, Wang X, Shi J, Shiraishi K, Choi J, Matsuo K, Chen TY, Dai J, Hung RJ, Chen K, Shu XO, Kim YT, Choudhury PP, Williams J, Landi MT, Lin D, Zheng W, Yin Z, Zhou B, Wang J, Seow WJ, Song L, Chang IS, Hu W, Chien LH, Cai Q, Hong YC, Kim HN, Wu YL, Wong MP, Richardson BD, Li S, Zhang T, Breeze C, Wang Z, Bassig BA, Kim JH, Albanes D, Wong JY, Shin MH, Chung LP, Yang Y, An SJ, Zheng H, Yatabe Y, Zhang XC, Kim YC, Caporaso NE, Chang J, Man Ho JC, Kubo M, Daigo Y, Song M, Momozawa Y, Kamatani Y, Kobayashi M, Okubo K, Honda T, Hosgood HD, Kunitoh H, Watanabe SI, Miyagi Y, Nakayama H, Matsumoto S, Horinouchi H, Tsuboi M, Hamamoto R, Goto K, Ohe Y, Takahashi A, Goto A, Minamiya Y, Hara M, Nishida Y, Takeuchi K, Wakai K, Matsuda K, Murakami Y, Shimizu K, Suzuki H, Saito M, Ohtaki Y, Tanaka K, Wu T, Wei F, Dai H, Machiela MJ, Su J, Kim YH, Oh IJ, Fun Lee VH, Chang GC, Tsai YH, Che KY, Huang MS, Su WC, Chen YM, Seow A, Park JY, Kweon SS, Chen KC, Gao YT, Qian B, Wu C, Lu D, Liu J, Schwartz AG, Houlston R, Spitz MR, Gorlov IP, Wu X, Yang P, Lam S, Tardon A, Chen C, Bojesen SE, Johansson M, Risch A, Bickeböller H, Ji BT, Wichmann HE, Christiani DC, Rennert G, Arnold S, Brennan P, McKay J, Field JK, Davies MPA, Shete SS, Le Marchand L, Liu G, Andrew A, Kiemeney LA, Zienolddiny-Narui S, Grankvist K, Johansson M, Cox A, Taylor F, Yuan JM, Lazarus P, Schabath MB, Aldrich MC, Jeon HS, Jiang SS, Sung JS, Chen CH, Hsiao CF, Jung YJ, Guo H, Hu Z, Burdett L, Yeager M, Hutchinson A, Hicks B, Liu J, Zhu B, Berndt SI, Wu W, Wang J, Li Y, Choi JE, Park KH, Sung SW, Liu L, Kang CH, Wang WC, Xu J, Guan P, Tan W, Yu CJ, Yang G, Loon Sihoe AD, Chen Y, Choi YY, Kim JS, Yoon HI, Park IK, Xu P, He Q, Wang CL, Hung HH, Vermeulen RCH, Cheng I, Wu J, Lim WY, Tsai FY, Chan JKC, Li J, Chen H, Lin HC, Jin L, Liu J, Sawada N, Yamaji T, Wyatt K, Li SA, Ma H, Zhu M, Wang Z, Cheng S, Li X, Ren Y, Chao A, Iwasaki M, Zhu J, Jiang G, Fei K, Wu G, Chen CY, Chen CJ, Yang PC, Yu J, Stevens VL, Fraumeni JF, Chatterjee N, Gorlova OY, Amos CI, Shen H, Hsiung CA, Chanock SJ, Rothman N, Kohno T, Lan Q, Zhang H. Stratifying Lung Adenocarcinoma Risk with Multi-ancestry Polygenic Risk Scores in East Asian Never-Smokers. medRxiv 2024:2024.06.26.24309127. [PMID: 38978671 PMCID: PMC11230324 DOI: 10.1101/2024.06.26.24309127] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 07/10/2024]
Abstract
Polygenic risk scores (PRSs) are promising for risk stratification but have mainly been developed in European populations. This study developed single- and multi-ancestry PRSs for lung adenocarcinoma (LUAD) in East Asian (EAS) never-smokers using genome-wide association study summary statistics from EAS (8,002 cases; 20,782 controls) and European (2,058 cases; 5,575 controls) populations. A multi-ancestry PRS, developed using CT-SLEB, was strongly associated with LUAD risk (odds ratio = 1.71, 95% confidence interval (CI): 1.61, 1.82), with an area under the receiver operating characteristic curve of 0.640 (95% CI: 0.629, 0.653). Individuals in the highest 20% of the PRS had nearly four times the risk compared to the lowest 20%. Individuals in the 95th percentile of the PRS had an estimated 6.69% lifetime absolute risk. Notably, this group reached the average population 10-year LUAD risk at age 50 (0.42%) by age 41. Our study underscores the potential of multi-ancestry PRS approaches to enhance LUAD risk stratification in EAS never-smokers.
6
Xu L, Zhou G, Jiang W, Zhang H, Dong Y, Guan L, Zhao H. JointPRS: A Data-Adaptive Framework for Multi-Population Genetic Risk Prediction Incorporating Genetic Correlation. bioRxiv 2024:2023.10.29.564615. [PMID: 37961111 PMCID: PMC10634936 DOI: 10.1101/2023.10.29.564615] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/15/2023]
Abstract
Genetic prediction accuracy for non-European populations is hindered by the limited sample size of genome-wide association study (GWAS) data in these populations. Additionally, it is challenging to tune model parameters with a small tuning dataset for methods that require tuning data, which is often the case for non-European samples. To address these challenges, we propose JointPRS, a novel, data-adaptive framework that simultaneously models multiple populations using GWAS summary statistics. JointPRS incorporates genetic correlation structures into the prediction framework, enabling accurate performance even without individual-level tuning data. Additionally, it uniquely employs a data-adaptive approach, providing a robust solution when only a small tuning dataset is available. Through extensive simulations and real data applications to 22 quantitative traits and four binary traits in five continental populations (European (EUR); East Asian (EAS); African (AFR); South Asian (SAS); and Admixed American (AMR)) evaluated using the UK Biobank (UKBB) and All of Us (AoU), we demonstrate that JointPRS outperforms six other state-of-the-art methods across three different data scenarios (no tuning data, tuning and testing data from the same cohort, and tuning and testing data from different cohorts) for most traits in non-European populations, while maintaining model simplicity and computational efficiency.
Affiliation(s)
- Leqi Xu
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Geyu Zhou
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Wei Jiang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Haoyu Zhang
  - Division of Cancer Epidemiology and Genetics, National Cancer Institute, Bethesda, MD, USA
- Yikai Dong
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Leying Guan
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
  - Program in Computational Biology and Bioinformatics, Yale University, New Haven, CT, USA
7
Chen T, Pham G, Fox L, Adler N, Wang X, Zhang J, Byun J, Han Y, Saunders GRB, Liu D, Bray MJ, Ramsey AT, McKay J, Bierut L, Amos CI, Hung RJ, Lin X, Zhang H, Chen LS. Genomic Insights for Personalized Care: Motivating At-Risk Individuals Toward Evidence-Based Health Practices. medRxiv 2024:2024.03.19.24304556. [PMID: 38562690 PMCID: PMC10984046 DOI: 10.1101/2024.03.19.24304556] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 04/04/2024]
Abstract
Background: Lung cancer and tobacco use pose significant global health challenges, necessitating a comprehensive translational roadmap for improved prevention strategies. Polygenic risk scores (PRSs) are powerful tools for patient risk stratification but have not yet been widely used in primary care for lung cancer, particularly in diverse patient populations. Methods: We propose the GREAT care paradigm, which employs PRSs to stratify disease risk and personalize interventions. We developed PRSs using large-scale multi-ancestry genome-wide association studies and standardized PRS distributions across all ancestries. We applied our PRSs to 796 individuals from the GISC Trial, 350,154 from UK Biobank (UKBB), and 210,826 from All of Us Research Program (AoU), totaling 561,776 individuals of diverse ancestry. Results: Significant odds ratios (ORs) for lung cancer and difficulty quitting smoking were observed in both UKBB and AoU. For lung cancer, the ORs for individuals in the highest risk group (top 20% versus bottom 20%) were 1.85 (95% CI: 1.58 - 2.18) in UKBB and 2.39 (95% CI: 1.93 - 2.97) in AoU. For difficulty quitting smoking, the ORs (top 33% versus bottom 33%) were 1.36 (95% CI: 1.32 - 1.41) in UKBB and 1.32 (95% CI: 1.28 - 1.36) in AoU. Conclusion: Our PRS-based intervention model leverages large-scale genetic data for robust risk assessment across populations. This model will be evaluated in two cluster-randomized clinical trials aimed at motivating health behavior changes in high-risk patients of diverse ancestry. This pioneering approach integrates genomic insights into primary care, promising improved outcomes in cancer prevention and tobacco treatment.
8
Shah Y, Kulm S, Nauseef JT, Chen Z, Elemento O, Kensler KH, Sharaf RN. Benchmarking multi-ancestry prostate cancer polygenic risk scores in a real-world cohort. PLoS Comput Biol 2024; 20:e1011990. [PMID: 38598551 PMCID: PMC11034641 DOI: 10.1371/journal.pcbi.1011990] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 04/22/2024] [Accepted: 03/11/2024] [Indexed: 04/12/2024] Open
Abstract
Prostate cancer is a heritable disease with ancestry-biased incidence and mortality. Polygenic risk scores (PRSs) offer promising advancements in predicting disease risk, including prostate cancer. While their accuracy continues to improve, research aimed at enhancing their effectiveness within African and Asian populations remains key for equitable use. Recent algorithmic developments for PRS derivation have resulted in improved pan-ancestral risk prediction for several diseases. In this study, we benchmark the predictive power of six widely used PRS derivation algorithms, four of which adjust for ancestry, against prostate cancer cases and controls from the UK Biobank and All of Us cohorts. We find modest improvement in discriminatory ability when compared with a simple method that prioritizes variants, clumping, and published polygenic risk scores. Our findings underscore the importance of improving upon risk prediction algorithms and the sampling of diverse cohorts.
Affiliation(s)
- Yajas Shah
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York City, New York, United States of America
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York City, New York, United States of America
- Scott Kulm
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York City, New York, United States of America
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York City, New York, United States of America
- Jones T. Nauseef
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York City, New York, United States of America
  - Department of Medicine–Hematology and Medical Oncology, Weill Cornell Medicine, New York City, New York, United States of America
- Zhengming Chen
  - Department of Population Health Sciences, Weill Cornell Medicine, New York City, New York, United States of America
- Olivier Elemento
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York City, New York, United States of America
  - Department of Physiology and Biophysics, Weill Cornell Medicine, New York City, New York, United States of America
- Kevin H. Kensler
  - Department of Population Health Sciences, Weill Cornell Medicine, New York City, New York, United States of America
- Ravi N. Sharaf
  - Englander Institute for Precision Medicine, Weill Cornell Medicine, New York City, New York, United States of America
  - Department of Population Health Sciences, Weill Cornell Medicine, New York City, New York, United States of America
  - Department of Medicine–Gastroenterology and Hepatology, Weill Cornell Medicine, New York City, New York, United States of America
9
Jiang W, Chen L, Girgenti MJ, Zhao H. Tuning parameters for polygenic risk score methods using GWAS summary statistics from training data. Nat Commun 2024; 15:24. [PMID: 38169469 PMCID: PMC10762162 DOI: 10.1038/s41467-023-44009-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/15/2023] [Accepted: 11/27/2023] [Indexed: 01/05/2024] Open
Abstract
Various polygenic risk score (PRS) methods have been proposed to combine the estimated effects of single nucleotide polymorphisms (SNPs) to predict genetic risks for common diseases, using data collected from genome-wide association studies (GWAS). Some methods require an external individual-level GWAS dataset for parameter tuning, posing privacy and security concerns. Leaving out part of the data for parameter tuning can also reduce model prediction accuracy. In this article, we propose PRStuning, a method that tunes parameters for different PRS methods using GWAS summary statistics from the training data. PRStuning predicts the PRS performance with different parameters, and then selects the best-performing parameters. Because directly using training data effects tends to overestimate the performance in the testing data, we adopt an empirical Bayes approach to shrinking the predicted performance in accordance with the genetic architecture of the disease. Extensive simulations and real data applications demonstrate PRStuning's accuracy across PRS methods and parameters.
Affiliation(s)
- Wei Jiang
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
- Ling Chen
  - Department of Statistics, Columbia University, New York, NY, USA
- Hongyu Zhao
  - Department of Biostatistics, Yale School of Public Health, New Haven, CT, USA
10
Kachuri L, Chatterjee N, Hirbo J, Schaid DJ, Martin I, Kullo IJ, Kenny EE, Pasaniuc B, Witte JS, Ge T. Principles and methods for transferring polygenic risk scores across global populations. Nat Rev Genet 2024; 25:8-25. [PMID: 37620596 PMCID: PMC10961971 DOI: 10.1038/s41576-023-00637-2] [Citation(s) in RCA: 103] [Impact Index Per Article: 103.0] [Accepted: 07/11/2023] [Indexed: 08/26/2023]
Abstract
Polygenic risk scores (PRSs) summarize the genetic predisposition of a complex human trait or disease and may become a valuable tool for advancing precision medicine. However, PRSs that are developed in populations of predominantly European genetic ancestries can increase health disparities due to poor predictive performance in individuals of diverse and complex genetic ancestries. We describe genetic and modifiable risk factors that limit the transferability of PRSs across populations and review the strengths and weaknesses of existing PRS construction methods for diverse ancestries. Developing PRSs that benefit global populations in research and clinical settings provides an opportunity for innovation and is essential for health equity.
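As a generic illustration of what a PRS "summarizes", and not something taken from the cited review, a score is commonly computed as a weighted sum of effect-allele dosages and then standardized within a cohort. The sketch below assumes that additive form; the function names, effect sizes, and genotypes are all hypothetical.

```python
from statistics import mean, pstdev


def polygenic_score(dosages, weights):
    """Weighted sum of allele dosages (0, 1, or 2 copies of the effect allele)."""
    if len(dosages) != len(weights):
        raise ValueError("dosages and weights must have the same length")
    return sum(d * w for d, w in zip(dosages, weights))


def standardize(scores):
    """Center and scale raw scores against the cohort they were computed in."""
    mu = mean(scores)
    sd = pstdev(scores)
    if sd == 0:
        raise ValueError("scores have zero variance")
    return [(s - mu) / sd for s in scores]


# Hypothetical per-variant effect sizes (e.g. log odds ratios from a GWAS).
weights = [0.5, -0.25, 1.0]
# Hypothetical genotypes: one row per individual, one dosage per variant.
genotypes = [[0, 1, 2], [2, 0, 0], [1, 1, 1]]

raw_scores = [polygenic_score(g, weights) for g in genotypes]
z_scores = standardize(raw_scores)
```

Because the weights are estimated in a particular discovery population, the same weighted sum can rank individuals less accurately in a population with different allele frequencies or linkage patterns, which is the transferability problem the abstract describes.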
Affiliation(s)
- Linda Kachuri
- Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, CA, USA
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA, USA
- Nilanjan Chatterjee
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD, USA
- Jibril Hirbo
- Department of Medicine Division of Genetic Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
- Vanderbilt Genetics Institute, Vanderbilt University Medical Center, Nashville, TN, USA
- Daniel J Schaid
- Department of Quantitative Health Sciences, Mayo Clinic, Rochester, MN, USA
- Iman Martin
- Division of Genomic Medicine, National Human Genome Research Institute, Bethesda, MD, USA
- Iftikhar J Kullo
- Department of Cardiovascular Medicine, Mayo Clinic, Rochester, MN, USA
- Eimear E Kenny
- Institute for Genomic Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Medicine, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Department of Genetics and Genomic Sciences, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- Bogdan Pasaniuc
- Department of Computational Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Human Genetics, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- Department of Pathology and Laboratory Medicine, David Geffen School of Medicine, University of California, Los Angeles, Los Angeles, CA, USA
- John S Witte
- Department of Epidemiology and Population Health, Stanford University School of Medicine, Stanford, CA, USA
- Stanford Cancer Institute, Stanford University School of Medicine, Stanford, CA, USA
- Department of Biomedical Data Science, Stanford University, Stanford, CA, USA
- Department of Genetics, Stanford University, Stanford, CA, USA
- Tian Ge
- Psychiatric and Neurodevelopmental Genetics Unit, Center for Genomic Medicine, Massachusetts General Hospital, Boston, MA, USA
- Center for Precision Psychiatry, Department of Psychiatry, Massachusetts General Hospital, Harvard Medical School, Boston, MA, USA
- Stanley Center for Psychiatric Research, Broad Institute of MIT and Harvard, Cambridge, MA, USA
|
11
|
Khan A, Shang N, Nestor JG, Weng C, Hripcsak G, Harris PC, Gharavi AG, Kiryluk K. Polygenic risk alters the penetrance of monogenic kidney disease. Nat Commun 2023; 14:8318. [PMID: 38097619 PMCID: PMC10721887 DOI: 10.1038/s41467-023-43878-9] [Citation(s) in RCA: 14] [Impact Index Per Article: 7.0] [Received: 07/11/2023] [Accepted: 11/22/2023] [Indexed: 12/17/2023] Open
Abstract
Chronic kidney disease (CKD) is determined by an interplay of monogenic, polygenic, and environmental risks. Autosomal dominant polycystic kidney disease (ADPKD) and COL4A-associated nephropathy (COL4A-AN) represent the most common forms of monogenic kidney diseases. These disorders have incomplete penetrance and variable expressivity, and we hypothesize that polygenic factors explain some of this variability. By combining SNP array, exome/genome sequence, and electronic health record data from the UK Biobank and All-of-Us cohorts, we demonstrate that the genome-wide polygenic score (GPS) significantly predicts CKD among ADPKD monogenic variant carriers. Compared to the middle tertile of the GPS for noncarriers, ADPKD variant carriers in the top tertile have a 54-fold increased risk of CKD, while ADPKD variant carriers in the bottom tertile have only a 3-fold increased risk of CKD. Similarly, the GPS significantly predicts CKD in COL4A-AN carriers. The carriers in the top tertile of the GPS have a 2.5-fold higher risk of CKD, while the risk for carriers in the bottom tertile is not different from the average population risk. These results suggest that accounting for polygenic risk improves risk stratification in monogenic kidney disease.
Affiliation(s)
- Atlas Khan
- Division of Nephrology, Department of Medicine, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
- Ning Shang
- Division of Nephrology, Department of Medicine, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
- Jordan G Nestor
- Division of Nephrology, Department of Medicine, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
- Chunhua Weng
- Department of Biomedical Informatics, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
- George Hripcsak
- Department of Biomedical Informatics, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
- Peter C Harris
- Division of Nephrology and Hypertension, Mayo Clinic, Rochester, MN, USA
- Ali G Gharavi
- Division of Nephrology, Department of Medicine, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
- Krzysztof Kiryluk
- Division of Nephrology, Department of Medicine, Vagelos College of Physicians & Surgeons, Columbia University, New York, NY, USA
|
12
|
Chen T, Zhang H, Mazumder R, Lin X. Ensembled best subset selection using summary statistics for polygenic risk prediction. bioRxiv 2023:2023.09.25.559307. [PMID: 37886515 PMCID: PMC10602024 DOI: 10.1101/2023.09.25.559307] [Citation(s) in RCA: 2] [Impact Index Per Article: 1.0] [Indexed: 10/28/2023]
Abstract
Polygenic risk scores (PRS) enhance population risk stratification and advance personalized medicine, yet existing methods face a tradeoff between predictive power and computational efficiency. We introduce ALL-Sum, a fast and scalable PRS method that combines an efficient summary statistic-based L0L2 penalized regression algorithm with an ensembling step that aggregates estimates from different tuning parameters for improved prediction performance. In extensive large-scale simulations across a wide range of polygenicity and genome-wide association studies (GWAS) sample sizes, ALL-Sum consistently outperforms popular alternative methods in terms of prediction accuracy, runtime, and memory usage. We analyze 27 published GWAS summary statistics for 11 complex traits from 9 reputable data sources, including the Global Lipids Genetics Consortium, Breast Cancer Association Consortium, and FinnGen, evaluated using individual-level UKBB data. ALL-Sum achieves the highest accuracy for most traits, particularly for GWAS with large sample sizes. We provide ALL-Sum as a user-friendly command-line software with pre-computed reference data for streamlined user-end analysis.
|