1
Shin J, Wu J, Kim HJ, Xi W. Neighborhood-level social determinants of suicidality in youth with schizophrenia: An EHR-based study. Schizophr Res 2025; 281:74-81. [PMID: 40318312 DOI: 10.1016/j.schres.2025.04.035] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/02/2025] [Revised: 04/18/2025] [Accepted: 04/29/2025] [Indexed: 05/07/2025]
Abstract
BACKGROUND Suicidal thoughts and behaviors (STB) among youth with schizophrenia represent a significant public health concern. It is well-established that neighborhood-level social determinants of health (SDoHs) can impact health outcomes in individuals with schizophrenia. We aimed to investigate the effects of neighborhood-level social determinants on developing future STB in youth with schizophrenia. METHODS We conducted a retrospective cohort study using electronic health records from the INSIGHT Clinical Research Network, which contains >22 million unique patients across five healthcare systems in New York City. Patients' neighborhood-level SDoHs were measured at their residential ZIP Code Tabulation Area using a composite measure, Social Deprivation Index (SDI), as well as specific components derived from the American Community Survey. Survival analysis was used to study the association between neighborhood-level SDoHs and time to STB since the first schizophrenia diagnosis. RESULTS Between 10/1/2015 and 10/1/2022, we identified 1209 youth aged between 10 and 25 years with a schizophrenia diagnosis and no prior STB, among whom 176 developed STB during follow-up. SDI quintiles were not associated with the risk of future STB, whereas two specific neighborhood characteristics, Gini index and percentage of residents commuting by car/truck/van, were associated with a decreased risk of STB, after controlling for patients' demographic characteristics. CONCLUSIONS Although the overall neighborhood deprivation level was not associated with the risk of STB among youth with schizophrenia, specific neighborhood characteristics were. These findings underscore the need for more targeted community-based suicide prevention strategies. Further research is essential to better understand the underlying mechanism of these associations.
Affiliation(s)
- Jeonghyun Shin
  - Department of Social and Behavioral Sciences, Harvard T.H. Chan School of Public Health, Boston, MA, USA
- Jialin Wu
  - Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA
- Hyun Jung Kim
  - Department of Psychiatry, Harvard Medical School, Boston, MA, USA; Division of Psychotic Disorders, McLean Hospital, Belmont, MA, USA
- Wenna Xi
  - Division of Biostatistics, Department of Population Health Sciences, Weill Cornell Medicine, New York, NY, USA.
2
Lo Barco T, Garcelon N, Neuraz A, Nabbout R. Natural history of rare diseases using natural language processing of narrative unstructured electronic health records: The example of Dravet syndrome. Epilepsia 2024; 65:350-361. [PMID: 38065926 DOI: 10.1111/epi.17855] [Citation(s) in RCA: 2] [Impact Index Per Article: 2.0] [Received: 07/04/2023] [Revised: 12/07/2023] [Accepted: 12/07/2023] [Indexed: 12/31/2023]
Abstract
OBJECTIVE The increasing implementation of electronic health records allows the use of advanced text-mining methods for establishing new patient phenotypes and stratification, and for revealing outcome correlations. In this study, we aimed to explore the electronic narrative clinical reports of a cohort of patients with Dravet syndrome (DS) longitudinally followed at our center, to assess the capacity of this methodology to retrace the natural history of DS during the early years. METHODS We used a document-based clinical data warehouse employing natural language processing to recognize the phenotype concepts in the narrative medical reports. We included patients with DS who had a medical report produced before the age of 2 years and a follow-up after the age of 3 years ("DS cohort," 56 individuals). We selected two control populations, a "general control cohort" (275 individuals) and a "neurological control cohort" (281 individuals), with similar characteristics in terms of gender, number of reports, and age at last report. To find concepts specifically associated with DS, we performed a phenome-wide association study using Cox regression, comparing the reports of the three cohorts. We then performed a qualitative analysis of the surviving concepts based on their median age at first appearance. RESULTS A total of 76 concepts were prevalent in the reports of children with DS. Concepts appearing during the first 2 years were mostly related to the epilepsy features at the onset of DS (convulsive and prolonged seizures triggered by fever, often requiring in-hospital care). Subsequently, concepts related to new types of seizures and to drug resistance appeared. A series of non-seizure-related concepts emerged after the age of 2-3 years, referring to the nonseizure comorbidities classically associated with DS. SIGNIFICANCE The extraction of clinical terms from narrative reports of children with DS allows outlining the known natural history of this rare disease in early childhood. This original model of "longitudinal phenotyping" could be applied to other rare and very rare conditions with poor natural history description.
Affiliation(s)
- Tommaso Lo Barco
  - Department of Pediatric Neurology, Necker-Enfants Malades Hospital, Assistance Publique-Hôpitaux de Paris, Reference Center for Rare Epilepsies, Member of European Reference Network EpiCARE, Université Paris Cité, Paris, France
- Nicolas Garcelon
  - Data Science Platform, Institut National de la Santé et de la Recherche Médicale Unité Mixte de Recherche 1163, Imagine Institute, Université Paris Cité, Paris, France
- Antoine Neuraz
  - Data Science Platform, Institut National de la Santé et de la Recherche Médicale Unité Mixte de Recherche 1163, Imagine Institute, Université Paris Cité, Paris, France
- Rima Nabbout
  - Department of Pediatric Neurology, Necker-Enfants Malades Hospital, Assistance Publique-Hôpitaux de Paris, Reference Center for Rare Epilepsies, Member of European Reference Network EpiCARE, Université Paris Cité, Paris, France
  - Translational Research for Neurological Disorders, Institut National de la Santé et de la Recherche Médicale Unité Mixte de Recherche 1163, Imagine Institute, Université Paris Cité, Paris, France
3
Gebski V, Silva SSM, Byth K, Jenkins A, Keech A. Improving efficiency of fitting Cox proportional hazards models for time-to-event outcomes in genome-wide association studies (GWAS). BIOINFORMATICS ADVANCES 2023; 3:vbad148. [PMID: 37928342 PMCID: PMC10625458 DOI: 10.1093/bioadv/vbad148] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 04/17/2023] [Revised: 10/02/2023] [Accepted: 10/11/2023] [Indexed: 11/07/2023]
Abstract
Summary Technologies identifying single nucleotide polymorphisms (SNPs) in DNA sequencing yield an avalanche of data requiring analysis and interpretation. Standard methods may require many weeks of processing time. The use of statistical methods requiring data sorting, high-dimensional matrix inversions, and replication in subsets of the data on multiple outcomes exacerbates these times. A method is proposed that reduces the computational time in problems with time-to-event outcomes and hundreds of thousands or millions of SNPs by using Cox-Snell residuals after fitting the Cox proportional hazards (PH) model to a fixed set of concomitant variables. This yields coefficients for the SNP effect from a Cox-Snell adjusted Poisson model and shows a high concordance with the adjusted PH model. The method is illustrated with a sample of 10 000 SNPs from a genome-wide association study in a diabetic population. The gain in processing efficiency using the proposed method based on Poisson modelling can be as high as 62%. This could save over three weeks of processing time if 5 million SNPs require analysis. The method involves only a single predictor variable (SNP), offering a simpler, computationally more stable approach to examining and identifying SNP patterns associated with the outcome(s), allowing for faster development of genetic signatures. Use of deviance residuals from the PH model to screen SNPs demonstrates a large discordance rate at a 0.2% threshold of concordance. This rate is 15 times larger than that based on the Cox-Snell residuals from the Cox-Snell adjusted Poisson model. Availability and implementation The method is simple to implement as the procedures are available in most statistical packages. The approach involves obtaining Cox-Snell residuals from a PH model fitted to a binary time-to-event outcome, using the factors that need to be common when assessing each SNP. Each SNP is then fitted as a predictor of the outcome of interest using a Poisson model with the Cox-Snell residual as the exposure variable.
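The two-step idea described in this abstract (Cox-Snell residuals first, then a per-SNP Poisson model with those residuals as the exposure) can be sketched roughly as follows. This is an illustrative sketch, not the authors' code: it assumes a first-stage model with no covariates, in which case the Cox-Snell residual reduces to the Nelson-Aalen cumulative hazard at each subject's observed time, and it fits the Poisson model with a hand-written Newton-Raphson loop. The function names (`cox_snell_null`, `fit_poisson_exposure`) and the toy data are hypothetical.

```python
import math


def cox_snell_null(times, events):
    """Cox-Snell residuals from a covariate-free model.

    Assumption: with no covariates, the residual for subject i is the
    Nelson-Aalen cumulative hazard evaluated at that subject's time.
    """
    order = sorted(range(len(times)), key=lambda i: times[i])
    residuals = [0.0] * len(times)
    cum_hazard = 0.0
    n_at_risk = len(times)
    k = 0
    while k < len(order):
        t = times[order[k]]
        tied = [i for i in order[k:] if times[i] == t]
        deaths = sum(events[i] for i in tied)
        cum_hazard += deaths / n_at_risk
        for i in tied:
            residuals[i] = cum_hazard
        n_at_risk -= len(tied)
        k += len(tied)
    return residuals


def fit_poisson_exposure(events, exposure, g, n_iter=50, tol=1e-10):
    """Fit events_i ~ Poisson(exposure_i * exp(b0 + b1 * g_i)) by Newton-Raphson."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for d, r, gi in zip(events, exposure, g):
            mu = r * math.exp(b0 + b1 * gi)
            u0 += d - mu            # score for the intercept
            u1 += (d - mu) * gi     # score for the SNP coefficient
            i00 += mu               # information matrix entries
            i01 += mu * gi
            i11 += mu * gi * gi
        det = i00 * i11 - i01 * i01
        step0 = (i11 * u0 - i01 * u1) / det
        step1 = (i00 * u1 - i01 * u0) / det
        b0 += step0
        b1 += step1
        if max(abs(step0), abs(step1)) < tol:
            break
    return b0, b1


# Toy data (hypothetical): follow-up time, event indicator, carrier status.
times = [2, 3, 4, 5, 6, 7, 8, 9, 10, 11]
events = [1, 1, 1, 0, 1, 0, 1, 0, 0, 1]
snp = [1, 1, 1, 1, 0, 0, 1, 0, 0, 0]

residuals = cox_snell_null(times, events)
intercept, snp_effect = fit_poisson_exposure(events, residuals, snp)
print(f"estimated log rate ratio for the SNP: {snp_effect:.3f}")
```

In a real analysis the first stage would include the fixed set of concomitant variables, and only the second, single-predictor stage would be repeated for every SNP, which is where the abstract locates the computational saving.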
Affiliation(s)
- Val Gebski
  - NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
- S Sandun M Silva
  - NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
- Karen Byth
  - NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
- Alicia Jenkins
  - NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
- Anthony Keech
  - NHMRC Clinical Trials Centre, University of Sydney, Camperdown, NSW 1450, Australia
4
Sadeqi MB, Ballvora A, Léon J. Local and Bayesian Survival FDR Estimations to Identify Reliable Associations in Whole Genome of Bread Wheat. Int J Mol Sci 2023; 24:14011. [PMID: 37762314 PMCID: PMC10531084 DOI: 10.3390/ijms241814011] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/03/2023] [Revised: 09/02/2023] [Accepted: 09/07/2023] [Indexed: 09/29/2023] Open
Abstract
Estimating the FDR significance threshold in genome-wide association studies remains a major challenge in distinguishing true positive hypotheses from false positive and negative errors. Several comparative methods for multiple testing comparison have been developed to determine the significance threshold; however, these methods may be overly conservative and lead to an increase in false negative results. The local FDR approach is suitable for testing many associations simultaneously based on the empirical Bayes perspective. In the local FDR, the maximum likelihood estimator is sensitive to bias when the GWAS model contains two or more explanatory variables as genetic parameters simultaneously. The main criticism of local FDR is that it focuses only locally on the effects of single nucleotide polymorphism (SNP) in tails of distribution, whereas the signal associations are distributed across the whole genome. The advantage of the Bayesian perspective is that knowledge of prior distribution comes from other genetic parameters included in the GWAS model, such as linkage disequilibrium (LD) analysis, minor allele frequency (MAF) and call rate of significant associations. We also proposed Bayesian survival FDR to solve the multi-collinearity and large-scale problems, respectively, in grain yield (GY) vector in bread wheat with large-scale SNP information. The objective of this study was to obtain a short list of SNPs that are reliably associated with GY under low and high levels of nitrogen (N) in the population. The five top significant SNPs were compared with different Bayesian models. Based on the time to events in the Bayesian survival analysis, the differentiation between minor and major alleles within the association panel can be identified.
Affiliation(s)
- Agim Ballvora
  - INRES-Plant Breeding, Rheinische Friedrich-Wilhelms-Universität Bonn, 53113 Bonn, Germany; (M.B.S.); (J.L.)
5
Pedersen EM, Agerbo E, Plana-Ripoll O, Steinbach J, Krebs MD, Hougaard DM, Werge T, Nordentoft M, Børglum AD, Musliner KL, Ganna A, Schork AJ, Mortensen PB, McGrath JJ, Privé F, Vilhjálmsson BJ. ADuLT: An efficient and robust time-to-event GWAS. Nat Commun 2023; 14:5553. [PMID: 37689771 PMCID: PMC10492844 DOI: 10.1038/s41467-023-41210-z] [Citation(s) in RCA: 5] [Impact Index Per Article: 2.5] [Received: 08/31/2022] [Accepted: 08/28/2023] [Indexed: 09/11/2023] Open
Abstract
Proportional hazards models have been proposed to analyse time-to-event phenotypes in genome-wide association studies (GWAS). However, little is known about the ability of proportional hazards models to identify genetic associations under different generative models and when ascertainment is present. Here we propose the age-dependent liability threshold (ADuLT) model as an alternative to a Cox regression based GWAS, here represented by SPACox. We compare ADuLT, SPACox, and standard case-control GWAS in simulations under two generative models and with varying degrees of ascertainment as well as in the iPSYCH cohort. We find Cox regression GWAS to be underpowered when cases are strongly ascertained (cases are oversampled by a factor 5), regardless of the generative model used. ADuLT is robust to ascertainment in all simulated scenarios. Then, we analyse four psychiatric disorders in iPSYCH, ADHD, Autism, Depression, and Schizophrenia, with a strong case-ascertainment. Across these psychiatric disorders, ADuLT identifies 20 independent genome-wide significant associations, case-control GWAS finds 17, and SPACox finds 8, which is consistent with simulation results. As more genetic data are being linked to electronic health records, robust GWAS methods that can make use of age-of-onset information will help increase power in analyses for common health outcomes.
Affiliation(s)
- Emil M Pedersen
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark.
- Esben Agerbo
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - Centre for Integrated Register-based Research at Aarhus University, Aarhus, Denmark
- Oleguer Plana-Ripoll
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
  - Department of Clinical Epidemiology, Aarhus University and Aarhus University Hospital, Aarhus, Denmark
- Jette Steinbach
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
- Morten D Krebs
  - Institute of Biological Psychiatry, Mental Health Center - Sct Hans, Copenhagen University Hospital - Mental Health Services CPH, Copenhagen, Denmark
- David M Hougaard
  - Department for Congenital Disorders, Statens Serum Institut, Copenhagen, Denmark
- Thomas Werge
  - Institute of Biological Psychiatry, Mental Health Center - Sct Hans, Copenhagen University Hospital - Mental Health Services CPH, Copenhagen, Denmark
  - Department of Clinical Sciences, Copenhagen University, Copenhagen, Denmark
  - Section for Geogenetics, GLOBE Institute, Faculty of Health and Medical Science, Copenhagen University, Copenhagen, Denmark
- Merete Nordentoft
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - CORE- Copenhagen Centre for Research in Mental Health, Mental Health Center-Copenhagen, Copenhagen University Hospital - Mental Health Services CPH, Copenhagen, Denmark
- Anders D Børglum
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
  - Department of Biomedicine and iSEQ Centre, Aarhus University, Aarhus, Denmark
  - Center for Genomics and Personalized Medicine, CGPM, Aarhus University, Aarhus, Denmark
- Katherine L Musliner
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
  - Department of Affective Disorders, Aarhus University Hospital-Psychiatry, Aarhus, Denmark
  - Department of Clinical Medicine, Aarhus University, Aarhus, Denmark
- Andrea Ganna
  - Institute for Molecular Medicine Finland, University of Helsinki, Helsinki, Finland
- Andrew J Schork
  - Institute of Biological Psychiatry, Mental Health Center - Sct Hans, Copenhagen University Hospital - Mental Health Services CPH, Copenhagen, Denmark
  - Section for Geogenetics, GLOBE Institute, Faculty of Health and Medical Science, Copenhagen University, Copenhagen, Denmark
  - Neurogenomics Division, The Translational Genomics Research Institute (TGEN), Phoenix, AZ, USA
- Preben B Mortensen
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
- John J McGrath
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
  - Queensland Brain Institute, University of Queensland, St Lucia, QLD, Australia
  - Queensland Centre for Mental Health Research, The Park Centre for Mental Health, Wacol, QLD, Australia
- Florian Privé
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark
- Bjarni J Vilhjálmsson
  - National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, iPSYCH, Aarhus, Denmark.
  - Bioinformatics Research Centre, Aarhus University, Aarhus, Denmark.
  - Novo Nordisk Foundation Center for Genomic Mechanisms of Disease, the Broad Institute of MIT and Harvard, Massachusetts, USA.
6
Bastarache L, Delozier S, Pandit A, He J, Lewis A, Annis AC, LeFaive J, Denny JC, Carroll RJ, Altman RB, Hughey JJ, Zawistowski M, Peterson JF. The phenotype-genotype reference map: Improving biobank data science through replication. Am J Hum Genet 2023; 110:1522-1533. [PMID: 37607538 PMCID: PMC10502848 DOI: 10.1016/j.ajhg.2023.07.012] [Citation(s) in RCA: 6] [Impact Index Per Article: 3.0] [Received: 04/19/2023] [Revised: 07/26/2023] [Accepted: 07/27/2023] [Indexed: 08/24/2023] Open
Abstract
Population-scale biobanks linked to electronic health record data provide vast opportunities to extend our knowledge of human genetics and discover new phenotype-genotype associations. Given their dense phenotype data, biobanks can also facilitate replication studies on a phenome-wide scale. Here, we introduce the phenotype-genotype reference map (PGRM), a set of 5,879 genetic associations from 523 GWAS publications that can be used for high-throughput replication experiments. PGRM phenotypes are standardized as phecodes, ensuring interoperability between biobanks. We applied the PGRM to five ancestry-specific cohorts from four independent biobanks and found evidence of robust replications across a wide array of phenotypes. We show how the PGRM can be used to detect data corruption and to empirically assess parameters for phenome-wide studies. Finally, we use the PGRM to explore factors associated with replicability of GWAS results.
Affiliation(s)
- Lisa Bastarache
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA.
- Sarah Delozier
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Anita Pandit
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Jing He
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Adam Lewis
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Aubrey C Annis
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Jonathon LeFaive
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Joshua C Denny
  - National Human Genome Research Institute, National Institutes of Health, Bethesda, MD, USA
- Robert J Carroll
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Russ B Altman
  - Department of Genetics, Stanford University, Stanford, CA, USA
- Jacob J Hughey
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA
- Matthew Zawistowski
  - Department of Biostatistics and Center for Statistical Genetics, University of Michigan School of Public Health, Ann Arbor, MI 48109, USA
- Josh F Peterson
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA; Department of Medicine, Vanderbilt University Medical Center, Nashville, TN, USA
7
Kan H, Liu H, Mu Y, Li Y, Zhang M, Cao Y, Dong Y, Li Y, Wang K, Li Q, Hu A, Zheng Y. Novel genetic variants linked to prelabor rupture of membranes among Chinese pregnant women. Placenta 2023; 137:14-22. [PMID: 37054626 DOI: 10.1016/j.placenta.2023.04.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 11/15/2022] [Revised: 03/04/2023] [Accepted: 04/07/2023] [Indexed: 04/15/2023]
Abstract
INTRODUCTION The etiology of prelabor rupture of membranes (PROM), either preterm or term PROM (PPROM or TPROM), remains largely unknown. This study aimed to investigate the association between maternal genetic variants (GVs) and PROM and further establish a GV-based prediction model for PROM. METHODS In this case-cohort study (n = 1166), Chinese pregnant women with PPROM (n = 51), TPROM (n = 283) and controls (n = 832) were enrolled. A weighted Cox model was applied to identify the GVs (single nucleotide polymorphisms [SNPs], insertions/deletions, and copy number variants) associated with either PPROM or TPROM. Gene set enrichment analysis (GSEA) was used to explore the mechanisms. The suggestively significant GVs were used to build a random forest (RF) model. RESULTS PTPRT variants (rs117950601, P = 4.37 × 10⁻⁹; rs147178603, P = 8.98 × 10⁻⁹) and an SNRNP40 variant (rs117573344, P = 2.13 × 10⁻⁸) were associated with PPROM. An STXBP5L variant (rs10511405, P = 4.66 × 10⁻⁸) was associated with TPROM. GSEA results showed that genes associated with PPROM were enriched in cell adhesion, and those associated with TPROM in ascorbate and glucuronidation metabolism. The area under the receiver operating characteristic curve of the SNP-based RF model for PPROM was 0.961, with a sensitivity of 100.0% and specificity of 83.3%. DISCUSSION Maternal GVs in PTPRT and SNRNP40 were associated with PPROM, and a GV in STXBP5L was associated with TPROM. Cell adhesion participated in PPROM, while ascorbate and glucuronidation metabolism contributed to TPROM. PPROM might be well predicted using the SNP-based RF model.
Affiliation(s)
- Hui Kan
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Haiyan Liu
  - Department of Clinical Laboratory, Anqing Municipal Hospital, Anqing, 246003, China; Department of Blood Transfusion, Anqing Municipal Hospital, Anqing, 246003, China
- Yutong Mu
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Yijie Li
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Miao Zhang
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Yanmin Cao
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Yao Dong
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Yaxin Li
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Kailin Wang
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China
- Qing Li
  - Department of Obstetrics and Gynecology, Anqing Municipal Hospital, Anqing, 246003, China.
- Anqun Hu
  - Department of Clinical Laboratory, Anqing Municipal Hospital, Anqing, 246003, China.
- Yingjie Zheng
  - Department of Epidemiology, School of Public Health, Fudan University, Shanghai, 200032, China; Key Laboratory for Health Technology Assessment, National Commission of Health and Family Planning, Fudan University, Shanghai, 200032, China; Laboratory of Public Health Safety, Ministry of Education, School of Public Health, Fudan University, Shanghai, 200032, China.
8
Irlmeier R, Hughey JJ, Bastarache L, Denny JC, Chen Q. Cox regression is robust to inaccurate EHR-extracted event time: an application to EHR-based GWAS. Bioinformatics 2022; 38:2297-2306. [PMID: 35157022 PMCID: PMC10060718 DOI: 10.1093/bioinformatics/btac086] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 05/25/2021] [Revised: 12/14/2021] [Accepted: 02/09/2022] [Indexed: 02/03/2023] Open
Abstract
MOTIVATION Logistic regression models are used in genomic studies to analyze genetic data linked to electronic health records (EHRs), but they do not make full use of the time-to-event information available in EHRs. Previous work has shown that Cox regression, which can account for left truncation and right censoring in EHRs, increases the power to detect genotype-phenotype associations compared to logistic regression. We extend this to evaluate the relative performance of Cox regression and various logistic regression models in the presence of positive errors in event time (delayed event time), relating to recorded event time accuracy. RESULTS One Cox model and three logistic regression models were considered under different scenarios of delayed event time. Extensive simulations and a genomic study application were used to evaluate the impact of delayed event time. While logistic regression does not model the time-to-event directly, various logistic regression models used in the literature were more sensitive to delayed event time than Cox regression. Results highlighted the importance of identifying and excluding patients diagnosed before entry time. Cox regression had similar or modestly improved statistical power compared with various logistic regression models at controlled type I error. This was supported by the empirical data, where the Cox models consistently had the highest sensitivity to detect known genotype-phenotype associations under all scenarios of delayed event time. AVAILABILITY AND IMPLEMENTATION Access to individual-level EHR and genotype data is restricted by the IRB. Simulation code and R script for data processing are at: https://github.com/QingxiaCindyChen/CoxRobustEHR.git. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
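To make concrete how a Cox model can accommodate left truncation and right censoring, the sketch below builds risk sets from an entry time and an exit time per subject and maximizes the partial likelihood for a single covariate. It is a minimal illustration under stated assumptions (one covariate, no tied event times, Newton-Raphson started at zero), not the code from the authors' repository; the function names and the toy data are hypothetical.

```python
import math


def risk_set(entry, exit_, t):
    """Subjects under observation at time t: entered before t and not yet exited.

    Left truncation is handled by the entry[j] < t condition; subjects who
    have not yet entered the record do not count as being at risk.
    """
    return [j for j in range(len(entry)) if entry[j] < t <= exit_[j]]


def score_and_information(beta, entry, exit_, event, x):
    """Score and observed information of the Cox partial log-likelihood."""
    score, info = 0.0, 0.0
    for i in range(len(exit_)):
        if not event[i]:
            continue  # right-censored subjects contribute only via risk sets
        members = risk_set(entry, exit_, exit_[i])
        weights = [math.exp(beta * x[j]) for j in members]
        s0 = sum(weights)
        s1 = sum(w * x[j] for w, j in zip(weights, members))
        s2 = sum(w * x[j] ** 2 for w, j in zip(weights, members))
        mean = s1 / s0
        score += x[i] - mean
        info += s2 / s0 - mean ** 2
    return score, info


def fit_cox(entry, exit_, event, x, n_iter=25, tol=1e-10):
    """Newton-Raphson estimate of the log hazard ratio for one covariate."""
    beta = 0.0
    for _ in range(n_iter):
        score, info = score_and_information(beta, entry, exit_, event, x)
        step = score / info
        beta += step
        if abs(step) < tol:
            break
    return beta


# Toy cohort (hypothetical): age at first record, age at event or censoring,
# event indicator, and a binary genotype.
entry = [0, 0, 2, 3, 1, 4, 6, 0]
exit_ = [5, 8, 6, 12, 9, 10, 11, 7]
event = [1, 1, 1, 0, 1, 1, 1, 0]
x = [1, 0, 1, 0, 1, 0, 0, 1]

beta_hat = fit_cox(entry, exit_, event, x)
print(f"estimated log hazard ratio: {beta_hat:.3f}")
```

Subject 6 in the toy data enters at time 6 and is therefore absent from the risk sets at times 5 and 6. Treating that subject as at risk from time zero would change the denominators of the partial likelihood at those event times.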
Affiliation(s)
- Rebecca Irlmeier
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Jacob J Hughey
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA; Department of Biomedical Sciences, Vanderbilt University, Nashville, TN 37203, USA
- Lisa Bastarache
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
- Joshua C Denny
  - All of Us Research Program, National Institutes of Health, Bethesda, MD 20892, USA
- Qingxia Chen
  - Department of Biostatistics, Vanderbilt University Medical Center, Nashville, TN 37203, USA; Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN 37203, USA
9
Pedersen EM, Agerbo E, Plana-Ripoll O, Grove J, Dreier JW, Musliner KL, Bækvad-Hansen M, Athanasiadis G, Schork A, Bybjerg-Grauholm J, Hougaard DM, Werge T, Nordentoft M, Mors O, Dalsgaard S, Christensen J, Børglum AD, Mortensen PB, McGrath JJ, Privé F, Vilhjálmsson BJ. Accounting for age of onset and family history improves power in genome-wide association studies. Am J Hum Genet 2022; 109:417-432. [PMID: 35139346 PMCID: PMC8948165 DOI: 10.1016/j.ajhg.2022.01.009] [Citation(s) in RCA: 16] [Impact Index Per Article: 5.3] [Received: 07/16/2021] [Accepted: 01/07/2022] [Indexed: 11/01/2022] Open
Abstract
Genome-wide association studies (GWASs) have revolutionized human genetics, allowing researchers to identify thousands of disease-related genes and possible drug targets. However, case-control status does not account for the fact that not all controls may have lived through their period of risk for the disorder of interest. This can be quantified by examining the age-of-onset distribution and the age of the controls or the age of onset for cases. The age-of-onset distribution may also depend on information such as sex and birth year. In addition, family history is not routinely included in the assessment of control status. Here, we present LT-FH++, an extension of the liability threshold model conditioned on family history (LT-FH), which jointly accounts for age of onset and sex as well as family history. Using simulations, we show that, when family history and the age-of-onset distribution are available, the proposed approach yields statistically significant power gains over LT-FH and large power gains over genome-wide association study by proxy (GWAX). We applied our method to four psychiatric disorders available in the iPSYCH data and to mortality in the UK Biobank and found 20 genome-wide significant associations with LT-FH++, compared to ten for LT-FH and eight for a standard case-control GWAS. As more genetic data with linked electronic health records become available to researchers, we expect methods that account for additional health information, such as LT-FH++, to become even more beneficial.
Affiliation(s)
- Emil M Pedersen
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark.
- Esben Agerbo
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Centre for Integrated Register-Based Research at Aarhus University, 8210 Aarhus, Denmark
- Oleguer Plana-Ripoll
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark
- Jakob Grove
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Bioinformatics Research Centre, Aarhus University, 8000 Aarhus, Denmark; Department of Biomedicine and Center for Integrative Sequencing, Aarhus University, 8000 Aarhus, Denmark; Center for Genomics and Personalized Medicine, Aarhus University, 8000 Aarhus, Denmark
- Julie W Dreier
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Centre for Integrated Register-Based Research at Aarhus University, 8210 Aarhus, Denmark
- Katherine L Musliner
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Centre for Integrated Register-Based Research at Aarhus University, 8210 Aarhus, Denmark
- Marie Bækvad-Hansen
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Center for Neonatal Screening, Department for Congenital Disorders, Statens Serum Institut, 2300 Copenhagen, Denmark
- Georgios Athanasiadis
  - Institute of Biological Psychiatry, MHC Sct. Hans, Mental Health Services Copenhagen, 4000 Roskilde, Denmark
- Andrew Schork
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Institute of Biological Psychiatry, MHC Sct. Hans, Mental Health Services Copenhagen, 4000 Roskilde, Denmark
- Jonas Bybjerg-Grauholm
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Center for Neonatal Screening, Department for Congenital Disorders, Statens Serum Institut, 2300 Copenhagen, Denmark
- David M Hougaard
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Center for Neonatal Screening, Department for Congenital Disorders, Statens Serum Institut, 2300 Copenhagen, Denmark
- Thomas Werge
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Institute of Biological Psychiatry, MHC Sct. Hans, Mental Health Services Copenhagen, 4000 Roskilde, Denmark; Department of Clinical Medicine, University of Copenhagen, 2200 Copenhagen, Denmark
- Merete Nordentoft
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Mental Health Services in the Capital Region of Denmark, Mental Health Center Copenhagen, University of Copenhagen, 2100 Copenhagen, Denmark
- Ole Mors
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Psychosis Research Unit, Aarhus University Hospital, 8245 Risskov, Denmark
- Søren Dalsgaard
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark
- Jakob Christensen
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Department of Neurology, Aarhus University Hospital, 8200 Aarhus, Denmark; Department of Clinical Medicine, Aarhus University, 8200 Aarhus, Denmark
- Anders D Børglum
  - Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Center for Genomics and Personalized Medicine, Aarhus University, 8000 Aarhus, Denmark; Department of Biomedicine - Human Genetics, Aarhus University, 8000 Aarhus, Denmark
- Preben B Mortensen
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Centre for Integrated Register-Based Research at Aarhus University, 8210 Aarhus, Denmark
- John J McGrath
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Queensland Brain Institute, University of Queensland, St Lucia, QLD 4072, Australia; Queensland Centre for Mental Health Research, The Park Centre for Mental Health, Wacol, QLD 4076, Australia
- Florian Privé
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark
- Bjarni J Vilhjálmsson
  - National Centre for Register-Based Research, Aarhus University, 8210 Aarhus, Denmark; Lundbeck Foundation Initiative for Integrative Psychiatric Research, 8210 Aarhus, Denmark; Bioinformatics Research Centre, Aarhus University, 8000 Aarhus, Denmark.
|
10
|
Kawaguchi ES, Li G, Lewinger JP, Gauderman WJ. Two-step hypothesis testing to detect gene-environment interactions in a genome-wide scan with a survival endpoint. Stat Med 2022; 41:1644-1657. [PMID: 35075649 PMCID: PMC9007892 DOI: 10.1002/sim.9319] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/13/2021] [Revised: 11/10/2021] [Accepted: 12/26/2021] [Indexed: 01/13/2023]
Abstract
Defined by their genetic profile, individuals may exhibit differential clinical outcomes due to an environmental exposure. Identifying subgroups based on specific exposure-modifying genes can lead to targeted interventions and focused studies. Genome-wide interaction scans (GWIS) can be performed to identify such genes, but these scans typically suffer from low power due to the large multiple testing burden. We provide a novel framework for powerful two-step hypothesis tests for GWIS with a time-to-event endpoint under the Cox proportional hazards model. In the Cox regression setting, we develop an approach that prioritizes genes for Step-2 G × E testing based on a carefully constructed Step-1 screening procedure. Simulation results demonstrate this two-step approach can lead to substantially higher power for identifying gene-environment (G × E) interactions compared to the standard GWIS while preserving the family-wise error rate over a range of scenarios. In a taxane-anthracycline chemotherapy study for breast cancer patients, the two-step approach identifies several gene expression by treatment interactions that would not be detected using the standard GWIS.
Affiliation(s)
- Eric S Kawaguchi
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- Gang Li
  - Department of Biostatistics, University of California, Los Angeles, Los Angeles, California, USA
  - Department of Computational Medicine, University of California, Los Angeles, Los Angeles, California, USA
- Juan Pablo Lewinger
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
- W James Gauderman
  - Department of Population and Public Health Sciences, University of Southern California, Los Angeles, California, USA
|
11
|
Grassmann F, Yang H, Eriksson M, Azam S, Hall P, Czene K. Mammographic features are associated with cardiometabolic disease risk and mortality. Eur Heart J 2021; 42:3361-3370. [PMID: 34338750 PMCID: PMC8423470 DOI: 10.1093/eurheartj/ehab502] [Citation(s) in RCA: 12] [Impact Index Per Article: 3.0] [Received: 12/10/2020] [Revised: 04/01/2021] [Accepted: 07/15/2021] [Indexed: 01/03/2023] Open
Abstract
AIMS In recent years, microcalcifications identified in routine mammograms were found to be associated with cardiometabolic disease in women. Here, we aimed to systematically evaluate the association of microcalcifications and other mammographic features with cardiometabolic disease risk and mortality in a large screening cohort and to understand a potential genetic contribution. METHODS AND RESULTS This study included 57 867 women from a prospective mammographic screening cohort in Sweden (KARMA) and 49 583 sisters. Cardiometabolic disease diagnoses and mortality and medication were extracted by linkage to Swedish population registries with virtually no missing data. In the cardiometabolic phenome-wide association study, we found that a higher number of microcalcifications was associated with increased risk for multiple cardiometabolic diseases, particularly in women with pre-existing cardiometabolic diseases. In contrast, dense breasts were associated with a lower incidence of cardiometabolic diseases. Importantly, we observed similar associations in sisters of KARMA women, indicating a potential genetic overlap between mammographic features and cardiometabolic traits. Finally, we observed that the presence of microcalcifications was associated with increased cardiometabolic mortality in women with pre-existing cardiometabolic diseases (hazard ratio and 95% confidence interval: 1.79 [1.24-2.58], P = 0.002) while we did not find such effects in women without cardiometabolic diseases. CONCLUSIONS We found that mammographic features are associated with cardiometabolic risk and mortality. Our results strengthen the notion that a combination of mammographic features and other breast cancer risk factors could be a novel and affordable tool to assess cardiometabolic health in women attending mammographic screening.
Affiliation(s)
- Felix Grassmann
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Nobels väg 12A, Stockholm 171 65, Sweden
  - Institute of Medical Sciences, University of Aberdeen, Foresterhill, Aberdeen AB25 2ZD, UK
- Haomin Yang
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Nobels väg 12A, Stockholm 171 65, Sweden
  - Department of Epidemiology and Health Statistics, The School of Public Health, Fujian Medical University, Xuefu North Road 1, University Town, Fuzhou 350122, China
- Mikael Eriksson
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Nobels väg 12A, Stockholm 171 65, Sweden
- Shadi Azam
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Nobels väg 12A, Stockholm 171 65, Sweden
- Per Hall
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Nobels väg 12A, Stockholm 171 65, Sweden
- Kamila Czene
  - Department of Medical Epidemiology and Biostatistics, Karolinska Institutet, Nobels väg 12A, Stockholm 171 65, Sweden
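Note on entry 11: the hazard ratio quoted in the abstract (1.79, 95% CI 1.24-2.58, P = 0.002) can be checked for internal consistency. The following is a minimal Python sketch, assuming the interval is a Wald-type interval that is symmetric on the log scale; the function name `p_from_ratio_ci` is hypothetical and does not come from the paper.

```python
import math


def p_from_ratio_ci(ratio, lower, upper, z_crit=1.959964):
    """Recover SE, z and a two-sided p-value from a ratio and its 95% CI.

    Assumes a Wald interval on the log scale: log(ratio) +/- z_crit * SE.
    """
    log_est = math.log(ratio)
    # Width of the interval on the log scale equals 2 * z_crit * SE.
    se = (math.log(upper) - math.log(lower)) / (2.0 * z_crit)
    z = log_est / se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return se, z, p


# Values as reported in the abstract of entry 11.
se, z, p = p_from_ratio_ci(1.79, 1.24, 2.58)
print(f"SE(log HR) = {se:.3f}, z = {z:.2f}, p = {p:.4f}")
```

Under these assumptions the recovered p-value rounds to the reported 0.002, so the three quoted numbers agree with one another.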
|
12
|
Suri P, Stanaway IB, Zhang Y, Freidin MB, Tsepilov YA, Carrell DS, Williams FM, Aulchenko YS, Hakonarson H, Namjou B, Crosslin DR, Jarvik GP, Lee MT. Genome-wide association studies of low back pain and lumbar spinal disorders using electronic health record data identify a locus associated with lumbar spinal stenosis. Pain 2021; 162:2263-2272. [PMID: 33729212 PMCID: PMC8277660 DOI: 10.1097/j.pain.0000000000002221] [Citation(s) in RCA: 10] [Impact Index Per Article: 2.5] [Received: 11/10/2020] [Accepted: 01/15/2021] [Indexed: 12/30/2022]
Abstract
Identifying genetic risk factors for lumbar spine disorders may lead to knowledge regarding underlying mechanisms and the development of new treatments. We conducted a genome-wide association study involving 100,811 participants with genotypes and longitudinal electronic health record data from the Electronic Medical Records and Genomics Network and Geisinger Health. Cases and controls were defined using validated algorithms and clinical diagnostic codes. Electronic health record-defined phenotypes included low back pain requiring healthcare utilization (LBP-HC), lumbosacral radicular syndrome (LSRS), and lumbar spinal stenosis (LSS). Genome-wide association study used logistic regression with additive genetic effects adjusting for age, sex, site-specific factors, and ancestry (principal components). A fixed-effect inverse-variance weighted meta-analysis was conducted. Genetic variants of genome-wide significance (P < 5 × 10⁻⁸) were carried forward for replication in an independent sample from UK Biobank. Phenotype prevalence was 48.8% for LBP-HC, 19.8% for LSRS, and 7.9% for LSS. No variants were significantly associated with LBP-HC. One locus was associated with LSRS (lead variant rs146153280:C>G, odds ratio [OR] = 1.17 for G, P = 2.1 × 10⁻⁹), but was not replicated. Another locus on chromosome 2 spanning GFPT1, NFU1, and AAK1 was associated with LSS (lead variant rs13427243:G>A, OR = 1.10 for A, P = 4.3 × 10⁻⁸) and replicated in UK Biobank (OR = 1.11, P = 5.4 × 10⁻⁵). This was the first genome-wide association study meta-analysis of lumbar spinal disorders using electronic health record data. We identified 2 novel associations with LSRS and LSS; the latter was replicated in an independent sample.
Affiliation(s)
- Pradeep Suri
  - Seattle Epidemiologic Research and Information Center, VA Puget Sound Health Care System, 1660 S. Columbian Way, Seattle, WA 98108, USA
  - Division of Rehabilitation Care Services, 1660 S. Columbian Way, Seattle, WA 98108, USA
  - Clinical Learning, Evidence, and Research Center, University of Washington, 325 Ninth Avenue, Box 359612 Seattle, WA 98104, USA
  - Department of Rehabilitation Medicine, University of Washington, 325 Ninth Avenue, Box 359612 Seattle, WA 98104, USA
- Ian B. Stanaway
  - Department of Medicine (Medical Genetics), University of Washington Medical Center, 3720 15th Ave NE, Seattle, WA 98105, USA
- Yanfei Zhang
  - Genomic Medicine Institute, Geisinger, 100 N. Academy Avenue, Danville, PA 17822, USA
- Maxim B. Freidin
  - Department of Twin Research and Genetic Epidemiology, School of Life Course Sciences, King’s College London, London, SE1 7EH, UK
- Yakov A. Tsepilov
  - Laboratory of Theoretical and Applied Functional Genomics, Novosibirsk State University, 1 Pirogova Street, Novosibirsk, 630090, Russia
  - Laboratory of Recombination and Segregation Analysis, Institute of Cytology and Genetics, 10 Lavrentiev Avenue, Novosibirsk, 630090, Russia
  - PolyOmica, 's-Hertogenbosch, 5237 PA, The Netherlands
- David S. Carrell
  - Kaiser Permanente Washington Health Research Institute, 1700 Minor Ave, Suite 1600, Seattle, WA 98101, USA
- Frances M.K. Williams
  - Department of Twin Research and Genetic Epidemiology, School of Life Course Sciences, King’s College London, London, SE1 7EH, UK
- Yurii S. Aulchenko
  - PolyOmica, 's-Hertogenbosch, 5237 PA, The Netherlands
  - Kurchatov Genomics Center of the Institute of Cytology and Genetics, Siberian Branch of the Russian Academy of Sciences, Novosibirsk, 630090, Russia
- Hakon Hakonarson
  - Department of Pediatrics, Children’s Hospital of Philadelphia, 3615 Civic Center Blvd., Philadelphia, PA 19104, USA
- Bahram Namjou
  - Department of Pediatrics, Cincinnati Children’s Hospital Medical Center, 3333 Burnet Ave, Cincinnati, OH 45229, USA
- David R. Crosslin
  - Department of Biomedical Informatics and Education, University of Washington, 3720 15th Ave NE, Seattle, WA 98105, USA
- Gail P. Jarvik
  - Department of Medicine (Medical Genetics), University of Washington Medical Center, 3720 15th Ave NE, Seattle, WA 98105, USA
- Ming Ta Lee
  - Genomic Medicine Institute, Geisinger, 100 N. Academy Avenue, Danville, PA 17822, USA
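Note on entry 12: the abstract names a fixed-effect inverse-variance weighted meta-analysis as the step that pools site-level results. A minimal Python sketch of that general technique follows. The function name `fixed_effect_meta` and the per-site numbers are hypothetical illustrations, not the authors' code or data.

```python
import math


def fixed_effect_meta(estimates, standard_errors):
    """Pool per-site effect estimates (e.g. log odds ratios) with
    inverse-variance weights under a fixed-effect model."""
    weights = [1.0 / se ** 2 for se in standard_errors]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, estimates)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    z = pooled / pooled_se
    # Two-sided p-value under a standard normal reference distribution.
    p = math.erfc(abs(z) / math.sqrt(2.0))
    return pooled, pooled_se, z, p


# Hypothetical per-site log odds ratios and standard errors (not study data).
site_log_or = [0.10, 0.08, 0.12]
site_se = [0.02, 0.04, 0.04]

pooled, pooled_se, z, p = fixed_effect_meta(site_log_or, site_se)
print(f"pooled OR = {math.exp(pooled):.3f}, p = {p:.2e}")
# A variant would pass the genome-wide threshold quoted in the abstract if p < 5e-8.
print("genome-wide significant:", p < 5e-8)
```

Sites with smaller standard errors receive larger weights, so the pooled estimate is pulled toward the most precisely estimated site.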
|
13
|
Abstract
Electronic health records (EHRs) are a rich source of data for researchers, but extracting meaningful information out of this highly complex data source is challenging. Phecodes represent one strategy for defining phenotypes for research using EHR data. They are a high-throughput phenotyping tool based on ICD (International Classification of Diseases) codes that can be used to rapidly define the case/control status of thousands of clinically meaningful diseases and conditions. Phecodes were originally developed to conduct phenome-wide association studies to scan for phenotypic associations with common genetic variants. Since then, phecodes have been used to support a wide range of EHR-based phenotyping methods, including the phenotype risk score. This review aims to comprehensively describe the development, validation, and applications of phecodes and suggest some future directions for phecodes and high-throughput phenotyping.
Affiliation(s)
- Lisa Bastarache
  - Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, Tennessee 37232, USA
|
14
|
Le Guen Y, Belloy ME, Napolioni V, Eger SJ, Kennedy G, Tao R, He Z, Greicius MD. A novel age-informed approach for genetic association analysis in Alzheimer's disease. Alzheimers Res Ther 2021; 13:72. [PMID: 33794991 PMCID: PMC8017764 DOI: 10.1186/s13195-021-00808-5] [Citation(s) in RCA: 19] [Impact Index Per Article: 4.8] [Received: 01/15/2021] [Accepted: 03/11/2021] [Indexed: 01/17/2023]
Abstract
BACKGROUND Many Alzheimer's disease (AD) genetic association studies disregard age or incorrectly account for it, hampering variant discovery. METHODS Using simulated data, we compared the statistical power of several models: logistic regression on AD diagnosis adjusted and not adjusted for age; linear regression on a score integrating case-control status and age; and multivariate Cox regression on age-at-onset. We applied these models to real exome-wide data of 11,127 sequenced individuals (54% cases) and replicated suggestive associations in 21,631 genotype-imputed individuals (51% cases). RESULTS Modeling variable AD risk across age results in 5-10% statistical power gain compared to logistic regression without age adjustment, while incorrect age adjustment leads to critical power loss. Applying our novel AD-age score and/or Cox regression, we discovered and replicated novel variants associated with AD on KIF21B, USH2A, RAB10, RIN3, and TAOK2 genes. CONCLUSION Our AD-age score provides a simple means for statistical power gain and is recommended for future AD studies.
Affiliation(s)
- Yann Le Guen
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA.
- Michael E Belloy
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
- Valerio Napolioni
  - School of Biosciences and Veterinary Medicine, University of Camerino, 62032, Camerino, Italy
- Sarah J Eger
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
- Gabriel Kennedy
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
- Ran Tao
  - Department of Biostatistics and Vanderbilt Genetic Institute, Vanderbilt University, Nashville, TN, 37203, USA
- Zihuai He
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
  - Quantitative Sciences Unit, Department of Medicine, Stanford University, Stanford, CA, 94304, USA
- Michael D Greicius
  - Department of Neurology and Neurological Sciences, Stanford University, Stanford, CA, 94304, USA
|
15
|
Li C, Wu D, Lu Q. Set-based genetic association and interaction tests for survival outcomes based on weighted V statistics. Genet Epidemiol 2020; 45:46-63. [PMID: 32896012 DOI: 10.1002/gepi.22353] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 04/10/2020] [Revised: 08/03/2020] [Accepted: 08/03/2020] [Indexed: 01/07/2023]
Abstract
With advancements in high-throughput technologies, studies have been conducted to investigate the role of massive genetic variants in human diseases. While set-based tests have been developed for binary and continuous disease outcomes, there are few computationally efficient set-based tests available for time-to-event outcomes. To facilitate the genetic association and interaction analyses of time-to-event outcomes, we develop a suite of multivariant tests based on weighted V statistics with or without considering potential genetic heterogeneity. In addition to the computation efficiency and nice asymptotic properties, all the new tests can deal with left truncation and competing risks in the survival data, and adjust for covariates. Simulation studies show that the new tests run faster, are more accurate in small samples, and account for confounding effect better than the existing multivariant survival tests. When the genetic effect is heterogeneous across individuals/subpopulations, the association test considering genetic heterogeneity is more powerful than the existing tests that do not account for genetic heterogeneity. Using the new methods, we perform a genome-wide association analysis of the genotype and age-to-Alzheimer's data from the Rush Memory and Aging Project and the Religious Orders Study. The analysis identifies two genes, APOE and APOC1, associated with age to Alzheimer's disease onset.
Affiliation(s)
- Chenxi Li
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, USA
- Di Wu
  - Department of Epidemiology and Biostatistics, Michigan State University, East Lansing, Michigan, USA
- Qing Lu
  - Department of Biostatistics, University of Florida, Gainesville, Florida, USA
|
16
|
Bi W, Fritsche LG, Mukherjee B, Kim S, Lee S. A Fast and Accurate Method for Genome-Wide Time-to-Event Data Analysis and Its Application to UK Biobank. Am J Hum Genet 2020; 107:222-233. [PMID: 32589924 DOI: 10.1016/j.ajhg.2020.06.003] [Citation(s) in RCA: 43] [Impact Index Per Article: 8.6] [Received: 02/07/2020] [Accepted: 06/03/2020] [Indexed: 12/09/2022] Open
Abstract
With increasing biobanking efforts connecting electronic health records and national registries to germline genetics, the time-to-event data analysis has attracted increasing attention in the genetics studies of human diseases. In time-to-event data analysis, the Cox proportional hazards (PH) regression model is one of the most used approaches. However, existing methods and tools are not scalable when analyzing a large biobank with hundreds of thousands of samples and endpoints, and they are not accurate when testing low-frequency and rare variants. Here, we propose a scalable and accurate method, SPACox (a saddlepoint approximation implementation based on the Cox PH regression model), that is applicable for genome-wide scale time-to-event data analysis. SPACox requires fitting a Cox PH regression model only once across the genome-wide analysis and then uses a saddlepoint approximation (SPA) to calibrate the test statistics. Simulation studies show that SPACox is 76-252 times faster than other existing alternatives, such as gwasurvivr, 185-511 times faster than the standard Wald test, and more than 6,000 times faster than the Firth correction and can control type I error rates at the genome-wide significance level regardless of minor allele frequencies. Through the analysis of UK Biobank inpatient data of 282,871 white British European ancestry samples, we show that SPACox can efficiently analyze large sample sizes and accurately control type I error rates. We identified 611 loci associated with time-to-event phenotypes of 12 common diseases, of which 38 loci would be missed within a logistic regression framework with a binary phenotype defined as event occurrence status during the follow-up period.
|
17
|
Helzlsouer K, Meerzaman D, Taplin S, Dunn BK. Humanizing Big Data: Recognizing the Human Aspect of Big Data. Front Oncol 2020; 10:186. [PMID: 32231993 PMCID: PMC7082327 DOI: 10.3389/fonc.2020.00186] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.4] [Received: 06/21/2019] [Accepted: 02/04/2020] [Indexed: 11/28/2022] Open
Abstract
The term “big data” refers broadly to large volumes of data, often gathered from several sources, that are then analyzed, for example, for predictive analytics. Combining and mining genetic data from varied sources including clinical genetic testing, for example, electronic health records, what might be termed as “recreational” genetic testing such as ancestry testing, as well as research studies, provide one type of “big data.” Challenges and cautions in analyzing big data include recognizing the lack of systematic collection of the source data, the variety of assay technologies used, and the potential variation in classification and interpretation of genetic variants. While advanced technologies such as microarrays and, more recently, next-generation sequencing, that enable testing an individual's DNA for thousands of genes and variants simultaneously are briefly discussed, attention is focused more closely on challenges to analysis of the massive data generated by these genomic technologies. The main theme of this review is to evaluate challenges associated with big data in general and specifically to bring the sophisticated technology of genetic/genomic testing down to the individual level, keeping in mind the human aspect of the data source and considering where the impact of the data will be translated and applied. Considerations in this “humanizing” process include providing adequate counseling and consent for genetic testing in all settings, as well as understanding the strengths and limitations of assays and their interpretation.
Affiliation(s)
- Kathy Helzlsouer
  - Division of Cancer Control and Population Sciences, National Cancer Institute, Bethesda, MD, United States
- Daoud Meerzaman
  - Center for Biomedical Informatics and Information Technology, National Cancer Institute, Bethesda, MD, United States
- Stephen Taplin
  - Center for Global Health, National Cancer Institute, Bethesda, MD, United States
- Barbara K Dunn
  - Division of Cancer Prevention, National Cancer Institute, Bethesda, MD, United States
|
18
|
Privé F, Vilhjálmsson BJ, Aschard H, Blum MGB. Making the Most of Clumping and Thresholding for Polygenic Scores. Am J Hum Genet 2019; 105:1213-1221. [PMID: 31761295 PMCID: PMC6904799 DOI: 10.1016/j.ajhg.2019.11.001] [Citation(s) in RCA: 116] [Impact Index Per Article: 19.3] [Received: 07/09/2019] [Accepted: 10/28/2019] [Indexed: 12/19/2022] Open
Abstract
Polygenic prediction has the potential to contribute to precision medicine. Clumping and thresholding (C+T) is a widely used method to derive polygenic scores. When using C+T, several p value thresholds are tested to maximize predictive ability of the derived polygenic scores. Along with this p value threshold, we propose to tune three other hyper-parameters for C+T. We implement an efficient way to derive thousands of different C+T scores corresponding to a grid over four hyper-parameters. For example, it takes a few hours to derive 123K different C+T scores for 300K individuals and 1M variants using 16 physical cores. We find that optimizing over these four hyper-parameters improves the predictive performance of C+T in both simulations and real data applications as compared to tuning only the p value threshold. A particularly large increase can be noted when predicting depression status, from an AUC of 0.557 (95% CI: [0.544-0.569]) when tuning only the p value threshold to an AUC of 0.592 (95% CI: [0.580-0.604]) when tuning all four hyper-parameters we propose for C+T. We further propose stacked clumping and thresholding (SCT), a polygenic score that results from stacking all derived C+T scores. Instead of choosing one set of hyper-parameters that maximizes prediction in some training set, SCT learns an optimal linear combination of all C+T scores by using an efficient penalized regression. We apply SCT to eight different case-control diseases in the UK biobank data and find that SCT substantially improves prediction accuracy with an average AUC increase of 0.035 over standard C+T.
Affiliation(s)
- Florian Privé
  - Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, CNRS, La Tronche, France; Department of Economics and Business Economics, National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark.
- Bjarni J Vilhjálmsson
  - Department of Economics and Business Economics, National Centre for Register-Based Research, Aarhus University, Aarhus, Denmark
- Hugues Aschard
  - Centre de Bioinformatique, Biostatistique et Biologie Intégrative (C3BI), Institut Pasteur, Paris, France
- Michael G B Blum
  - Laboratoire TIMC-IMAG, UMR 5525, Univ. Grenoble Alpes, CNRS, La Tronche, France.
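Note on entry 18: the abstract describes clumping and thresholding (C+T) as a way to derive polygenic scores. The following is a minimal Python sketch of the basic idea only, assuming just two tuning knobs, a p-value threshold and a pairwise r² threshold. The abstract does not name the other hyper-parameters the authors tune, and real implementations also restrict clumping to a genomic window, which is omitted here. All function names and toy inputs are hypothetical.

```python
def clump_and_threshold(p_values, r2, p_threshold, r2_threshold):
    """Return indices of variants kept after p-value thresholding and
    greedy clumping (most significant variant first)."""
    order = sorted(range(len(p_values)), key=lambda j: p_values[j])
    kept = []
    for j in order:
        if p_values[j] > p_threshold:
            break  # every remaining variant is less significant
        # Keep the variant only if it is not in high LD with one already kept.
        if all(r2[j][k] < r2_threshold for k in kept):
            kept.append(j)
    return kept


def polygenic_score(genotypes, betas, kept):
    """Sum effect size times allele count over the kept variants, per person."""
    return [sum(betas[j] * person[j] for j in kept) for person in genotypes]


# Hypothetical toy inputs: 4 variants, 2 individuals (not real data).
p_values = [1e-9, 1e-6, 0.03, 0.4]
betas = [0.20, 0.18, -0.05, 0.01]
r2 = [
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.1],
    [0.0, 0.0, 0.1, 1.0],
]
genotypes = [[2, 2, 0, 1], [0, 1, 1, 2]]  # allele counts per variant

kept = clump_and_threshold(p_values, r2, p_threshold=0.05, r2_threshold=0.2)
scores = polygenic_score(genotypes, betas, kept)
print(kept, scores)
```

In this toy example variant 1 is dropped because it is strongly correlated with the more significant variant 0, and variant 3 is dropped by the p-value threshold. Tuning a grid, as the abstract describes, would amount to repeating this over many threshold combinations.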
|