1
Han SH, Camp SY, Chu H, Collins R, Gillani R, Park J, Bakouny Z, Ricker CA, Reardon B, Moore N, Kofman E, Labaki C, Braun D, Choueiri TK, AlDubayan SH, Van Allen EM. Integrative Analysis of Germline Rare Variants in Clear and Non-clear Cell Renal Cell Carcinoma. EUR UROL SUPPL 2024; 62:107-122. [PMID: 38496821 PMCID: PMC10940785 DOI: 10.1016/j.euros.2024.02.006] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Accepted: 02/12/2024] [Indexed: 03/19/2024]
Abstract
Background and objective: Previous germline studies on renal cell carcinoma (RCC) have usually pooled clear and non-clear cell RCCs and have not adequately accounted for population stratification, which might have led to an inaccurate estimation of genetic risk. Here, we aim to analyze the major germline drivers of RCC risk and clinically relevant but underexplored germline variant types.
Methods: We first characterized germline pathogenic variants (PVs), cryptic splice variants, and copy number variants (CNVs) in 1436 unselected RCC patients. To evaluate the enrichment of PVs in RCC, we conducted a case-control study of 1356 RCC patients ancestry-matched with 16 512 cancer-free controls using approaches accounting for population stratification and histological subtypes, followed by characterization of secondary somatic events.
Key findings and limitations: Clear cell RCC patients (n = 976) exhibited a significant burden of PVs in VHL compared with controls (odds ratio [OR]: 39.1, p = 4.95e-05). Non-clear cell RCC patients (n = 380) carried enrichment of PVs in FH (OR: 77.9, p = 1.55e-08) and MET (OR: 1.98e11, p = 2.07e-05). In a CHEK2-focused analysis with European participants, clear cell RCC (n = 906) harbored nominal enrichment of the low-penetrance CHEK2 variants p.Ile157Thr (OR: 1.84, p = 0.049) and p.Ser428Phe (OR: 5.20, p = 0.045), while non-clear cell RCC (n = 295) exhibited nominal enrichment of CHEK2 loss-of-function PVs (OR: 3.51, p = 0.033). Patients with germline PVs in FH, MET, and VHL exhibited a significantly earlier age of cancer onset than patients without germline PVs (mean: 46.0 vs 60.2 yr, p < 0.0001), and more than half had secondary somatic events affecting the same gene (n = 10/15, 66.7%). Conversely, CHEK2 PV carriers exhibited a similar age of onset to patients without germline PVs (mean: 60.1 vs 60.2 yr, p = 0.99), and only 30.4% carried somatic events in CHEK2 (n = 7/23). Finally, pathogenic germline cryptic splice variants were identified in SDHA and TSC1, and pathogenic germline CNVs were found in 18 patients, including CNVs in FH, SDHA, and VHL.
Conclusions and clinical implications: This analysis supports the existing link between several RCC risk genes and RCC risk manifesting in earlier age of onset. It calls for caution when assessing the role of CHEK2 due to the burden of founder variants with varying population frequency. It also broadens the definition of the RCC germline landscape of pathogenicity to incorporate previously understudied types of germline variants.
Patient summary: In this study, we carefully compared the frequency of rare inherited mutations with a focus on patients' genetic ancestry. We discovered that subtle variations in genetic background may confound a case-control analysis, especially in evaluating the cancer risk associated with specific genes, such as CHEK2. We also identified previously less explored forms of rare inherited mutations, which could potentially increase the risk of kidney cancer.
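For readers unfamiliar with how a carrier-enrichment odds ratio is obtained, the following is a minimal sketch of an unadjusted 2x2 calculation with a one-sided Fisher exact test. It is not the authors' method: the abstract describes approaches that account for population stratification, which this sketch does not. The function names and the carrier counts are hypothetical; only the cohort sizes (976 and 16 512) and the somatic-event fractions come from the abstract.

```python
from math import comb


def odds_ratio(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """Unadjusted odds ratio from a 2x2 carrier table.

    Adds 0.5 to every cell (Haldane-Anscombe correction) when any cell
    is zero, so the estimate stays finite.
    """
    a = case_carriers
    b = case_total - case_carriers
    c = ctrl_carriers
    d = ctrl_total - ctrl_carriers
    if 0 in (a, b, c, d):
        a, b, c, d = (x + 0.5 for x in (a, b, c, d))
    return (a * d) / (b * c)


def fisher_one_sided(case_carriers, case_total, ctrl_carriers, ctrl_total):
    """One-sided Fisher exact p-value: probability of seeing at least
    this many carriers among cases, given the table margins."""
    carriers = case_carriers + ctrl_carriers
    n = case_total + ctrl_total
    upper = min(case_total, carriers)
    tail = sum(
        comb(case_total, k) * comb(ctrl_total, carriers - k)
        for k in range(case_carriers, upper + 1)
    )
    return tail / comb(n, carriers)


# Hypothetical carrier counts (5 cases, 2 controls); cohort sizes from the abstract.
print(odds_ratio(5, 976, 2, 16512))
print(fisher_one_sided(5, 976, 2, 16512))

# Fractions of carriers with a somatic event in the same gene, as reported.
print(round(10 / 15 * 100, 1), round(7 / 23 * 100, 1))
```

The zero-cell correction is one common convention; regression-based estimators handle empty cells differently and can return very large values.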
Affiliation(s)
- Seung Hun Han
- Ph.D. Program in Biological and Biomedical Sciences, Harvard Medical School, Boston, MA, USA
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Sabrina Y. Camp
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Hoyin Chu
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ryan Collins
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Medicine, Harvard Medical School, Boston, MA, USA
- Riaz Gillani
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Department of Pediatric Oncology, Dana-Farber Cancer Institute, Boston, MA, USA
- Department of Pediatrics, Harvard Medical School, Boston, MA, USA
- Boston Children’s Hospital, Boston, MA, USA
- Jihye Park
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Ziad Bakouny
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Cora A. Ricker
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Brendan Reardon
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Nicholas Moore
- Department of Therapeutic Radiology, Yale School of Medicine, New Haven, CT, USA
- Eric Kofman
- Department of Cellular and Molecular Medicine, University of California San Diego, La Jolla, CA, USA
- Chris Labaki
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- David Braun
- Center of Molecular and Cellular Oncology, Yale School of Medicine, New Haven, CT, USA
- Toni K. Choueiri
- Lank Center for Genitourinary Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Brigham and Women’s Hospital, Boston, MA, USA
- Saud H. AlDubayan
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Division of Genetics, Brigham and Women’s Hospital, Boston, MA, USA
- College of Medicine, King Saud bin Abdulaziz University for Health Sciences, Riyadh, Saudi Arabia
- Eliezer M. Van Allen
- Department of Medical Oncology, Dana-Farber Cancer Institute, Harvard Medical School, Boston, MA, USA
- Cancer Program, The Broad Institute of MIT and Harvard, Cambridge, MA, USA
- Center for Cancer Genomics, Dana-Farber Cancer Institute, Boston, MA, USA
2
Peng W, Fu C, Shu S, Wang G, Wang H, Yue B, Zhang M, Liu X, Liu Y, Zhang J, Zhong J, Wang J. Whole-genome resequencing of major populations revealed domestication-related genes in yaks. BMC Genomics 2024; 25:69. [PMID: 38233755 PMCID: PMC10795378 DOI: 10.1186/s12864-024-09993-7] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/04/2023] [Accepted: 01/08/2024] [Indexed: 01/19/2024]
Abstract
BACKGROUND: The yak is a symbol of the Qinghai-Tibet Plateau and provides important basic resources for human life on the plateau. Domestic yaks have been subjected to strong artificial selection and environmental pressures over the long term. Understanding the molecular mechanisms of phenotypic differences in yak populations can reveal key functional genes involved in the domestication process and improve genetic breeding.
MATERIALS AND METHODS: Here, we re-sequenced 80 yaks (Maiwa, Yushu, and Huanhu populations) to identify single-nucleotide polymorphisms (SNPs) as genetic variants. After filtering and quality control, the remaining SNPs were used to identify genome-wide regions of selective sweeps associated with domestication traits. Four methods (π, XPEHH, iHS, and XP-nSL) were used to detect genetic differentiation among the populations.
RESULTS: By comparing the differences in population stratification, linkage disequilibrium decay rate, and characteristic selective sweep signals, we identified 203 putative selective regions for domestication traits, 45 of which were mapped to 27 known genes. They were clustered into 4 major GO biological process terms. The known genes were associated with seven major domestication traits: dwarfism (ANKRD28), milk (HECW1, HECW2, and OSBPL2), meat (SPATA5 and GRHL2), fertility (BTBD11 and ARFIP1), adaptation (NCKAP5, ANTXR1, LAMA5, OSBPL2, AOC2, and RYR2), growth (GRHL2, GRID2, SMARCAL1, and EPHB2), and the immune system (INPP5D and ADCYAP1R1).
CONCLUSIONS: We showed that there are clear genetic differences in the domestication process among these three yak populations. Our findings improve the understanding of the major genetic switches and domestication processes among yak populations.
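Of the four statistics named in the methods, nucleotide diversity (π) is the simplest to illustrate: it is the average number of pairwise differences per site among sampled sequences. The sketch below assumes aligned, equal-length haplotype strings and is only a toy illustration of the definition; the function name and example sequences are hypothetical, and genome-scale analyses typically compute π in sliding windows from variant calls rather than from raw strings.

```python
from itertools import combinations


def nucleotide_diversity(haplotypes):
    """Mean pairwise differences per site among aligned haplotypes."""
    if len(haplotypes) < 2:
        raise ValueError("need at least two haplotypes")
    length = len(haplotypes[0])
    if any(len(h) != length for h in haplotypes):
        raise ValueError("haplotypes must be aligned to the same length")
    total_diffs = sum(
        sum(x != y for x, y in zip(s, t))
        for s, t in combinations(haplotypes, 2)
    )
    n_pairs = len(haplotypes) * (len(haplotypes) - 1) // 2
    return total_diffs / (n_pairs * length)


# Toy window of three haplotypes, four sites each (hypothetical data).
window = ["AAAA", "AAAT", "AATT"]
print(nucleotide_diversity(window))
```

A marked drop in π within one population relative to another over the same window is one of the signals commonly read as a candidate selective sweep.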
Affiliation(s)
- Wei Peng
- Qinghai Academy of Animal Science and Veterinary Medicine, Qinghai University, Xining, 810016, China
- Changqi Fu
- Qinghai Academy of Animal Science and Veterinary Medicine, Qinghai University, Xining, 810016, China
- Shi Shu
- Qinghai Academy of Animal Science and Veterinary Medicine, Qinghai University, Xining, 810016, China
- Guowen Wang
- Qinghai Academy of Animal Science and Veterinary Medicine, Qinghai University, Xining, 810016, China
- Hui Wang
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China
- Binglin Yue
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China
- Ming Zhang
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China
- Xinrui Liu
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China
- Yaxin Liu
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China
- Jun Zhang
- Qinghai Academy of Animal Science and Veterinary Medicine, Qinghai University, Xining, 810016, China.
- Jincheng Zhong
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China.
- Jiabo Wang
- Key Laboratory of Qinghai-Tibetan Plateau Animal Genetic Resource Reservation and Utilization (Sichuan Province and Ministry of Education), Southwest Minzu University, Chengdu, 610041, China.
3
Wendt FR, Pathak GA, Vahey J, Qin X, Koller D, Cabrera-Mendoza B, Haeny A, Harrington KM, Rajeevan N, Duong LM, Levey DF, De Angelis F, De Lillo A, Bigdeli TB, Pyarajan S, Gaziano JM, Gelernter J, Aslan M, Provenzale D, Helmer DA, Hauser ER, Polimanti R. Modeling the longitudinal changes of ancestry diversity in the Million Veteran Program. Hum Genomics 2023; 17:46. [PMID: 37268996 PMCID: PMC10239111 DOI: 10.1186/s40246-023-00487-3] [Citation(s) in RCA: 3] [Impact Index Per Article: 3.0] [Received: 09/06/2022] [Accepted: 05/05/2023] [Indexed: 06/04/2023]
Abstract
BACKGROUND: The Million Veteran Program (MVP) participants represent 100 years of US history, including significant social and demographic changes over time. Our study assessed two aspects of the MVP: (i) longitudinal changes in population diversity and (ii) how these changes can be accounted for in genome-wide association studies (GWAS). To investigate these aspects, we divided MVP participants into five birth cohorts (N-range = 123,888 [born from 1943 to 1947] to 136,699 [born from 1948 to 1953]).
RESULTS: Ancestry groups were defined by (i) HARE (harmonized ancestry and race/ethnicity) and (ii) a random-forest clustering approach using the 1000 Genomes Project and the Human Genome Diversity Project (1kGP + HGDP) reference panels (77 world populations representing six continental groups). In these groups, we performed GWASs of height, a trait potentially affected by population stratification. Birth cohorts demonstrate important trends in ancestry diversity over time. More recent HARE-assigned Europeans, Africans, and Hispanics had lower European ancestry proportions than older birth cohorts (0.010 < Cohen's d < 0.259, p < 7.80 × 10⁻⁴). Conversely, HARE-assigned East Asians showed an increase in European ancestry proportion over time. In GWAS of height using HARE assignments, genomic inflation due to population stratification was prevalent across all birth cohorts (linkage disequilibrium score regression intercept = 1.08 ± 0.042). The 1kGP + HGDP-based ancestry assignment significantly reduced the population stratification confounding in the GWAS statistics (mean intercept reduction = 0.045 ± 0.007, p < 0.05).
CONCLUSIONS: This study provides a characterization of ancestry diversity of the MVP cohort over time and compares two strategies to infer genetically defined ancestry groups by assessing differences in controlling population stratification in genome-wide association studies.
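The effect sizes for cohort differences in ancestry proportion are reported as Cohen's d, the difference in group means divided by the pooled standard deviation. A minimal sketch of that formula follows; the function name and the toy ancestry proportions are hypothetical and are not taken from the study.

```python
from math import sqrt
from statistics import mean, variance


def cohens_d(group_a, group_b):
    """Cohen's d: mean difference scaled by the pooled sample standard deviation."""
    n_a, n_b = len(group_a), len(group_b)
    pooled_var = (
        (n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)
    ) / (n_a + n_b - 2)
    return (mean(group_a) - mean(group_b)) / sqrt(pooled_var)


# Hypothetical European-ancestry proportions in an older and a more recent birth cohort.
older = [0.98, 0.97, 0.99, 0.96]
recent = [0.95, 0.97, 0.94, 0.96]
print(cohens_d(older, recent))
```

With cohorts of more than 100,000 participants, even very small values of d can reach the significance levels quoted in the abstract, which is why the effect size is reported alongside the p-value.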
Affiliation(s)
- Frank R Wendt
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Gita A Pathak
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Jacqueline Vahey
- Durham VA Medical Center, Durham, NC, USA
- Duke University, Carmichael Building, 300 N Duke St, Durham, NC, 27701, USA
- Xuejun Qin
- Durham VA Medical Center, Durham, NC, USA
- Duke University, Carmichael Building, 300 N Duke St, Durham, NC, 27701, USA
- Dora Koller
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Brenda Cabrera-Mendoza
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Angela Haeny
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- Kelly M Harrington
- Massachusetts Veterans Epidemiology Research and Information Center, VA Boston Healthcare System, Boston, MA, USA
- Department of Psychiatry, Boston University School of Medicine, Boston, MA, USA
- Nallakkandi Rajeevan
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Yale Center for Medical Informatics, Yale School of Medicine, New Haven, CT, USA
- Linh M Duong
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Daniel F Levey
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Flavio De Angelis
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Tim B Bigdeli
- SUNY Downstate Health Sciences University, Brooklyn, NY, USA
- VA New York Harbor Healthcare System, Brooklyn, NY, USA
- Saiju Pyarajan
- Massachusetts Area Veterans Epidemiology, Research, and Information Center (MAVERIC), Jamaica Plain, MA, USA
- VA Cooperative Studies Program, VA Boston Healthcare System, Boston, MA, USA
- Department of Medicine, Brigham and Women's Hospital and Harvard School of Medicine, Boston, MA, USA
- John Michael Gaziano
- VA Cooperative Studies Program, VA Boston Healthcare System, Boston, MA, USA
- Department of Medicine, Brigham and Women's Hospital and Harvard School of Medicine, Boston, MA, USA
- Joel Gelernter
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA
- Department of Genetics, Yale School of Medicine, New Haven, CT, USA
- Department of Neuroscience, Yale School of Medicine, New Haven, CT, USA
- Department of Psychiatry, VA CT Healthcare System, West Haven, CT, USA
- Mihaela Aslan
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA
- Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
- Dawn Provenzale
- Durham VA Medical Center, Durham, NC, USA
- Duke University, Carmichael Building, 300 N Duke St, Durham, NC, 27701, USA
- Department of Internal Medicine, Yale School of Medicine, New Haven, CT, USA
- Drew A Helmer
- Center for Innovations in Quality, Effectiveness and Safety, Michael E. DeBakey VA Medical Center, Houston, TX, USA
- Department of Medicine, Baylor College of Medicine, Houston, TX, USA
- Elizabeth R Hauser
- Durham VA Medical Center, Durham, NC, USA.
- Duke University, Carmichael Building, 300 N Duke St, Durham, NC, 27701, USA.
- Renato Polimanti
- Department of Psychiatry, Yale School of Medicine, New Haven, CT, USA.
- VA Cooperative Studies Program Clinical Epidemiology Research Center (CSP-CERC), VA CT Healthcare System, VA CT 116A2, 950 Campbell Avenue, West Haven, CT, 06516, USA.
4
Solovieva E, Sakai H. PSReliP: an integrated pipeline for analysis and visualization of population structure and relatedness based on genome-wide genetic variant data. BMC Bioinformatics 2023; 24:135. [PMID: 37020193 PMCID: PMC10074814 DOI: 10.1186/s12859-023-05169-4] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/27/2022] [Accepted: 02/02/2023] [Indexed: 04/07/2023]
Abstract
BACKGROUND: Population structure and cryptic relatedness between individuals (samples) are two major factors affecting false positives in genome-wide association studies (GWAS). In addition, population stratification and genetic relatedness in genomic selection in animal and plant breeding can affect prediction accuracy. The methods commonly used for solving these problems are principal component analysis (to adjust for population stratification) and marker-based kinship estimates (to correct for the confounding effects of genetic relatedness). Currently, many software tools are available that analyze genetic variation among individuals to determine population structure and genetic relationships. However, none of these tools or pipelines performs such analyses in a single workflow and visualizes all the various results in a single interactive web application.
RESULTS: We developed PSReliP, a standalone, freely available pipeline for the analysis and visualization of population structure and relatedness between individuals in a user-specified genetic variant dataset. The analysis stage of PSReliP is responsible for executing all steps of data filtering and analysis and contains an ordered sequence of commands from PLINK, a whole-genome association analysis toolset, along with in-house shell scripts and Perl programs that support data pipelining. The visualization stage is provided by Shiny apps, an R-based interactive web application. In this study, we describe the characteristics and features of PSReliP and demonstrate how it can be applied to real genome-wide genetic variant data.
CONCLUSIONS: The PSReliP pipeline allows users to quickly analyze genetic variants such as single nucleotide polymorphisms and small insertions or deletions at the genome level to estimate population structure and cryptic relatedness using PLINK software and to visualize the analysis results in interactive tables, plots, and charts using Shiny technology. The analysis and assessment of population stratification and genetic relatedness can aid in choosing an appropriate approach for the statistical analysis of GWAS data and predictions in genomic selection. The various outputs from PLINK can be used for further downstream analysis. The code and manual for PSReliP are available at https://github.com/solelena/PSReliP.
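To make "marker-based kinship estimates" concrete, the sketch below computes a standardized genomic relationship matrix from allele-count genotypes. This is a generic textbook estimator written in plain Python, shown under the assumption that genotypes are coded 0/1/2 with no missing values; it is not PSReliP's code, and it does not claim to reproduce the exact output of any particular PLINK command. The function name and toy genotypes are hypothetical.

```python
def genomic_relationship_matrix(genotypes):
    """Standardized relationship matrix from allele counts.

    genotypes[i][m] is the count (0, 1, or 2) of the reference allele for
    individual i at SNP m. Each SNP is centred by 2p and scaled by
    2p(1-p), where p is the sample allele frequency; monomorphic SNPs
    are skipped because their scale is zero.
    """
    n = len(genotypes)
    n_snps = len(genotypes[0])
    matrix = [[0.0] * n for _ in range(n)]
    used = 0
    for m in range(n_snps):
        column = [g[m] for g in genotypes]
        p = sum(column) / (2 * n)
        if p == 0.0 or p == 1.0:
            continue
        used += 1
        scale = 2 * p * (1 - p)
        centred = [x - 2 * p for x in column]
        for i in range(n):
            for j in range(n):
                matrix[i][j] += centred[i] * centred[j] / scale
    if used == 0:
        raise ValueError("no polymorphic SNPs")
    return [[value / used for value in row] for row in matrix]


# Toy data: two individuals with opposite homozygous genotypes at two SNPs.
print(genomic_relationship_matrix([[0, 2], [2, 0]]))
```

Off-diagonal entries well above zero flag pairs that share more alleles than expected from population frequencies, which is the kind of cryptic relatedness the abstract refers to.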
Affiliation(s)
- Elena Solovieva
- Research Center for Advanced Analysis, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, Japan
- Hiroaki Sakai
- Research Center for Advanced Analysis, National Agriculture and Food Research Organization, Tsukuba, Ibaraki, Japan.
5
Xia X, Zhang Y, Wei Y, Wang MH. Statistical Methods for Disease Risk Prediction with Genotype Data. Methods Mol Biol 2023; 2629:331-347. [PMID: 36929084 DOI: 10.1007/978-1-0716-2986-4_15] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/18/2023]
Abstract
The single-nucleotide polymorphism (SNP) is the basic unit for understanding the heritability of complex traits. One attractive application of susceptibility SNPs is to construct prediction models for assessing disease risk. Here, we introduce prediction methods for human traits using SNP data, including the polygenic risk score (PRS), linear mixed models (LMMs), penalized regressions, and methods for controlling population stratification.
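In its simplest form, a polygenic risk score is a weighted sum of an individual's risk-allele dosages, with per-SNP weights taken from an association study. The following minimal sketch shows that sum only; the function name, SNP identifiers, and weights are hypothetical, and steps such as SNP selection, LD handling, and ancestry adjustment are deliberately left out.

```python
def polygenic_risk_score(weights, dosages):
    """Weighted sum of risk-allele dosages.

    weights maps SNP id -> effect size (for example, a log odds ratio);
    dosages maps SNP id -> risk-allele dosage in [0, 2] for one person.
    SNPs missing from the person's dosages are skipped.
    """
    return sum(
        beta * dosages[snp] for snp, beta in weights.items() if snp in dosages
    )


# Hypothetical effect sizes and one individual's dosages.
weights = {"rsA": 0.2, "rsB": -0.1, "rsC": 0.05}
person = {"rsA": 2, "rsB": 1}
print(polygenic_risk_score(weights, person))
```

Skipping missing SNPs is one possible convention; imputing the expected dosage from the allele frequency is a common alternative and changes the score's scale.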
Affiliation(s)
- Xiaoxuan Xia
- JC School of Public Health and Primary Care, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- Department of Statistics, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- Yingying Wei
- Department of Statistics, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong
- Maggie Haitian Wang
- JC School of Public Health and Primary Care, the Chinese University of Hong Kong (CUHK), Shatin, Hong Kong.
- CUHK Shenzhen Institute, Shenzhen, China.
6
Onifade M, Roy-Gagnon MH, Parent MÉ, Burkett KM. Comparison of mixed model based approaches for correcting for population substructure with application to extreme phenotype sampling. BMC Genomics 2022; 23:98. [PMID: 35120458 PMCID: PMC8815214 DOI: 10.1186/s12864-022-08297-y] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/05/2021] [Accepted: 01/06/2022] [Indexed: 11/10/2022]
Abstract
Background: Mixed models are used to correct for confounding due to population stratification and hidden relatedness in genome-wide association studies. This class of models includes linear mixed models and generalized linear mixed models. Existing mixed model approaches to correct for population substructure have been previously investigated with both continuous and case-control response variables. However, they have not been investigated in the context of extreme phenotype sampling (EPS), where genetic covariates are only collected on samples having extreme response variable values. In this work, we compare the performance of existing binary trait mixed model approaches (GMMAT, LEAP and CARAT) on EPS data. Since linear mixed models are commonly used even with binary traits, we also evaluate the performance of a popular linear mixed model implementation (GEMMA).
Results: We used simulation studies to estimate the type I error rate and power of all approaches assuming a population with substructure. Our simulation results show that for a common candidate variant, both LEAP and GMMAT control the type I error rate while CARAT’s rate remains inflated. We applied all methods to a real dataset from a Québec, Canada, case-control study that is known to have population substructure. We observe similar type I error control with the analysis on the Québec dataset. For rare variants, the false positive rate remains inflated even after correction with mixed model approaches. For methods that control the type I error rate, the estimated power is comparable.
Conclusions: The methods compared in this study differ in their type I error control. Therefore, when data are from an EPS study, care should be taken to ensure that the models underlying the methodology are suitable to the sampling strategy and to the minor allele frequency of the candidate SNPs.
Supplementary Information: The online version contains supplementary material available at 10.1186/s12864-022-08297-y.
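Simulation-based type I error estimation, as described in the results, comes down to counting how often a method rejects at level alpha when data are generated under the null. The sketch below shows that bookkeeping and a simple check for inflation using a normal-approximation interval. It assumes the null p-values have already been produced by whichever method is being evaluated; the function names and example p-values are hypothetical, and the interval rule is an illustrative choice rather than the criterion used in the paper.

```python
from math import sqrt


def type1_error_rate(null_pvalues, alpha=0.05):
    """Fraction of null replicates rejected at significance level alpha."""
    return sum(p <= alpha for p in null_pvalues) / len(null_pvalues)


def wald_interval(rate, n, z=1.96):
    """Approximate 95% interval for a proportion estimated from n replicates."""
    half_width = z * sqrt(rate * (1 - rate) / n)
    return max(0.0, rate - half_width), min(1.0, rate + half_width)


def is_inflated(null_pvalues, alpha=0.05):
    """True when the lower interval bound on the empirical rate exceeds alpha."""
    rate = type1_error_rate(null_pvalues, alpha)
    lower, _ = wald_interval(rate, len(null_pvalues))
    return lower > alpha


# Hypothetical p-values from 100 null replicates: 20 rejections at alpha = 0.05.
pvalues = [0.01] * 20 + [0.5] * 80
print(type1_error_rate(pvalues), is_inflated(pvalues))
```

The Wald interval is poorly calibrated when the number of replicates is small or the rate is near zero, so an exact binomial interval may be preferable in those settings.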
Affiliation(s)
- Maryam Onifade
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada
- Marie-Élise Parent
- Centre Armand-Frappier Santé Biotechnologie, Institut national de la recherche scientifique, Université du Québec, Laval, Canada
- Kelly M Burkett
- Department of Mathematics and Statistics, University of Ottawa, Ottawa, Canada.
7
Xu Y, Vuckovic D, Ritchie SC, Akbari P, Jiang T, Grealey J, Butterworth AS, Ouwehand WH, Roberts DJ, Di Angelantonio E, Danesh J, Soranzo N, Inouye M. Machine learning optimized polygenic scores for blood cell traits identify sex-specific trajectories and genetic correlations with disease. Cell Genom 2022; 2:100086. [PMID: 35072137 PMCID: PMC8758502 DOI: 10.1016/j.xgen.2021.100086] [Citation(s) in RCA: 4] [Impact Index Per Article: 2.0] [Received: 10/20/2020] [Revised: 08/24/2021] [Accepted: 12/13/2021] [Indexed: 12/13/2022]
Abstract
Genetic association studies for blood cell traits, which are key indicators of health and immune function, have identified several hundred associations and defined a complex polygenic architecture. Polygenic scores (PGSs) for blood cell traits have potential clinical utility in disease risk prediction and prevention, but designing PGS remains challenging and the optimal methods are unclear. To address this, we evaluated the relative performance of 6 methods to develop PGS for 26 blood cell traits, including a standard method of pruning and thresholding (P + T) and 5 learning methods: LDpred2, elastic net (EN), Bayesian ridge (BR), multilayer perceptron (MLP) and convolutional neural network (CNN). We evaluated these optimized PGSs on blood cell trait data from UK Biobank and INTERVAL. We find that PGSs designed using common machine learning methods EN and BR show improved prediction of blood cell traits and consistently outperform other methods. Our analyses suggest EN/BR as the top choices for PGS construction, showing improved performance for 25 blood cell traits in the external validation, with correlations with the directly measured traits increasing by 10%-23%. Ten PGSs showed significant statistical interaction with sex, and sex-specific PGS stratification showed that all of them had substantial variation in the trajectories of blood cell traits with age. Genetic correlations between the PGSs for blood cell traits and common human diseases identified well-known as well as new associations. We develop machine learning-optimized PGS for blood cell traits, demonstrate their relationships with sex, age, and disease, and make these publicly available as a resource.
Affiliation(s)
- Yu Xu
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Dragana Vuckovic
- Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- Scott C. Ritchie
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Parsa Akbari
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- Tao Jiang
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Jason Grealey
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
- Department of Mathematics and Statistics, La Trobe University, Bundoora, VIC 3086, Australia
- Adam S. Butterworth
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- Willem H. Ouwehand
- Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- National Health Service (NHS) Blood and Transplant, Cambridge Biomedical Campus, Cambridge CB2 0PT, UK
- Department of Haematology, University of Cambridge, Cambridge CB2 0PT, UK
- David J. Roberts
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- National Health Service (NHS) Blood and Transplant, Cambridge Biomedical Campus, Cambridge CB2 0PT, UK
- National Institute for Health Research Oxford Biomedical Research Centre, University of Oxford and John Radcliffe Hospital, Oxford OX3 9DU, UK
- Emanuele Di Angelantonio
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Health Data Science Research Centre, Human Technopole, Milan 20157, Italy
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- John Danesh
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- Nicole Soranzo
- Department of Human Genetics, Wellcome Sanger Institute, Hinxton CB10 1SA, UK
- National Institute for Health Research Blood and Transplant Research Unit in Donor Health and Genomics, University of Cambridge, Cambridge CB1 8RN, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Michael Inouye
- Cambridge Baker Systems Genomics Initiative, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- Cambridge Baker Systems Genomics Initiative, Baker Heart and Diabetes Institute, Melbourne, VIC 3004, Australia
- British Heart Foundation Cardiovascular Epidemiology Unit, Department of Public Health and Primary Care, University of Cambridge, Cambridge CB1 8RN, UK
- British Heart Foundation Centre of Research Excellence, University of Cambridge, Cambridge CB1 8RN, UK
- Health Data Research UK Cambridge, Wellcome Genome Campus and University of Cambridge, Cambridge CB10 1SA, UK
- The Alan Turing Institute, London NW1 2DB, UK
8
Liu R, Yuan M, Xu XS, Yang Y. Fast and efficient correction for population stratification in multi-locus genome-wide association studies. Genetica 2021; 149:313-325. [PMID: 34480683 DOI: 10.1007/s10709-021-00129-3] [Received: 03/15/2021] [Accepted: 07/26/2021] [Indexed: 11/27/2022]
Abstract
Reducing false discoveries caused by population stratification (PS) has always been a challenge in genome-wide association studies (GWAS). The literature has established several single-marker approaches, including genomic control (GC), EIGENSTRAT, and the generalized linear mixed model association test (GMMAT), as well as multi-marker methods such as the LASSO mixed model (LASSOMM). However, the single-marker methods require prespecifying an arbitrary p value threshold in the selection process, likely resulting in suboptimal precision or recall. LASSOMM, on the other hand, appears to be extremely computationally intensive and may not be suitable for large-scale GWAS. In this paper, we propose a simple multi-marker approach (PCA-LASSO) that combines principal component analysis (PCA) and the least absolute shrinkage and selection operator (LASSO). We use PCA to correct for the confounding effects of PS and LASSO with built-in cross-validation for data-driven selection. Compared with the current single-marker approaches, PCA-LASSO provides an optimal balance between precision and recall, and consequently superior F1 scores. Similarly, compared with LASSOMM, PCA-LASSO markedly increases precision while minimizing the loss of recall, and therefore improves the overall F1 score in the presence of PS. More importantly, PCA-LASSO reduces the computational time by more than 1000-fold compared with LASSOMM. We applied PCA-LASSO to a real Alzheimer's disease dataset and successfully identified SNP rs429358 (gene APOE4), which has been widely reported to be associated with the onset and elevated risk of Alzheimer's disease. In conclusion, PCA-LASSO is a simple, fast, and accurate approach for GWAS in the presence of latent PS.
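The two-stage idea in this abstract (PCA to absorb stratification, then cross-validated LASSO to select markers) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the simulated data, the function names, the choice of two PCs, the penalty grid, and the use of residualization on the PCs (equivalent to leaving the PCs unpenalized in the LASSO) are all assumptions made for illustration.

```python
import numpy as np

def simulate(n_per_pop=200, m=100, seed=0):
    """Hypothetical data: two populations with shifted allele frequencies."""
    rng = np.random.default_rng(seed)
    p_a = rng.uniform(0.2, 0.6, size=m)
    p_b = p_a + 0.2                                   # second population is shifted
    G = np.vstack([rng.binomial(2, p_a, size=(n_per_pop, m)),
                   rng.binomial(2, p_b, size=(n_per_pop, m))]).astype(float)
    pop = np.repeat([0.0, 1.0], n_per_pop)
    # SNP 0 is causal; population membership confounds the phenotype.
    y = 1.0 * G[:, 0] + 2.0 * pop + rng.normal(size=2 * n_per_pop)
    return G, y

def top_pcs(G, k):
    """Top-k principal component scores of a samples x SNPs matrix."""
    Z = G - G.mean(axis=0)
    sd = Z.std(axis=0)
    Z = Z[:, sd > 0] / sd[sd > 0]
    U, S, _ = np.linalg.svd(Z, full_matrices=False)
    return U[:, :k] * S[:k]

def residualize(A, C):
    """Remove from A the part explained by covariates C plus an intercept."""
    C1 = np.column_stack([np.ones(len(C)), C])
    coef, *_ = np.linalg.lstsq(C1, A, rcond=None)
    return A - C1 @ coef

def lasso_cd(X, y, lam, n_iter=50):
    """Coordinate descent for (1/2n)||y - Xb||^2 + lam * ||b||_1."""
    n, p = X.shape
    b = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n
    r = y.copy()                                      # current residual
    for _ in range(n_iter):
        for j in range(p):
            if col_sq[j] == 0:
                continue
            r += X[:, j] * b[j]                       # drop SNP j from the fit
            rho = X[:, j] @ r / n
            b[j] = np.sign(rho) * max(abs(rho) - lam, 0.0) / col_sq[j]
            r -= X[:, j] * b[j]                       # add it back, updated
    return b

def cv_lambda(X, y, lambdas, k=5, seed=0):
    """Pick the penalty with the lowest k-fold prediction error."""
    folds = np.random.default_rng(seed).permutation(len(y)) % k
    errs = []
    for lam in lambdas:
        err = 0.0
        for f in range(k):
            tr, te = folds != f, folds == f
            b = lasso_cd(X[tr], y[tr], lam)
            err += np.mean((y[te] - X[te] @ b) ** 2)
        errs.append(err / k)
    return lambdas[int(np.argmin(errs))]

def pca_lasso(G, y, n_pcs=2, lambdas=(0.02, 0.05, 0.1, 0.2)):
    pcs = top_pcs(G, n_pcs)
    X, y_res = residualize(G, pcs), residualize(y, pcs)
    lam = cv_lambda(X, y_res, lambdas)
    return lasso_cd(X, y_res, lam), lam

G, y = simulate()
beta, lam = pca_lasso(G, y)
print("chosen penalty:", lam, "top SNP index:", int(np.argmax(np.abs(beta))))
```

Residualizing once on the full sample before cross-validation leaks a small amount of information across folds; a stricter version would recompute the PCs and residuals inside each fold.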
Affiliation(s)
- Rui Liu
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, 230026, Anhui, China
- Min Yuan
- Center for Data Science in Health, School of Public Health Administration, Anhui Medical University, Hefei, 230032, Anhui, China
- Yaning Yang
- Department of Statistics and Finance, University of Science and Technology of China, Hefei, 230026, Anhui, China.
9
Sanderson E, Richardson TG, Hemani G, Davey Smith G. The use of negative control outcomes in Mendelian randomization to detect potential population stratification. Int J Epidemiol 2021; 50:1350-1361. [PMID: 33570130 PMCID: PMC8407870 DOI: 10.1093/ije/dyaa288] [Accepted: 01/18/2021] [Indexed: 12/14/2022] [Open access]
Abstract
A key assumption of Mendelian randomization (MR) analysis is that there is no association between the genetic variants used as instruments and the outcome other than through the exposure of interest. One way in which this assumption can be violated is through population stratification, which can introduce confounding of the relationship between the genetic variants and the outcome and so induce an association between them. Negative control outcomes are increasingly used to detect unobserved confounding in observational epidemiological studies. Here we consider the use of negative control outcomes in MR studies to detect confounding of the genetic variants and the exposure or outcome. As a negative control outcome in an MR study, we propose the use of phenotypes which are determined before the exposure and outcome but which are likely to be subject to the same confounding as the exposure or outcome of interest. We illustrate our method with a two-sample MR analysis of a preselected set of exposures on self-reported tanning ability and hair colour. Our results show that, of the 33 exposures considered, genome-wide association studies (GWAS) of adiposity and education-related traits are likely to be subject to population stratification that is not controlled for through adjustment, and so any MR study including these traits may be subject to bias that cannot be identified through standard pleiotropy robust methods. Negative control outcomes should therefore be used regularly in MR studies to detect potential population stratification in the data used.
Affiliation(s)
- Eleanor Sanderson
- MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK; Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- Tom G Richardson
- MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK; Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- Gibran Hemani
- MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK; Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
- George Davey Smith
- MRC Integrative Epidemiology Unit at the University of Bristol, Bristol, UK; Population Health Sciences, Bristol Medical School, University of Bristol, Bristol, UK
10
Zhang W, Cheng L, Huang G. Towards fine-scale population stratification modeling based on kernel principal component analysis and random forest. Genes Genomics 2021. [PMID: 34097252 DOI: 10.1007/s13258-021-01057-4] [Received: 06/12/2020] [Accepted: 01/26/2021] [Indexed: 10/21/2022]
Abstract
BACKGROUND Population stratification modeling is essential in genome-wide association studies. OBJECTIVE In this paper, we aim to build a fine-scale population stratification model to efficiently infer individual genetic ancestry. METHODS Kernel principal component analysis (PCA) and random forest are adopted to build the population stratification model, together with parameter optimization. We explore different PCA methods, including standard PCA and kernel PCA, to extract relevant features from genotype data transformed by vcf2geno, a pipeline from the LASER software. The extracted features are fed into a random forest for ensemble learning. Parameter tuning is performed to jointly find the optimal number of principal components, the kernel function for PCA, and the parameters of the random forest. RESULTS Experiments on the HGDP dataset show that kernel PCA with a sigmoid or Gaussian kernel function can achieve higher prediction accuracy than standard PCA. Compared with standard PCA using two principal components, KPCA-Sigmoid with the optimal number of principal components improves accuracy by around 100% and 200% for East Asian and European populations, respectively. CONCLUSION With the optimal parameter configuration for both PCA and random forest, our proposed method can infer individual genetic ancestry more accurately from an individual's variants.
11
Abegaz F, Van Lishout F, Mahachie John JM, Chiachoompu K, Bhardwaj A, Duroux D, Gusareva ES, Wei Z, Hakonarson H, Van Steen K. Performance of model-based multifactor dimensionality reduction methods for epistasis detection by controlling population structure. BioData Min 2021; 14:16. [PMID: 33608043 PMCID: PMC7893746 DOI: 10.1186/s13040-021-00247-w] [Received: 06/29/2020] [Accepted: 02/07/2021] [Indexed: 12/15/2022] [Open access]
Abstract
Background In genome-wide association studies, the extent and impact of confounding due to population structure have been well recognized. Inadequate handling of such confounding is likely to lead to spurious associations, hampering replication and the identification of causal variants. Several strategies have been developed for protecting associations against confounding; the most popular is based on principal component analysis. In contrast, the extent and impact of confounding due to population structure in gene-gene interaction (epistasis) association studies are much less investigated and understood. In particular, the role of nonlinear genetic population substructure in epistasis detection is largely under-investigated, especially outside a regression framework. Methods To identify causal variants acting in synergy and to improve the interpretability and replicability of epistasis results, we introduce three strategies based on a model-based multifactor dimensionality reduction approach for structured populations, namely MBMDR-PC, MBMDR-PG, and MBMDR-GC. Results Simulation results comparing the performance of various approaches show that, in the presence of population structure, MBMDR-PC and MBMDR-PG control the type I error rate at the nominal level consistently better than MBMDR-GC. Moreover, our three proposed methods of population structure correction outperform MDR-SP in terms of statistical power. Conclusion We demonstrate through extensive simulation studies the effect of various degrees of genetic population structure and relatedness on epistasis detection and propose appropriate remedial measures based on linear and nonlinear sample genetic similarity.
Affiliation(s)
- Fentaw Abegaz
- GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium.
- Archana Bhardwaj
- GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Diane Duroux
- GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Elena S Gusareva
- GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium
- Zhi Wei
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA
- Hakon Hakonarson
- Center for Applied Genomics, The Children's Hospital of Philadelphia, Philadelphia, PA, USA; Department of Pediatrics, Division of Human Genetics, The Perelman School of Medicine, University of Pennsylvania, Philadelphia, PA, USA
- Kristel Van Steen
- GIGA-R, Medical Genomics - BIO3, University of Liège, Liège, Belgium; WELBIO (Walloon Excellence in Lifesciences and Biotechnology), University of Liège, Liège, Belgium
12
Liu CC, Shringarpure S, Lange K, Novembre J. Exploring Population Structure with Admixture Models and Principal Component Analysis. Methods Mol Biol 2020; 2090:67-86. [PMID: 31975164 DOI: 10.1007/978-1-0716-0199-0_4] [Indexed: 03/19/2023]
Abstract
Population structure is a commonplace feature of genetic variation data, and it is important in numerous application areas, including evolutionary genetics, conservation genetics, and human genetics. Understanding the structure in a sample is necessary before more sophisticated analyses are undertaken. Here we provide a protocol for running principal component analysis (PCA) and admixture proportion inference, two of the most commonly used approaches for describing population structure. Through hands-on examples with the CEPH-Human Genome Diversity Panel and pragmatic caveats, readers will learn to analyze and visualize population structure in their own data.
13
Richardson K. Polygenic scores are an even bigger social hazard: Commentary on: Baverstock, K. (2019) polygenic scores: Are they a public health hazard? Progress in Biophysics and Molecular Biology. Available online 6 August 2019. Prog Biophys Mol Biol 2020; 153:13-16. [PMID: 31887314 DOI: 10.1016/j.pbiomolbio.2019.12.003] [Received: 11/09/2019] [Revised: 12/16/2019] [Accepted: 12/18/2019] [Indexed: 06/10/2023]
14
Abstract
Background Population stratification is a known confounder of genome-wide association studies, as it can lead to false positive results. The principal component analysis (PCA) method is widely applied in the analysis of population structure with common variants. However, its performance when rare variants are used remains unclear. Results We derive a mathematical expectation of the genetic relationship matrix. Variance and covariance elements of the expected matrix depend explicitly on the allele frequencies of the genetic markers used in the PCA. We show that inter-population variance is contained solely in K principal components (PCs) and mostly in the largest K-1 PCs, where K is the number of populations in the samples. We propose F_PC, the ratio of the inter-population variance to the intra-population variance in the K population-informative PCs, and d², the sum of squared distances among populations, as measures of population divergence. We show analytically that when allele frequencies become small, the ratio F_PC abates, the population distance d² decreases, and the portion of variance explained by the K PCs diminishes. The results are validated in an analysis of the 1000 Genomes Project data. The ratio F_PC is 93.85, the population distance d² is 444.38, and the variance explained by the largest five PCs is 17.09% when using common variants with allele frequencies between 0.4 and 0.5. However, the ratio, distance, and percentage decrease to 1.83, 17.83, and 0.74%, respectively, with rare variants of frequencies between 0.0001 and 0.01. Conclusions PCA of population stratification performs worse with rare variants than with common ones. It is necessary to restrict the selection to common variants when analyzing population stratification with sequencing data.
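As a rough illustration of the kind of measure this abstract describes, the sketch below computes a between-population over within-population sum-of-squares ratio on PC scores. The abstract does not give the exact definition or normalization of the authors' ratio, so the function name `variance_ratio`, the sum-of-squares form, and the toy scores are assumptions rather than the published statistic.

```python
import numpy as np

def variance_ratio(scores, labels):
    """Between-population over within-population sum of squares of PC scores.

    scores: samples x PCs array of principal component scores.
    labels: population label for each sample.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    grand = scores.mean(axis=0)
    between = 0.0
    within = 0.0
    for pop in np.unique(labels):
        block = scores[labels == pop]
        centre = block.mean(axis=0)
        # Spread of the population centre around the overall centre.
        between += len(block) * np.sum((centre - grand) ** 2)
        # Spread of individuals around their own population centre.
        within += np.sum((block - centre) ** 2)
    return between / within

# Toy example: two tight clusters on a single PC, centred at -1 and +1.
ratio = variance_ratio([[-1.1], [-0.9], [0.9], [1.1]], [0, 0, 1, 1])
print(ratio)  # approximately 100: between = 4, within = 0.04
```

A large ratio means the populations are well separated along the chosen PCs; a ratio near or below 1 means that individual-level scatter dominates.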
Affiliation(s)
- Shengqing Ma
- State Key Laboratory of Integrated Services Networks, Xidian University, 2 South Taibai Road, Xi'an, 710071, Shaanxi, China
- Gang Shi
- State Key Laboratory of Integrated Services Networks, Xidian University, 2 South Taibai Road, Xi'an, 710071, Shaanxi, China.
15
Yuan V, Price EM, Del Gobbo G, Mostafavi S, Cox B, Binder AM, Michels KB, Marsit C, Robinson WP. Accurate ethnicity prediction from placental DNA methylation data. Epigenetics Chromatin 2019; 12:51. [PMID: 31399127 PMCID: PMC6688210 DOI: 10.1186/s13072-019-0296-3] [Received: 04/25/2019] [Accepted: 07/22/2019] [Indexed: 12/19/2022] [Open access]
Abstract
Background The influence of genetics on variation in DNA methylation (DNAme) is well documented. Yet confounding from population stratification is often unaccounted for in DNAme association studies. Existing approaches to address confounding by population stratification using DNAme data may not generalize to populations or tissues outside those in which they were developed. To aid future placental DNAme studies in assessing population stratification, we developed an ethnicity classifier, PlaNET (Placental DNAme Elastic Net Ethnicity Tool), using five cohorts with Infinium Human Methylation 450k BeadChip array (HM450k) data from placental samples that is also compatible with the newer EPIC platform. Results Data from 509 placental samples were used to develop PlaNET and show that it accurately predicts (accuracy = 0.938, kappa = 0.823) major classes of self-reported ethnicity/race (African: n = 58, Asian: n = 53, Caucasian: n = 389), and produces ethnicity probabilities that are highly correlated with genetic ancestry inferred from genome-wide SNP arrays (> 2.5 million SNP) and ancestry informative markers (n = 50 SNPs). PlaNET’s ethnicity classification relies on 1860 HM450K microarray sites, and over half of these were linked to nearby genetic polymorphisms (n = 955). Our placental-optimized method outperforms existing approaches in assessing population stratification in placental samples from individuals of Asian, African, and Caucasian ethnicities. Conclusion PlaNET provides an improved approach to address population stratification in placental DNAme association studies. The method can be applied to predict ethnicity as a discrete or continuous variable and will be especially useful when self-reported ethnicity information is missing and genotyping markers are unavailable.
Affiliation(s)
- Victor Yuan
- Department of Medical Genetics, University of British Columbia, C201-4500 Oak Street, Vancouver, BC, V6H 3N1, Canada; BC Children's Hospital Research Institute, 938 W 28th Ave, Vancouver, BC, V5Z 4H4, Canada
- E Magda Price
- Department of Medical Genetics, University of British Columbia, C201-4500 Oak Street, Vancouver, BC, V6H 3N1, Canada; BC Children's Hospital Research Institute, 938 W 28th Ave, Vancouver, BC, V5Z 4H4, Canada
- Giulia Del Gobbo
- Department of Medical Genetics, University of British Columbia, C201-4500 Oak Street, Vancouver, BC, V6H 3N1, Canada; BC Children's Hospital Research Institute, 938 W 28th Ave, Vancouver, BC, V5Z 4H4, Canada
- Sara Mostafavi
- Department of Medical Genetics, University of British Columbia, C201-4500 Oak Street, Vancouver, BC, V6H 3N1, Canada; BC Children's Hospital Research Institute, 938 W 28th Ave, Vancouver, BC, V5Z 4H4, Canada; Department of Statistics, University of British Columbia, 3182 Earth Sciences Building, 2207 Main Mall, Vancouver, BC, V6T 1Z4, Canada
- Brian Cox
- Department of Physiology, University of Toronto, Medical Sciences Building, 3rd Floor, 1 King's College Circle, Toronto, ON, M5S 1A8, Canada
- Alexandra M Binder
- Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, CA, 90095, USA
- Karin B Michels
- Department of Epidemiology, Fielding School of Public Health, University of California, Los Angeles, CA, 90095, USA
- Carmen Marsit
- Department of Environmental Health, Emory University, 1518 Clifton Road NE, Atlanta, GA, 30322, USA
- Wendy P Robinson
- Department of Medical Genetics, University of British Columbia, C201-4500 Oak Street, Vancouver, BC, V6H 3N1, Canada; BC Children's Hospital Research Institute, 938 W 28th Ave, Vancouver, BC, V5Z 4H4, Canada
16
Gaspar HA, Breen G. Probabilistic ancestry maps: a method to assess and visualize population substructures in genetics. BMC Bioinformatics 2019; 20:116. [PMID: 30845922 DOI: 10.1186/s12859-019-2680-1] [Received: 08/01/2018] [Accepted: 02/14/2019] [Indexed: 02/07/2023] [Open access]
Abstract
BACKGROUND Principal component analysis (PCA) is a standard method to correct for population stratification in ancestry-specific genome-wide association studies (GWASs) and is used to cluster individuals by ancestry. Using the 1000 genomes project data, we examine how non-linear dimensionality reduction methods such as t-distributed stochastic neighbor embedding (t-SNE) or generative topographic mapping (GTM) can be used to provide improved ancestry maps by accounting for a higher percentage of explained variance in ancestry, and how they can help to estimate the number of principal components necessary to account for population stratification. GTM generates posterior probabilities of class membership which can be used to assess the probability of an individual to belong to a given population - as opposed to t-SNE, GTM can be used for both clustering and classification. RESULTS PCA only partially identifies population clusters and does not separate most populations within a given continent, such as Japanese and Han Chinese in East Asia, or Mende and Yoruba in Africa. t-SNE and GTM, taking into account more data variance, can identify more fine-grained population clusters. GTM can be used to build probabilistic classification models, and is as efficient as support vector machine (SVM) for classifying 1000 Genomes Project populations. CONCLUSION The main interest of probabilistic GTM maps is to attain two objectives with only one map: provide a better visualization that separates populations efficiently, and infer genetic ancestry for individuals or populations. This paper is a first application of GTM for ancestry classification models. Our code ( https://github.com/hagax8/ancestry_viz ) and interactive visualizations ( https://lovingscience.com/ancestries ) are available online.
17
Huerta-Chagoya A, Moreno-Macías H, Fernández-López JC, Ordóñez-Sánchez ML, Rodríguez-Guillén R, Contreras A, Hidalgo-Miranda A, Alfaro-Ruíz LA, Salazar-Fernandez EP, Moreno-Estrada A, Aguilar-Salinas CA, Tusié-Luna T. A panel of 32 AIMs suitable for population stratification correction and global ancestry estimation in Mexican mestizos. BMC Genet 2019; 20:5. [PMID: 30621578 PMCID: PMC6323778 DOI: 10.1186/s12863-018-0707-7] [Received: 09/06/2018] [Accepted: 12/18/2018] [Indexed: 11/16/2022] [Open access]
Abstract
BACKGROUND Association studies are useful to unravel the genetic basis of common human diseases. However, the presence of undetected population structure can lead to both false positive results and failures to detect genuine associations. Although most approaches for dealing with population stratification require genome-wide data, a well-selected panel of ancestry informative markers (AIMs) may appropriately correct for population stratification. Few panels of AIMs have been developed for Latino populations, and most contain a high number of markers (> 100 AIMs). For some association studies, such as candidate gene approaches, it may be infeasible to genotype a large set of markers to avoid false positive results. In such cases, methods that use fewer AIMs may be appropriate. RESULTS We validated an accurate and cost-effective panel of AIMs for use in population stratification correction of association studies and global ancestry estimation in Mexicans, as well as in populations having large proportions of both European and Native American ancestries. Based on genome-wide data from 1953 Mexican individuals, we performed a PCA, and SNP weights were calculated to select subsets of unlinked AIMs within percentiles 0.10 and 0.90, ensuring that all chromosomes were represented. Correlations between PC1 calculated using genome-wide data and PC1 calculated using each subset of AIMs (16, 32, 48, and 64) were r² = 0.923, 0.959, 0.972, and 0.978, respectively. When evaluating the performance of PCs as population stratification adjustment covariates, no correlation was found between P values obtained from uncorrected and genome-wide corrected association analyses (r² = 0.141), highlighting that population stratification correction is compulsory for association analyses in admixed populations. In contrast, high correlations were found when adjusting for both PC1 and PC2 for each subset of AIMs (r² > 0.900). After multiple validations, including an independent sample, we selected a minimal panel of 32 AIMs, which are highly informative of the major ancestral components of Mexican mestizos, namely European and Native American ancestries. Finally, the correlation between the global ancestry proportions calculated using genome-wide data and our panel of 32 AIMs was r² = 0.972. CONCLUSIONS Our panel of 32 AIMs accurately estimated global ancestry and corrected for population stratification in association studies in Mexican individuals.
Affiliation(s)
- Alicia Huerta-Chagoya
- CONACYT, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Ciudad de Mexico, Mexico
- María Luisa Ordóñez-Sánchez
- Unidad de Biología Molecular y Medicina Genómica, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Ciudad de Mexico, Mexico
- Rosario Rodríguez-Guillén
- Unidad de Biología Molecular y Medicina Genómica, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Ciudad de Mexico, Mexico
- Alejandra Contreras
- Instituto Nacional de Medicina Genómica, Ciudad de Mexico, Mexico
- Fox Chase Cancer Center, Philadelphia, USA
- Alfredo Hidalgo-Miranda
- Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica, Ciudad de Mexico, Mexico
- Luis Alberto Alfaro-Ruíz
- Laboratorio de Genómica del Cáncer, Instituto Nacional de Medicina Genómica, Ciudad de Mexico, Mexico
- Andrés Moreno-Estrada
- Laboratorio Nacional de Genómica para la Biodiversidad (LANGEBIO-UGA), CINVESTAV, Irapuato, Guanajuato, Mexico
- Carlos Alberto Aguilar-Salinas
- Departamento de Endocrinología y Metabolismo, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Ciudad de Mexico, Mexico
- Teresa Tusié-Luna
- Unidad de Biología Molecular y Medicina Genómica, Instituto Nacional de Ciencias Médicas y Nutrición Salvador Zubirán, Ciudad de Mexico, Mexico
- Departamento de Medicina Genómica y Toxicología Ambiental, Instituto de Investigaciones Biomédicas, UNAM, Ciudad de Mexico, Mexico
18
Jiang L, Wei YL, Zhao L, Li N, Liu T, Liu HB, Ren LJ, Li JL, Hao HF, Li Q, Li CX. Global analysis of population stratification using a smart panel of 27 continental ancestry-informative SNPs. Forensic Sci Int Genet 2018; 35:e10-e12. [PMID: 29803513 DOI: 10.1016/j.fsigen.2018.05.006] [Received: 03/27/2018] [Revised: 05/03/2018] [Accepted: 05/14/2018] [Indexed: 12/21/2022]
Abstract
Over the last decade, several panels of ancestry-informative markers have been proposed for the analysis of population genetic structure. The differentiation efficiency depends on the discriminatory ability of the included markers and the reference population coverage. We previously developed a small set of 27 autosomal single nucleotide polymorphisms (SNPs) for analyzing African, European, and East Asian ancestries. In the current study, we gathered a high-coverage reference database of 110 populations (10,350 individuals) from across the globe. The discrimination power of the panel was re-evaluated using four continental ancestry groups (as well as Indigenous Americans). We observed that all 27 SNPs demonstrated stratified population specificity, leading to striking ancestral discrimination. Five markers (rs728404, rs7170869, rs2470102, rs1448485, and rs4789193) showed differences (δ > 0.3) in the frequency profiles between East Asian and Indigenous American populations. Ancestry components of all included populations were accurately assessed when compared with those from previous genome-wide analyses, achieving broad population separation. Thus, our ancestral inference panel of a small number of highly informative SNPs, in combination with a large-scale reference database, provides high resolution in estimating ancestry compositions and distinguishing individual origins. We propose extensive usage in biomedical studies and forensics.
Affiliation(s)
- Li Jiang
- Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, Institute of Forensic Science, Beijing, 100038, People's Republic of China
- Yi-Liang Wei
- Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, Institute of Forensic Science, Beijing, 100038, People's Republic of China; Department of Immunology, Biochemistry and Molecular Biology, 2011 Collaborative Innovation Center of Tianjin for Medical Epigenetics, Tianjin Key Laboratory of Medical Epigenetics, Tianjin Medical University, Tianjin, 300070, People's Republic of China
| | - Lei Zhao
- Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, Institute of Forensic Science, Beijing, 100038, People's Republic of China
| | - Na Li
- Department of Pathology, Xingtai Medical College, Xingtai, Hebei, 054000, People's Republic of China
| | - Tao Liu
- Department of Neurology, Hebei Civil Administration General Hospital, Xingtai, 054000, Hebei, People's Republic of China
| | - Hai-Bo Liu
- Institution of Forensic Science of Bingtuan Public Security Bureau, Ürümqi, 830000, Xinjiang, People's Republic of China
| | - Li-Jie Ren
- The 519th Hospital of the People's Liberation Army, Wenchang, Hainan, 300457, People's Republic of China
| | - Jiu-Ling Li
- Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, Institute of Forensic Science, Beijing, 100038, People's Republic of China
| | - Hui-Fang Hao
- Department of Nephrology, Tianjin TEDA Hospital, Tianjin, 300457, People's Republic of China
| | - Qing Li
- Department of Nephrology, Tianjin TEDA Hospital, Tianjin, 300457, People's Republic of China
| | - Cai-Xia Li
- Key Laboratory of Forensic Genetics, Beijing Engineering Research Center of Crime Scene Evidence Examination, Institute of Forensic Science, Beijing, 100038, People's Republic of China.
19
Abstract
Relatedness within a sample can be of ancient (population stratification) or recent (familial structure) origin, and can either be known (pedigree data) or unknown (cryptic relatedness). All of these forms of familial relatedness have the potential to confound the results of genome-wide association studies. This chapter reviews the major methods available to researchers to adjust for the biases introduced by relatedness and maximize power to detect associations. The advantages and disadvantages of different methods are presented with reference to elements of study design, population characteristics, and computational requirements.
Affiliation(s)
- Russell Thomson
- Centre for Research in Mathematics, School of Computing, Engineering and Mathematics, Western Sydney University, Parramatta, Australia.
| | - Rebekah McWhirter
- Menzies Institute for Medical Research, University of Tasmania, Hobart, TAS, Australia
20
Zhou YH, Marron JS, Wright FA. Eigenvalue significance testing for genetic association. Biometrics 2017; 74:439-447. [PMID: 28853138 DOI: 10.1111/biom.12767] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 08/01/2016] [Revised: 05/01/2017] [Accepted: 07/01/2017] [Indexed: 11/26/2022]
Abstract
Genotype eigenvectors are widely used as covariates for control of spurious stratification in genetic association. Significance testing for the accompanying eigenvalues has typically been based on a standard Tracy-Widom limiting distribution for the largest eigenvalue, derived under white-noise assumptions. It is known that even modest local correlation among markers inflates the largest eigenvalues, even in the absence of true stratification. In addition, a few sample eigenvalues may be extreme, creating further complications in accurate testing. We explore several methods to identify appropriate null eigenvalue thresholds, while remaining sensitive to eigenvalues corresponding to population stratification. We introduce a novel block permutation approach, designed to produce an appropriate null eigenvalue distribution by eliminating long-range genomic correlation while preserving local correlation. We also propose a fast approach based on eigenvalue distribution modeling, using a simple fit criterion and the general Marčenko-Pastur equation under a simple discrete eigenvalue model. Block permutation and the model-based approach work well for pure simulations and for data resampled from the 1000 Genomes project. In contrast, we find that the standard approach of computing an "effective" number of markers does not perform well. The performance of the methods is also demonstrated for a motivating example from the International Cystic Fibrosis Consortium.
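The block permutation idea described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: individuals are permuted independently within each block of adjacent markers, which keeps local correlation inside a block but breaks correlation between blocks. The function names, the block size, the simulated genotypes, and the use of a 95% quantile as a threshold are all assumptions made for this example.

```python
import numpy as np


def block_permute(genotypes, block_size, rng):
    """Permute individuals (rows) independently within each block of adjacent markers."""
    n, p = genotypes.shape
    permuted = np.empty_like(genotypes)
    for start in range(0, p, block_size):
        stop = min(start + block_size, p)
        order = rng.permutation(n)  # a fresh row order for every block
        permuted[:, start:stop] = genotypes[order, start:stop]
    return permuted


def top_eigenvalue(genotypes):
    """Largest eigenvalue of the individual-by-individual covariance of standardized genotypes."""
    centered = genotypes - genotypes.mean(axis=0)
    sd = centered.std(axis=0)
    keep = sd > 0  # drop monomorphic markers
    z = centered[:, keep] / sd[keep]
    cov = z @ z.T / z.shape[1]
    return float(np.linalg.eigvalsh(cov)[-1])


rng = np.random.default_rng(0)
# Simulated genotypes (0/1/2 allele counts), hypothetical data with no real structure.
g = rng.binomial(2, 0.3, size=(50, 200)).astype(float)

# Empirical null distribution of the top eigenvalue under block permutation.
null_top = [top_eigenvalue(block_permute(g, 20, rng)) for _ in range(20)]
threshold = float(np.quantile(null_top, 0.95))
```

An observed top eigenvalue could then be compared against `threshold`; in practice many more permutations and a principled choice of block size would be needed.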
Affiliation(s)
- Yi-Hui Zhou
- Bioinformatics Research Center and Department of Biological Sciences, North Carolina State University, North Carolina, U.S.A
| | - J S Marron
- Department of Statistics and Operations Research, University of North Carolina, North Carolina, U.S.A
| | - Fred A Wright
- Bioinformatics Research Center and Departments of Statistics and Biological Sciences, North Carolina State University, North Carolina, U.S.A
21
Lutz SM, Thwing A, Schmiege S, Kroehl M, Baker CD, Starling AP, Hokanson JE, Ghosh D. Examining the role of unmeasured confounding in mediation analysis with genetic and genomic applications. BMC Bioinformatics 2017; 18:344. [PMID: 28724417 PMCID: PMC5517807 DOI: 10.1186/s12859-017-1749-y] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.3] [Received: 04/03/2017] [Accepted: 07/04/2017] [Indexed: 11/23/2022] Open
Abstract
Background In mediation analysis, if unmeasured confounding is present, the estimates for the direct and mediated effects may be over- or underestimated. Most methods for the sensitivity analysis of unmeasured confounding in mediation have focused on the mediator-outcome relationship. Results The Umediation R package enables the user to simulate unmeasured confounding of the exposure-mediator, exposure-outcome, and mediator-outcome relationships in order to see how the results of the mediation analysis would change in the presence of unmeasured confounding. We apply the Umediation package to the Genetic Epidemiology of Chronic Obstructive Pulmonary Disease (COPDGene) study to examine the role of unmeasured confounding due to population stratification on the effect of a single nucleotide polymorphism (SNP) in the CHRNA5/3/B4 locus on pulmonary function decline as mediated by cigarette smoking. Conclusions Umediation is a flexible R package that examines the role of unmeasured confounding in mediation analysis, allowing for normally distributed or Bernoulli distributed exposures, outcomes, mediators, measured confounders, and unmeasured confounders. Umediation also accommodates multiple measured confounders and multiple unmeasured confounders, and allows for a mediator-exposure interaction on the outcome. Umediation is available as an R package at https://github.com/SharonLutz/Umediation. A tutorial on how to install and use the Umediation package is available in Additional file 1. Electronic supplementary material The online version of this article (doi:10.1186/s12859-017-1749-y) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sharon M Lutz
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 E. 17th Place, B119 Bldg. 500 W3128, Aurora, CO, 80045, USA.
| | - Annie Thwing
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 E. 17th Place, B119 Bldg. 500 W3128, Aurora, CO, 80045, USA
| | - Sarah Schmiege
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 E. 17th Place, B119 Bldg. 500 W3128, Aurora, CO, 80045, USA
| | - Miranda Kroehl
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 E. 17th Place, B119 Bldg. 500 W3128, Aurora, CO, 80045, USA
| | - Christopher D Baker
- Department of Pediatrics and Pulmonary Medicine, Children's Hospital Colorado, Aurora, CO, USA
| | - Anne P Starling
- Department of Epidemiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - John E Hokanson
- Department of Epidemiology, University of Colorado Anschutz Medical Campus, Aurora, CO, USA
| | - Debashis Ghosh
- Department of Biostatistics and Informatics, University of Colorado Anschutz Medical Campus, 13001 E. 17th Place, B119 Bldg. 500 W3128, Aurora, CO, 80045, USA
22
Zhou YH, Marron JS, Wright FA. Computation of ancestry scores with mixed families and unrelated individuals. Biometrics 2017; 74:155-164. [PMID: 28452052 DOI: 10.1111/biom.12708] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 06/01/2016] [Revised: 03/01/2017] [Accepted: 03/01/2017] [Indexed: 01/03/2023]
Abstract
The issue of robustness to family relationships in computing genotype ancestry scores such as eigenvector projections has received increased attention in genetic association, and is particularly challenging when sets of both unrelated individuals and closely related family members are included. The current standard is to compute loadings (left singular vectors) using unrelated individuals and to compute projected scores for remaining family members. However, projected ancestry scores from this approach suffer from shrinkage toward zero. We consider two main novel strategies: (i) matrix substitution based on decomposition of a target family-orthogonalized covariance matrix, and (ii) using family-averaged data to obtain loadings. We illustrate the performance via simulations, including resampling from 1000 Genomes Project data, and analysis of a cystic fibrosis dataset. The matrix substitution approach has similar performance to the current standard, but is simple and uses only a genotype covariance matrix, while the family-average method shows superior performance. Our approaches are accompanied by novel ancillary approaches that provide considerable insight, including individual-specific eigenvalue scree plots.
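The "current standard" described above (loadings from unrelated individuals, projected scores for the remaining family members) can be sketched with a plain SVD. This is a minimal illustration, not the authors' code. It assumes a matrix with individuals in rows, so the marker loadings are the right singular vectors here; the abstract's "left singular vectors" corresponds to the transposed layout. Function names and the simulated genotypes are hypothetical.

```python
import numpy as np


def ancestry_loadings(unrelated, n_components=2):
    """Compute marker loadings from unrelated individuals only (rows = individuals)."""
    col_mean = unrelated.mean(axis=0)
    centered = unrelated - col_mean
    # With individuals in rows, the rows of vt are the marker loadings.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return col_mean, vt[:n_components].T


def project(genotypes, col_mean, loadings):
    """Project individuals onto loadings estimated from the unrelated set."""
    return (genotypes - col_mean) @ loadings


rng = np.random.default_rng(1)
# Simulated allele counts, hypothetical data for illustration only.
unrelated = rng.binomial(2, 0.4, size=(30, 100)).astype(float)
relatives = rng.binomial(2, 0.4, size=(5, 100)).astype(float)

col_mean, loadings = ancestry_loadings(unrelated)
relative_scores = project(relatives, col_mean, loadings)
```

Scores obtained this way for the projected individuals are the ones the abstract describes as shrinking toward zero, which motivates the alternative strategies the authors propose.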
Affiliation(s)
- Yi-Hui Zhou
- Department of Biological Sciences, Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, U.S.A
| | - James S Marron
- Department of Statistics and Operations Research, University of North Carolina, Chapel Hill, U.S.A
| | - Fred A Wright
- Department of Biological Sciences and Statistics, Bioinformatics Research Center, North Carolina State University, Raleigh, U.S.A
23
Derks EM, Zwinderman AH, Gamazon ER. The Relation Between Inflation in Type-I and Type-II Error Rate and Population Divergence in Genome-Wide Association Analysis of Multi-Ethnic Populations. Behav Genet 2017; 47:360-8. [PMID: 28185111 DOI: 10.1007/s10519-017-9837-3] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.7] [Received: 05/02/2016] [Accepted: 01/19/2017] [Indexed: 12/24/2022]
Abstract
Population divergence impacts the degree of population stratification in Genome Wide Association Studies. We aim to: (i) investigate type-I error rate as a function of population divergence (FST) in multi-ethnic (admixed) populations; (ii) evaluate the statistical power and effect size estimates; and (iii) investigate the impact of population stratification on the results of gene-based analyses. Quantitative phenotypes were simulated. Type-I error rate was investigated for Single Nucleotide Polymorphisms (SNPs) with varying levels of FST between the ancestral European and African populations. Type-II error rate was investigated for a SNP characterized by a high value of FST. In all tests, genomic MDS components were included to correct for population stratification. Type-I and type-II error rate was adequately controlled in a population that included two distinct ethnic populations but not in admixed samples. Statistical power was reduced in the admixed samples. Gene-based tests showed no residual inflation in type-I error rate.
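As a rough illustration of the divergence measure used above, the sketch below computes FST for a single biallelic SNP from the allele frequencies of two populations, using the heterozygosity-based definition FST = (H_T - H_S) / H_T with the two populations weighted equally. The equal weighting and the example frequencies are assumptions for illustration; the study itself may have used a different estimator.

```python
def fst_two_populations(p1: float, p2: float) -> float:
    """Wright's FST for one biallelic SNP and two equally weighted populations."""
    p_bar = (p1 + p2) / 2
    h_total = 2 * p_bar * (1 - p_bar)  # expected heterozygosity in the pooled population
    if h_total == 0:
        return 0.0  # SNP is monomorphic in both populations
    h_within = (2 * p1 * (1 - p1) + 2 * p2 * (1 - p2)) / 2  # mean within-population heterozygosity
    return (h_total - h_within) / h_total


# Hypothetical allele frequencies in two ancestral populations.
fst_example = fst_two_populations(0.2, 0.8)
```

Identical frequencies give an FST of 0, and a SNP fixed for different alleles in the two populations gives 1.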
24
Abstract
The last decade has seen substantial advances in the understanding of the genetics of complex traits and disease. This has been largely driven by genome-wide association studies (GWAS), which have identified thousands of genetic loci associated with these traits and disease. This chapter provides a guide on how to perform GWAS on both binary (case-control) and quantitative traits. As poor data quality, through both genotyping failures and unobserved population structure, is a major cause of false-positive genetic associations, there is a particular focus on the crucial steps required to prepare the SNP data prior to analysis. This is followed by the methods used to perform the actual GWAS and visualization of the results.
Affiliation(s)
- Allan F McRae
- Centre for Neurogenetics and Statistical Genomics, Queensland Brain Institute, The University of Queensland, St Lucia, QLD, 4072, Australia.
25
Abstract
The Hardy-Weinberg principle, one of the most important principles in population genetics, was originally developed for the study of allele frequency changes in a population over generations. It is now, however, widely used in studies of human diseases to detect inbreeding, population stratification, and genotyping errors. For assessment of deviation from Hardy-Weinberg proportions in data, the most popular approaches include the asymptotic Pearson's chi-squared goodness-of-fit test and the exact test. Pearson's chi-squared goodness-of-fit test is simple and straightforward, but is very sensitive to a small sample size or rare allele frequency. The exact test of Hardy-Weinberg proportions is preferable in these situations. The exact test can be performed through complete enumeration of heterozygote genotypes or on the basis of the Markov chain Monte Carlo procedure. In this chapter, we describe the Hardy-Weinberg principle and the commonly used Hardy-Weinberg proportion tests and their applications, and we demonstrate how the chi-squared test and exact test of Hardy-Weinberg proportions can be performed step-by-step using the popular software programs SAS, R, and PLINK, which have been widely used in genetic association studies, along with numerical examples. We also discuss approaches for testing Hardy-Weinberg proportions in case-control study designs that are better than traditional approaches for testing Hardy-Weinberg proportions in controls only. Finally, we note that deviation from the Hardy-Weinberg proportions in affected individuals can provide evidence for an association between genetic variants and diseases.
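The asymptotic Pearson chi-squared goodness-of-fit test mentioned above can be sketched in a few lines for a biallelic marker. This is a minimal illustration rather than the chapter's SAS, R, or PLINK code: the function name and genotype counts are hypothetical, and the p-value uses the identity that the upper tail of a chi-squared distribution with 1 degree of freedom equals erfc(sqrt(x / 2)).

```python
import math


def hwe_chi_square(n_aa: int, n_ab: int, n_bb: int):
    """Pearson chi-squared test of Hardy-Weinberg proportions for one biallelic marker."""
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)  # frequency of allele A
    q = 1 - p
    if p == 0 or q == 0:
        raise ValueError("marker is monomorphic; the test is undefined")
    observed = (n_aa, n_ab, n_bb)
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    p_value = math.erfc(math.sqrt(chi2 / 2))  # upper tail of chi-squared with 1 df
    return chi2, p_value


# Hypothetical genotype counts: one set in exact HWE, one with no heterozygotes.
chi2_ok, p_ok = hwe_chi_square(25, 50, 25)
chi2_bad, p_bad = hwe_chi_square(50, 0, 50)
```

As the abstract notes, this asymptotic test is unreliable for small samples or rare alleles, where the exact test is preferable.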
Affiliation(s)
- Jian Wang
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA
| | - Sanjay Shete
- Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
- Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, TX, 77030, USA.
26
Jorge-Nebert LF, Zhang G, Wilson KM, Jiang Z, Butler R, Gluckman JL, Pinney SM, Nebert DW. Head-and-neck squamous cell carcinoma risk in smokers: no association detected between phenotype and AHR, CYP1A1, CYP1A2, or CYP1B1 genotype. Hum Genomics 2016; 10:39. [PMID: 27894333 PMCID: PMC5127090 DOI: 10.1186/s40246-016-0094-y] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Received: 10/12/2016] [Accepted: 11/04/2016] [Indexed: 12/23/2022] Open
Abstract
BACKGROUND Head-and-neck squamous cell carcinoma (HNSCC) differs between smokers and nonsmokers in etiology and clinical presentation. Because of demonstrated unequivocal involvement in smoking-induced cancer in laboratory animals, four candidate genes--AHR, CYP1A1, CYP1A2, and CYP1B1--were selected for a clinical genotype-phenotype association study of HNSCC risk in smokers. Thirty-six single-nucleotide variants (mostly tag-SNPs) within and near these four genes [16 (AHR), 4 (CYP1A1), 4 (CYP1A2), and 12 (CYP1B1)] were chosen. METHODS Extreme discordant phenotype (EDP) method of analysis was used to increase statistical power. HNSCC patients--having smoked 1-40 cigarette pack-years--represented the "highly-sensitive" (HS) population; heavy smokers having smoked ≥80 cigarette-pack-years without any type of cancer comprised the "highly-resistant" (HR) group. The vast majority of smokers were intermediate and discarded from consideration. Statistical tests were performed on N = 112 HS and N = 99 HR DNA samples from whole blood. CONCLUSIONS Among the four genes and flanking regions--one haploblock, ACTTGATC in the 5' portion of CYP1B1, retained statistical significance after 100,000 permutations (P = 0.0042); among our study population, this haploblock was found in 36.4% of African-American, but only 1.49% of Caucasian, HNSCC chromosomes. Interestingly, in the 1000 Genomes Project database, frequency of this haplotype (in 1322 African and 1006 Caucasian chromosomes) is 0.356 and 0.003, respectively. This study represents an excellent example of "spurious association by population stratification". Considering the cohort size, we therefore conclude that the variant alleles chosen for these four genes, alone or in combinations, are not statistically significantly associated with risk of cigarette-smoking-induced HNSCC.
Affiliation(s)
- Lucia F Jorge-Nebert
- Department of Environmental Health and Center for Environmental Genetics, University of Cincinnati Medical Center, Cincinnati, OH, 45267-0056, USA
| | - Ge Zhang
- Department of Environmental Health and Center for Environmental Genetics, University of Cincinnati Medical Center, Cincinnati, OH, 45267-0056, USA.,Division of Human Genetics, Department of Pediatrics & Molecular Developmental Biology, Cincinnati Children's Hospital, Cincinnati, Ohio, 45229-2899, USA
| | - Keith M Wilson
- Department of Otolaryngology-Head and Neck Surgery, University of Cincinnati College of Medicine, Cincinnati, OH, 45267-0528, USA
| | - Zhengwen Jiang
- Department of Environmental Health and Center for Environmental Genetics, University of Cincinnati Medical Center, Cincinnati, OH, 45267-0056, USA.,Present address: Genesky Diagnostics, Suzhou, China
| | - Randall Butler
- Department of Pathology and Laboratory Medicine, University of Cincinnati College of Medicine, Cincinnati, OH, 45267-0533, USA
| | - Jack L Gluckman
- Department of Otolaryngology-Head and Neck Surgery, University of Cincinnati College of Medicine, Cincinnati, OH, 45267-0528, USA
| | - Susan M Pinney
- Department of Environmental Health and Center for Environmental Genetics, University of Cincinnati Medical Center, Cincinnati, OH, 45267-0056, USA
| | - Daniel W Nebert
- Department of Environmental Health and Center for Environmental Genetics, University of Cincinnati Medical Center, Cincinnati, OH, 45267-0056, USA.
27
Taudien S, Lausser L, Giamarellos-Bourboulis EJ, Sponholz C, Schöneweck F, Felder M, Schirra LR, Schmid F, Gogos C, Groth S, Petersen BS, Franke A, Lieb W, Huse K, Zipfel PF, Kurzai O, Moepps B, Gierschik P, Bauer M, Scherag A, Kestler HA, Platzer M. Genetic Factors of the Disease Course After Sepsis: Rare Deleterious Variants Are Predictive. EBioMedicine 2016; 12:227-238. [PMID: 27639823 PMCID: PMC5078585 DOI: 10.1016/j.ebiom.2016.08.037] [Citation(s) in RCA: 28] [Impact Index Per Article: 3.5] [Received: 05/24/2016] [Revised: 08/19/2016] [Accepted: 08/24/2016] [Indexed: 12/20/2022] Open
Abstract
Sepsis is a life-threatening organ dysfunction caused by dysregulated host response to infection. For its clinical course, host genetic factors are important and rare genomic variants are suspected to contribute. We sequenced the exomes of 59 Greek and 15 German patients with bacterial sepsis divided into two groups with extremely different disease courses. Variant analysis was focusing on rare deleterious single nucleotide variants (SNVs). We identified significant differences in the number of rare deleterious SNVs per patient between the ethnic groups. Classification experiments based on the data of the Greek patients allowed discrimination between the disease courses with estimated sensitivity and specificity > 75%. By application of the trained model to the German patients we observed comparable discriminatory properties despite lower population-specific rare SNV load. Furthermore, rare SNVs in genes of cell signaling and innate immunity related pathways were identified as classifiers discriminating between the sepsis courses. Sepsis patients with favorable disease course after sepsis, even in the case of unfavorable preconditions, seem to be affected more often by rare deleterious SNVs in cell signaling and innate immunity related pathways, suggesting a protective role of impairments in these processes against a poor disease course. Rare SNV load is higher in the Greek vs. German population. Subsets of rare deleterious SNVs are predictive for the disease course after sepsis. Patients with favorable disease course seem to carry protective deleterious variants in sepsis related pathways.
Sepsis is a life-threatening disease caused by improper response to infection. Only little is known about the role of genetic factors. From > 4000 patients we selected the most extreme cases showing either a favorable or adverse disease course. We determined rare (< 1/200) protein-damaging genetic variants, as they may have a large effect. Using a computational model that includes knowledge on genes we can predict the disease course with > 75% accuracy. Surprisingly, favorable courses can be expected if defense mechanisms are damaged and prevented from overshooting. This underlines the relevance of rare variants for better understanding of sepsis and may offer new treatment options.
Affiliation(s)
- Stefan Taudien
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
| | - Ludwig Lausser
- Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany; Institute of Medical Systems Biology, Ulm University, Germany
| | - Evangelos J Giamarellos-Bourboulis
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; 4th Department of Internal Medicine, National and Kapodistrian University of Athens, Athens, Greece
| | - Christoph Sponholz
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany; Department of Anaesthesiology and Intensive Care Therapy, Jena University Hospital, Jena, Germany
| | - Franziska Schöneweck
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; Research group Clinical Epidemiology, CSCC, Jena University Hospital, Jena, Germany
| | - Marius Felder
- Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
| | | | - Florian Schmid
- Institute of Medical Systems Biology, Ulm University, Germany
| | - Charalambos Gogos
- Department of Internal Medicine, University of Patras, Medical School, Greece
| | - Susann Groth
- Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
| | - Britt-Sabina Petersen
- Institute of Clinical Molecular Biology, Christian-Albrechts-Universität Kiel, Kiel, Germany
| | - Andre Franke
- Institute of Clinical Molecular Biology, Christian-Albrechts-Universität Kiel, Kiel, Germany
| | - Wolfgang Lieb
- Institute of Epidemiology, Christian-Albrechts-Universität Kiel, Kiel, Germany
| | - Klaus Huse
- Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany
| | - Peter F Zipfel
- Leibniz Institute for Natural Product Research and Infection Biology - Hans-Knöll-Institute, Jena, Germany; Friedrich Schiller University Jena, Jena, Germany
| | - Oliver Kurzai
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; Septomics Research Center Jena, Leibniz Institute for Natural Product Research and Infection Biology - Hans-Knöll-Institute, Jena, Germany
| | - Barbara Moepps
- Institute of Pharmacology and Toxicology, Ulm University Medical Center, Ulm, Germany
| | - Peter Gierschik
- Institute of Pharmacology and Toxicology, Ulm University Medical Center, Ulm, Germany
| | - Michael Bauer
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; Department of Anaesthesiology and Intensive Care Therapy, Jena University Hospital, Jena, Germany
| | - André Scherag
- Integrated Research and Treatment Center, Center for Sepsis Control and Care (CSCC), Jena University Hospital, Jena, Germany; Research group Clinical Epidemiology, CSCC, Jena University Hospital, Jena, Germany
| | - Hans A Kestler
- Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany; Institute of Medical Systems Biology, Ulm University, Germany; Friedrich Schiller University Jena, Jena, Germany.
| | - Matthias Platzer
- Leibniz Institute on Aging - Fritz Lipmann Institute, Jena, Germany.
28
Vulturar R, Chiş A, Hambrich M, Kelemen B, Ungureanu L, Miu AC. Allelic distribution of BDNF Val66Met polymorphism in healthy Romanian volunteers. Transl Neurosci 2016; 7:31-34. [PMID: 28123819 PMCID: PMC5017592 DOI: 10.1515/tnsci-2016-0006] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.5] [Received: 12/18/2015] [Accepted: 01/19/2016] [Indexed: 12/26/2022] Open
Abstract
Population stratification of functional gene polymorphisms is a potential confounding factor in genetic association studies. The Val66Met (rs6265) single-nucleotide polymorphism in the brain-derived neurotrophic factor gene (BDNF) exhibits one of the highest variabilities in terms of allelic distribution between populations. The present study reports the distribution of BDNF Val66Met alleles in a sample of healthy volunteers (N = 1124) selected from the Romanian population. Frequencies were 80.74% for the Val allele and 19.26% for the Met allele. The data from this study extend efforts to map the allelic distribution of BDNF Val66Met in populations around the world and emphasize that population stratification should be controlled for in future studies that report phenotypic associations in samples from different populations.
Affiliation(s)
- Romana Vulturar
- Discipline of Cell and Molecular Biology, Department of Molecular Sciences, "Iuliu Haţieganu" University of Medicine and Pharmacy, Cluj-Napoca, Romania; Cognitive Neuroscience Laboratory, Department of Psychology, Babeş-Bolyai University, Cluj-Napoca, Romania
| | - Adina Chiş
- Discipline of Cell and Molecular Biology, Department of Molecular Sciences, "Iuliu Haţieganu" University of Medicine and Pharmacy, Cluj-Napoca, Romania; Cognitive Neuroscience Laboratory, Department of Psychology, Babeş-Bolyai University, Cluj-Napoca, Romania
| | - Melinda Hambrich
- Discipline of Medical Psychology, Department of Neurosciences, "Iuliu Haţieganu" University of Medicine and Pharmacy, Cluj-Napoca, Romania
| | - Beatrice Kelemen
- Molecular Biology Center, Interdisciplinary Research Institute on Bio-Nano-Sciences, Babeş-Bolyai University, Cluj-Napoca, Romania
| | - Loredana Ungureanu
- Department of Dermatology, "Iuliu Haţieganu" University of Medicine and Pharmacy, Cluj-Napoca, Romania
| | - Andrei C Miu
- Cognitive Neuroscience Laboratory, Department of Psychology, Babeş-Bolyai University, Cluj-Napoca, Romania
29
Shen J, Li Z, Shi Y. SHEsisPCA: a GPU-based software to correct for population stratification that efficiently accelerates the process for handling genome-wide datasets. J Genet Genomics 2015; 42:445-53. [PMID: 26336801 DOI: 10.1016/j.jgg.2015.06.007] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 03/24/2015] [Revised: 05/30/2015] [Accepted: 06/10/2015] [Indexed: 11/24/2022]
Abstract
Population stratification is a problem in genetic association studies because it is likely to highlight loci that underlie the population structure rather than disease-related loci. At present, principal component analysis (PCA) has been proven to be an effective way to correct for population stratification. However, the conventional PCA algorithm is time-consuming when dealing with large datasets. We developed a graphics processing unit (GPU)-based PCA software named SHEsisPCA (http://analysis.bio-x.cn/SHEsisMain.htm) that is highly parallel, with a maximum speedup of more than 100-fold compared with its CPU version. A cluster algorithm based on X-means was also implemented as a way to detect population subgroups and to obtain matched cases and controls in order to reduce the genomic inflation and increase the power. A study of both simulated and real datasets showed that SHEsisPCA ran at an extremely high speed while the accuracy was hardly reduced. Therefore, SHEsisPCA can help correct for population stratification much more efficiently than the conventional CPU-based algorithms.
30
Shin J, Lee C. A mixed model reduces spurious genetic associations produced by population stratification in genome-wide association studies. Genomics 2015; 105:191-6. [PMID: 25640449 DOI: 10.1016/j.ygeno.2015.01.006] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.7] [Received: 07/04/2014] [Revised: 01/21/2015] [Accepted: 01/23/2015] [Indexed: 01/06/2023]
Abstract
Population stratification can produce spurious genetic associations in genome-wide association studies (GWASs). Mixed model methodology has been regarded as useful for correcting population stratification. This study explored statistical power and false discovery rate (FDR) with data simulated for dichotomous traits. Empirical FDRs and powers were estimated using fixed models with and without genomic control and using mixed models with and without reflecting loci linked to the candidate marker in genetic relationships. Population stratification with admixture degree ranging from 1% to 10% resulted in inflated FDRs from the fixed model analysis without genomic control and decreased power from the fixed model analysis with genomic control (P<0.05). Meanwhile, population stratification did not change FDR and power estimates from the mixed model analyses (P>0.05). We suggest that the mixed model methodology was useful to reduce spurious genetic associations produced by population stratification in GWAS, even with a high degree of admixture (10%).
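Genomic control, referred to above, is commonly implemented by estimating an inflation factor λ as the median of the observed 1-df chi-squared association statistics divided by the median of the chi-squared distribution with 1 degree of freedom (about 0.4549), and then dividing every statistic by λ. The sketch below shows that standard recipe; whether the study used exactly this form is an assumption, and the test statistics are hypothetical.

```python
from statistics import median

# Median of the chi-squared distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_inflation(chi2_stats):
    """Estimate the genomic inflation factor lambda from 1-df chi-squared statistics."""
    return median(chi2_stats) / CHI2_1DF_MEDIAN


def genomic_control(chi2_stats):
    """Deflate statistics by lambda; lambda is floored at 1 so statistics are never inflated."""
    lam = max(genomic_inflation(chi2_stats), 1.0)
    return [s / lam for s in chi2_stats]


# Hypothetical association statistics, for illustration only.
observed = [0.2, 0.9098728, 5.0]
lam = genomic_inflation(observed)
adjusted = genomic_control(observed)
```

Because every statistic is deflated by the same factor, true signals are deflated too, which is consistent with the reduced power the abstract reports for the fixed model with genomic control.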
Affiliation(s)
- Jimin Shin
- Department of Bioinformatics and Life Science, Soongsil University, Seoul 156-743, Republic of Korea
| | - Chaeyoung Lee
- Department of Bioinformatics and Life Science, Soongsil University, Seoul 156-743, Republic of Korea.
| |
Collapse
|
31
|
Oliveira AM, Domingues PM, Gomes V, Amorim A, Jannuzzi J, de Carvalho EF, Gusmão L. Male lineage strata of Brazilian population disclosed by the simultaneous analysis of STRs and SNPs. Forensic Sci Int Genet 2014; 13:264-8. [PMID: 25259770 DOI: 10.1016/j.fsigen.2014.08.017] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Received: 07/22/2014] [Revised: 08/28/2014] [Accepted: 08/29/2014] [Indexed: 12/09/2022]
Abstract
Brazil has a large territory divided into five geographical regions harboring highly diverse populations that resulted from different degrees and modes of admixture between Native Americans, Europeans and Africans. In this study, a sample of 605 unrelated males was genotyped for 17 Y-STRs and 46 Y-SNPs, aiming at a deep characterization of the male gene pool of Rio de Janeiro and its comparison with other Brazilian populations. High values of Y-STR haplotype diversity (0.9999±0.0001) and Y-SNP haplogroup diversity (0.7589±0.0171) were observed. Population comparisons at both haplotype and haplogroup levels showed significant differences between Brazilian South Eastern and Northern populations that can be explained by differences in the proportion of African and Native American Y chromosomes. Statistically significant differences between admixed urban samples from the five regions of Brazil were not previously detected at the haplotype level based on smaller samples from the South East, which emphasizes the importance of sample size in detecting population stratification for an accurate interpretation of profile matches in kinship and forensic casework. Although the Y-SNPs do not have an intra-population discrimination power as high as that of the Y-STRs, they are more powerful in disclosing differences in admixed populations. In this study, the combined analysis of these two types of markers proved to be a good strategy to predict population sub-structure, which should be taken into account when delineating forensic database strategies for Y chromosome haplotypes.
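Diversity values like those quoted above are commonly computed with Nei's unbiased estimator, H = n/(n-1) × (1 − Σ p_i²), where p_i is the frequency of the i-th haplotype (or haplogroup) among n sampled chromosomes. The sketch below assumes that estimator; the abstract does not say which formula the authors used, and the function name is hypothetical.

```python
from collections import Counter

def haplotype_diversity(haplotypes):
    """Nei's unbiased diversity: n/(n-1) * (1 - sum of squared frequencies)."""
    n = len(haplotypes)
    if n < 2:
        raise ValueError("need at least two haplotypes")
    sum_sq = sum((count / n) ** 2 for count in Counter(haplotypes).values())
    return n / (n - 1) * (1.0 - sum_sq)
```

A sample in which every haplotype is unique gives a diversity of 1, so a value near 0.9999 for 17 Y-STRs indicates that almost all sampled males carried distinct haplotypes.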
Affiliation(s)
- Andréa M Oliveira: DNA Diagnostic Laboratory (LDD), Institute of Biology, State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil
- Patricia M Domingues: DNA Diagnostic Laboratory (LDD), Institute of Biology, State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil
- Verónica Gomes: IPATIMUP - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal
- António Amorim: IPATIMUP - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal; FCUP - Faculty of Sciences of the University of Porto, Porto, Portugal
- Juliana Jannuzzi: DNA Diagnostic Laboratory (LDD), Institute of Biology, State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil
- Elizeu F de Carvalho: DNA Diagnostic Laboratory (LDD), Institute of Biology, State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil
- Leonor Gusmão: DNA Diagnostic Laboratory (LDD), Institute of Biology, State University of Rio de Janeiro (UERJ), Rio de Janeiro, Brazil; IPATIMUP - Institute of Molecular Pathology and Immunology of the University of Porto, Porto, Portugal
32
Abstract
With the advent of modern genomic methods to adjust for population stratification, the use of external or publicly available controls has become an attractive option for reducing the cost of large-scale case-control genetic association studies. In this article, we study the estimation of joint effects of genetic and environmental exposures from a case-control study where data on genome-wide markers are available on the cases and a set of external controls while data on environmental exposures are available on the cases and a set of internal controls. We show that under such a design, one can exploit an assumption of gene-environment independence in the underlying population to estimate the gene-environment joint effects, after adjustment for population stratification. We develop a semiparametric profile likelihood method and related pseudolikelihood and working likelihood methods that are easy to implement in practice. We propose variance estimators for the methods based on asymptotic theory. Simulation is used to study the performance of the methods, and data from a multi-centre genome-wide association study of bladder cancer are further used to illustrate their application.
Affiliation(s)
- Yi-Hau Chen: Institute of Statistical Science, Academia Sinica, Taipei 11529, Taiwan
- Nilanjan Chatterjee: Division of Cancer Epidemiology and Genetics, National Cancer Institute, National Institutes of Health, Department of Health and Human Services, Rockville, Maryland 20852, U.S.A.
- Raymond J. Carroll: Department of Statistics, Texas A&M University, College Station, Texas 77843-3143, U.S.A.
33
Divers J, Redden DT, Carroll RJ, Allison DB. How to estimate the measurement error variance associated with ancestry proportion estimates. Stat Interface 2011; 4:327-337. [PMID: 24089627 PMCID: PMC3786624 DOI: 10.4310/sii.2011.v4.n3.a7] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.2] [Indexed: 06/02/2023]
Abstract
To show how the variance of the measurement error (ME) associated with individual ancestry proportion estimates can be estimated, especially when the number of ancestral populations (k) is greater than 2. We extend existing internal consistency measures to estimate the ME variance, and we compare these estimates with the ME variance estimated by use of the repeated measurement (RM) approach. Both approaches work by dividing the genotyped markers into subsets. We examine the effect of the number of subsets and of the allocation of markers to each subset on the performance of each approach. We used simulated data for all comparisons. Independently of the value of k, the measures of internal reliability provided less biased and more precise estimates of the ME variance than did those obtained with the RM approach. Both methods tend to perform better when a large number of subsets of markers with similar sizes are considered. Our results will facilitate the use of ME correction methods to address the ME problem in individual ancestry proportion estimates. Our method will improve the ability to control for type I error inflation and loss of power in association tests and other genomic research involving ancestry estimates.
Affiliation(s)
- Jasmin Divers: Address correspondence to Jasmin Divers, Section on Statistical Genetics and Bioinformatics, Center for Public Health Genomics, Department of Biostatistical Sciences, Division of Public Health Services, Wake Forest University Health Sciences, WC-2326, Medical Center Blvd., Winston-Salem, North Carolina 27157
34
Chokkalingam AP, Aldrich MC, Bartley K, Hsu LI, Metayer C, Barcellos LF, Wiemels JL, Wiencke JK, Buffler PA, Selvin S. Matching on Race and Ethnicity in Case-Control Studies as a Means of Control for Population Stratification. ACTA ACUST UNITED AC 2011; 1:101. [PMID: 24683503 DOI: 10.4172/2161-1165.1000101] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.7] [Indexed: 11/09/2022]
Abstract
Some investigators argue that controlling for self-reported race or ethnicity, either in statistical analysis or in study design, is sufficient to mitigate unwanted influence from population stratification. In this report, we evaluated the effectiveness of a study design involving matching on self-reported ethnicity and race in minimizing bias due to population stratification within an ethnically admixed population in California. We estimated individual genetic ancestry using structured association methods and a panel of ancestry informative markers, and observed no statistically significant difference in distribution of genetic ancestry between cases and controls (P=0.46). Stratification by Hispanic ethnicity showed similar results. We evaluated potential confounding by genetic ancestry after adjustment for race and ethnicity for 1260 candidate gene SNPs, and found no major impact (>10%) on risk estimates. In conclusion, we found no evidence of confounding of genetic risk estimates by population substructure using this matched design. Our study provides strong evidence supporting the race- and ethnicity-matched case-control study design as an effective approach to minimizing systematic bias due to differences in genetic ancestry between cases and controls.
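The ">10%" criterion above is a change-in-estimate check: a covariate is treated as a meaningful confounder only if adjusting for it moves the risk estimate by more than a chosen threshold. A minimal sketch follows, assuming the change is measured relative to the estimate without the ancestry adjustment; the abstract does not specify the exact denominator, and the function names are hypothetical.

```python
def relative_change(unadjusted_or, adjusted_or):
    """Absolute change in the odds ratio as a fraction of the unadjusted value."""
    return abs(adjusted_or - unadjusted_or) / unadjusted_or

def is_major_impact(unadjusted_or, adjusted_or, threshold=0.10):
    """True when adjustment shifts the odds ratio by more than the threshold."""
    return relative_change(unadjusted_or, adjusted_or) > threshold
```

For example, an odds ratio moving from 1.50 to 1.60 after adjustment is a change of about 6.7% and would not count as a major impact under a 10% threshold.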
Affiliation(s)
- Melinda C Aldrich: Vanderbilt University Medical Center, Vanderbilt University, Nashville, Tennessee
- Karen Bartley: School of Public Health, University of California Berkeley, Berkeley
- Ling-I Hsu: School of Public Health, University of California Berkeley, Berkeley
- Catherine Metayer: School of Public Health, University of California Berkeley, Berkeley
- Lisa F Barcellos: School of Public Health, University of California Berkeley, Berkeley
- Joseph L Wiemels: Department of Epidemiology and Biostatistics, University of California San Francisco, San Francisco, California
- John K Wiencke: Department of Neurological Surgery, University of California San Francisco, San Francisco, California
- Steve Selvin: School of Public Health, University of California Berkeley, Berkeley
35
Sinha S, Black ML, Agarwal S, Colah R, Das R, Ryan K, Bellgard M, Bittles AH. Profiling β-thalassaemia mutations in India at state and regional levels: implications for genetic education, screening and counselling programmes. Hugo J 2010; 3:51-62. [PMID: 21119755 DOI: 10.1007/s11568-010-9132-3] [Citation(s) in RCA: 56] [Impact Index Per Article: 4.0] [Received: 09/28/2009] [Revised: 11/29/2009] [Accepted: 01/20/2010] [Indexed: 11/30/2022]
Abstract
Thalassaemia and sickle cell disease have been recognized by the World Health Organization as important inherited disorders principally impacting on the populations of low income countries. To create a national and regional profile of β-thalassaemia mutations in the population of India, a meta-analysis was conducted on 17 selected studies comprising 8,505 alleles and offering near-national coverage for the disease. At the national level 52 mutations accounted for 97.5% of all β-thalassaemia alleles, with IVSI-5(G>C) the most common disease allele (54.7%). Population stratification was apparent in the mutation profiles at regional level with, for example, the prevalence of IVSI-5(G>C) varying from 44.8% in the North to 71.4% in the East. A number of major mutations, such as Poly A(T>C), were apparently restricted to a particular region of the country, although these findings may in part reflect the variant test protocols adopted by different centres. Given the size and genetic complexity of the Indian population, and with specific mutations for β-thalassaemia known to be strongly associated with individual communities, comprehensive disease registries need to be compiled at state, district and community levels to ensure the efficacy of genetic education, screening and counselling programmes. At the same time, appropriately designed community-based studies are required as a health priority to correct earlier sampling inequities which resulted in the under-representation of many communities, in particular rural and socioeconomically under-privileged groups. Electronic supplementary material: The online version of this article (doi:10.1007/s11568-010-9132-3) contains supplementary material, which is available to authorized users.