1
Frommlet F. A neutral comparison of algorithms to minimize L0 penalties for high-dimensional variable selection. Biom J 2024; 66:e2200207. [PMID: 37421205 DOI: 10.1002/bimj.202200207] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/26/2022] [Revised: 03/09/2023] [Accepted: 04/29/2023] [Indexed: 07/10/2023]
Abstract
Variable selection methods based on L0 penalties have excellent theoretical properties for selecting sparse models in a high-dimensional setting. There exist modifications of the Bayesian Information Criterion (BIC) which control either the familywise error rate (mBIC) or the false discovery rate (mBIC2) in terms of which regressors are selected to enter a model. However, the minimization of L0 penalties comprises a mixed-integer problem which is known to be NP-hard and therefore becomes computationally challenging with increasing numbers of regressor variables. This is one reason why alternatives like the LASSO, which involve convex optimization problems that are easier to solve, have become so popular. The last few years have seen some real progress in developing new algorithms to minimize L0 penalties. The aim of this article is to compare the performance of these algorithms in terms of minimizing L0-based selection criteria. Simulation studies covering a wide range of scenarios that are inspired by genetic association studies are used to compare the values of selection criteria obtained with different algorithms. In addition, some statistical characteristics of the selected models and the runtime of algorithms are compared. Finally, the performance of the algorithms is illustrated in a real data example concerned with expression quantitative trait loci (eQTL) mapping.
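For readers unfamiliar with the criteria named in this abstract, the following is a minimal sketch of how mBIC and mBIC2 are commonly written for linear regression. The abstract itself gives no formulas, so the form used here (n log RSS plus a penalty per selected regressor, with constant c = 4) is an assumption taken from the usual presentation of these criteria, and the function and parameter names are hypothetical.

```python
import math


def mbic(rss, n, k, p, c=4.0):
    """mBIC for a linear model (assumed form, not taken from the abstract).

    rss: residual sum of squares of the candidate model
    n:   number of observations
    k:   number of selected regressors
    p:   total number of candidate regressors
    c:   prior constant; 4 is the value usually quoted (assumption)
    """
    return n * math.log(rss) + k * math.log(n) + 2 * k * math.log(p / c)


def mbic2(rss, n, k, p, c=4.0):
    """mBIC2: mBIC with the penalty relaxed by 2*log(k!) (assumed form)."""
    # lgamma(k + 1) equals log(k!) and avoids overflow for large k
    return mbic(rss, n, k, p, c) - 2 * math.lgamma(k + 1)


# Hypothetical comparison of two candidate models on the same data:
# the model with the smaller criterion value would be preferred.
small_model = mbic2(rss=120.0, n=200, k=2, p=1000)
large_model = mbic2(rss=110.0, n=200, k=5, p=1000)
preferred = "small" if small_model < large_model else "large"
```

Both functions only score a given model; the hard part discussed in the abstract is searching over the 2^p possible models for the minimizer.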
Affiliation(s)
- Florian Frommlet
  - Institute of Medical Statistics, Center for Medical Data Science, Medical University of Vienna, Vienna, Austria
2
Rocha J, Sastre J, Amengual-Cladera E, Hernandez-Rodriguez J, Asensio-Landa V, Heine-Suñer D, Capriotti E. Identification of Driver Epistatic Gene Pairs Combining Germline and Somatic Mutations in Cancer. Int J Mol Sci 2023; 24:ijms24119323. [PMID: 37298272 DOI: 10.3390/ijms24119323] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/23/2023] [Revised: 05/20/2023] [Accepted: 05/22/2023] [Indexed: 06/12/2023]
Abstract
Cancer arises from the complex interplay of various factors. Traditionally, the identification of driver genes focuses primarily on the analysis of somatic mutations. We describe a new method for the detection of driver gene pairs based on an epistasis analysis that considers both germline and somatic variations. Specifically, the identification of significantly mutated gene pairs entails the calculation of a contingency table, wherein one of the co-mutated genes can exhibit a germline variant. By adopting this approach, it is possible to select gene pairs in which the individual genes do not exhibit significant associations with cancer. Finally, a survival analysis is used to select clinically relevant gene pairs. To test the efficacy of the new algorithm, we analyzed the colon adenocarcinoma (COAD) and lung adenocarcinoma (LUAD) samples available at The Cancer Genome Atlas (TCGA). In the analysis of the COAD and LUAD samples, we identify epistatic gene pairs significantly mutated in tumor tissue with respect to normal tissue. We believe that further analysis of the gene pairs detected by our method will yield new biological insights and a better description of cancer mechanisms.
Affiliation(s)
- Jairo Rocha
  - Department of Mathematics and Computer Science, University of the Balearic Islands, 07122 Palma de Majorca, Spain
  - Genomics of Health Group, Health Research Institute of the Balearic Islands (IDISBA), 07120 Palma de Majorca, Spain
- Jaume Sastre
  - Department of Mathematics and Computer Science, University of the Balearic Islands, 07122 Palma de Majorca, Spain
- Emilia Amengual-Cladera
  - Genomics of Health Group, Health Research Institute of the Balearic Islands (IDISBA), 07120 Palma de Majorca, Spain
- Jessica Hernandez-Rodriguez
  - Genomics of Health Group, Health Research Institute of the Balearic Islands (IDISBA), 07120 Palma de Majorca, Spain
- Victor Asensio-Landa
  - Genomics of Health Group, Health Research Institute of the Balearic Islands (IDISBA), 07120 Palma de Majorca, Spain
- Damià Heine-Suñer
  - Genomics of Health Group, Health Research Institute of the Balearic Islands (IDISBA), 07120 Palma de Majorca, Spain
- Emidio Capriotti
  - BioFolD Unit, Department of Pharmacy and Biotechnology (FaBiT), University of Bologna, 40126 Bologna, Italy
3
Williams J, Ferreira MAR, Ji T. BICOSS: Bayesian iterative conditional stochastic search for GWAS. BMC Bioinformatics 2022; 23:475. [DOI: 10.1186/s12859-022-05030-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 06/22/2022] [Accepted: 10/31/2022] [Indexed: 11/15/2022]
Abstract
Background
Single marker analysis (SMA) with linear mixed models for genome wide association studies has uncovered the contribution of genetic variants to many observed phenotypes. However, SMA has weak false discovery control. In addition, when a few variants have large effect sizes, SMA has low statistical power to detect small and medium effect sizes, leading to low recall of true causal single nucleotide polymorphisms (SNPs).
Results
We present the Bayesian Iterative Conditional Stochastic Search (BICOSS) method that controls false discovery rate and increases recall of variants with small and medium effect sizes. BICOSS iterates between a screening step and a Bayesian model selection step. A simulation study shows that, when compared to SMA, BICOSS dramatically reduces false discovery rate and allows for smaller effect sizes to be discovered. Finally, two real world applications show the utility and flexibility of BICOSS.
Conclusions
When compared to widely used SMA, BICOSS provides higher recall of true SNPs while dramatically reducing false discovery rate.
4
Asiimwe IG, Blockman M, Cohen K, Cupido C, Hutchinson C, Jacobson B, Lamorde M, Morgan J, Mouton JP, Nakagaayi D, Okello E, Schapkaitz E, Sekaggya-Wiltshire C, Semakula JR, Waitt C, Zhang EJ, Jorgensen AL, Pirmohamed M. A genome-wide association study of plasma concentrations of warfarin enantiomers and metabolites in sub-Saharan black-African patients. Front Pharmacol 2022; 13:967082. [PMID: 36210801 PMCID: PMC9537548 DOI: 10.3389/fphar.2022.967082] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 06/12/2022] [Accepted: 08/23/2022] [Indexed: 11/26/2022]
Abstract
Diversity in pharmacogenomic studies is poor, especially in relation to the inclusion of black African patients. Lack of funding and difficulties in recruitment, together with the requirement for large sample sizes because of the extensive genetic diversity in Africa, are amongst the factors which have hampered pharmacogenomic studies in Africa. Warfarin is widely used in sub-Saharan Africa, but as in other populations, dosing is highly variable due to genetic and non-genetic factors. In order to identify genetic factors determining warfarin response variability, we have conducted a genome-wide association study (GWAS) of plasma concentrations of warfarin enantiomers/metabolites in sub-Saharan black-Africans. This overcomes the issue of non-adherence and may have greater sensitivity, at the genome-wide level, for identifying pharmacokinetic gene variants than focusing on mean weekly dose, the usual end-point used in previous studies. Participants recruited at 12 outpatient sites in Uganda and South Africa on stable warfarin dose were genotyped using the Illumina Infinium H3Africa Consortium Array v2. Imputation was conducted using the 1000 Genomes Project phase III reference panel. Warfarin/metabolite plasma concentrations were determined by high-performance liquid chromatography with tandem mass spectrometry. Multivariable linear regression was undertaken, with adjustment made for five non-genetic covariates and ten principal components of genetic ancestry. After quality control procedures, 548 participants and 17,268,054 SNPs were retained. CYP2C9*8, CYP2C9*9, CYP2C9*11, and the CYP2C cluster SNP rs12777823 passed the Bonferroni-adjusted replication significance threshold (p < 3.21E-04) for warfarin/metabolite ratios. In an exploratory GWAS analysis, 373 unique SNPs in 13 genes, including CYP2C9*8, passed the Bonferroni-adjusted genome-wide significance threshold (p < 3.846E-9), with 325 (87%, all located on chromosome 10) SNPs being associated with the S-warfarin/R-warfarin outcome (top SNP rs11188082, CYP2C19 intron variant, p = 1.55E-17). Approximately 69% of these SNPs were in linkage disequilibrium (r² > 0.8) with CYP2C9*8 (n = 216) and rs12777823 (n = 8). Using a pharmacokinetic approach, we have shown that variants other than CYP2C9*2 and CYP2C9*3 are more important in sub-Saharan black-Africans, mainly due to the allele frequencies. In exploratory work, we conducted the first warfarin pharmacokinetics-related GWAS in sub-Saharan Africans and identified novel SNPs that will require external replication and functional characterization before they can be considered for inclusion in warfarin dosing algorithms.
Affiliation(s)
- Innocent G. Asiimwe
  - The Wolfson Centre for Personalized Medicine, Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
  - *Correspondence: Innocent G. Asiimwe; Munir Pirmohamed
- Marc Blockman
  - Division of Clinical Pharmacology, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Karen Cohen
  - Division of Clinical Pharmacology, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Clint Cupido
  - Victoria Hospital Internal Medicine Research Initiative, Victoria Hospital Wynberg and Department of Medicine, University of Cape Town, Cape Town, South Africa
- Claire Hutchinson
  - The Wolfson Centre for Personalized Medicine, Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
- Barry Jacobson
  - Department of Molecular Medicine and Haematology, University of the Witwatersrand, Johannesburg, South Africa
- Mohammed Lamorde
  - Infectious Diseases Institute, Makerere University College of Health Sciences, Kampala, Uganda
- Jennie Morgan
  - Metro District Health Services, Western Cape Department of Health, Cape Town, South Africa
- Johannes P. Mouton
  - Division of Clinical Pharmacology, Department of Medicine, University of Cape Town, Cape Town, South Africa
- Elise Schapkaitz
  - Department of Molecular Medicine and Hematology, Charlotte Maxeke Johannesburg Academic Hospital National Health Laboratory System Complex and University of Witwatersrand, Johannesburg, South Africa
- Jerome R. Semakula
  - Infectious Diseases Institute, Makerere University College of Health Sciences, Kampala, Uganda
- Catriona Waitt
  - The Wolfson Centre for Personalized Medicine, Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
  - Infectious Diseases Institute, Makerere University College of Health Sciences, Kampala, Uganda
- Eunice J. Zhang
  - The Wolfson Centre for Personalized Medicine, Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
- Andrea L. Jorgensen
  - Department of Health Data Science, Institute of Population Health Sciences, University of Liverpool, Liverpool, United Kingdom
- Munir Pirmohamed
  - The Wolfson Centre for Personalized Medicine, Department of Pharmacology and Therapeutics, Institute of Systems, Molecular and Integrative Biology, University of Liverpool, Liverpool, United Kingdom
  - *Correspondence: Innocent G. Asiimwe; Munir Pirmohamed
5
Frommlet F, Szulc P, König F, Bogdan M. Selecting predictive biomarkers from genomic data. PLoS One 2022; 17:e0269369. [PMID: 35709188 PMCID: PMC9202896 DOI: 10.1371/journal.pone.0269369] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 04/29/2021] [Accepted: 05/13/2022] [Indexed: 11/18/2022]
Abstract
Recently there have been tremendous efforts to develop statistical procedures that make it possible to identify subgroups of patients for which certain treatments are effective. This article focuses on the selection of prognostic and predictive genetic biomarkers based on a relatively large number of candidate Single Nucleotide Polymorphisms (SNPs). We consider models which include prognostic markers as main effects and predictive markers as interaction effects with treatment. We compare different high-dimensional selection approaches including adaptive lasso, a Bayesian adaptive version of the Sorted L-One Penalized Estimator (SLOBE) and a modified version of the Bayesian Information Criterion (mBIC2). These are compared with classical multiple testing procedures for individual markers. Having identified predictive markers, we consider several different approaches to specifying subgroups susceptible to treatment. Our main conclusion is that selection based on mBIC2 and SLOBE has predictive performance similar to that of the adaptive lasso while including substantially fewer biomarkers.
Affiliation(s)
- Florian Frommlet
  - Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Piotr Szulc
  - Institute of Mathematics, University of Wroclaw, Wroclaw, Poland
- Franz König
  - Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Malgorzata Bogdan
  - Institute of Mathematics, University of Wroclaw, Wroclaw, Poland
  - Department of Statistics, Lund University, Lund, Sweden
6
Hasseb NM, Sallam A, Karam MA, Gao L, Wang RRC, Moursi YS. High-LD SNP markers exhibiting pleiotropic effects on salt tolerance at germination and seedlings stages in spring wheat. Plant Mol Biol 2022; 108:585-603. [PMID: 35217965 PMCID: PMC8967789 DOI: 10.1007/s11103-022-01248-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Received: 10/02/2021] [Accepted: 01/25/2022] [Indexed: 06/01/2023]
Abstract
Salt tolerance at germination and seedling growth stages was investigated. GWAS revealed nine genomic regions with pleiotropic effects on salt tolerance. Salt tolerant genotypes were identified for future breeding programs. With 20% of the irrigated land worldwide affected by it, salinity is a serious threat to plant development and crop production. While wheat is the most stable food source worldwide, it has been classified as moderately tolerant to salinity. In several crop plants, such as barley, maize and rice, it has been shown that salinity tolerance at seed germination and seedling establishment is under polygenic control. As yield was the ultimate goal of breeders and geneticists, less attention has been paid to understanding the genetic architecture of salt tolerance at early stages. Thus, the genetic control of salt tolerance at these stages is poorly understood relative to the late stages. In the current study, 176 genotypes of spring wheat were tested for salinity tolerance at seed germination and seedling establishment. A Genome-Wide Association Study (GWAS) was used to identify the genomic regions/genes conferring salt tolerance at seed germination and seedling establishment. Salinity stress negatively impacted all germination and seedling development parameters. A set of 137 SNPs showed significant association with the traits of interest. Across the whole genome, 33 regions showed high linkage disequilibrium (LD). These high LD regions harbored 15 SNPs with pleiotropic effect (i.e., SNPs that control more than one trait). Nine genes belonging to different functional groups were found to be associated with the pleiotropic SNPs. Notably, chromosome 2B harbored the gene TraesCS2B02G135900, which acts as a potassium transporter. Remarkably, one SNP marker associated with salt tolerance, reported in an earlier study, was validated in this study. Our findings represent potential targets of genetic manipulation to understand and improve salinity tolerance in wheat.
Affiliation(s)
- Nouran M Hasseb
  - Department of Botany, Faculty of Science, Fayoum University, Fayoum, 63514, Egypt
- Ahmed Sallam
  - Department of Genetics, Faculty of Agriculture, Assiut University, Assiut, 71526, Egypt
- Mohamed A Karam
  - Department of Botany, Faculty of Science, Fayoum University, Fayoum, 63514, Egypt
- Liangliang Gao
  - Department of Plant Pathology and Wheat Genetics Resource Center, Kansas State University, Manhattan, KS, 66502, USA
  - Agricultural Genomics Institute at Shenzhen, Chinese Academy of Agricultural Sciences, Shenzhen, Buxin Road 97, Dapeng-District, Shenzhen, 518120, Guangdong, China
- Richard R C Wang
  - USDA-ARS Forage and Range Research Lab, Utah State University, Logan, UT, 84322-6300, USA
- Yasser S Moursi
  - Department of Botany, Faculty of Science, Fayoum University, Fayoum, 63514, Egypt
7
Huang J, Jiao Y, Kang L, Liu J, Liu Y, Lu X. GSDAR: a fast Newton algorithm for ℓ0 regularized generalized linear models with statistical guarantee. Comput Stat 2021. [DOI: 10.1007/s00180-021-01098-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Indexed: 10/21/2022]
8
Renaux C, Buzdugan L, Kalisch M, Bühlmann P. Hierarchical inference for genome-wide association studies: a view on methodology with software. Comput Stat 2020. [DOI: 10.1007/s00180-019-00939-2] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 12/28/2022]
9
Kennedy AE, Ozbek U, Dorak MT. What has GWAS done for HLA and disease associations? Int J Immunogenet 2018; 44:195-211. [PMID: 28877428 DOI: 10.1111/iji.12332] [Citation(s) in RCA: 42] [Impact Index Per Article: 7.0] [Received: 03/09/2017] [Revised: 06/16/2017] [Accepted: 07/20/2017] [Indexed: 12/14/2022]
Abstract
The major histocompatibility complex (MHC) is located in chromosome 6p21 and contains crucial regulators of immune response, including human leucocyte antigen (HLA) genes, alongside other genes with nonimmunological roles. More recently, a repertoire of noncoding RNA genes, including expressed pseudogenes, has also been identified. The MHC is the most gene dense and most polymorphic part of the human genome. The region exhibits haplotype-specific linkage disequilibrium patterns, contains the strongest cis- and trans-eQTLs/meQTLs in the genome and is known as a hot spot for disease associations. Another layer of complexity is provided to the region by the extreme structural variation and copy number variations. While the HLA-B gene has the highest number of alleles, the HLA-DR/DQ subregion is structurally most variable and shows the highest number of disease associations. Reliance on a single reference sequence has complicated the design, execution and analysis of GWAS for the MHC region and not infrequently, the MHC region has even been excluded from the analysis of GWAS data. Here, we contrast features of the MHC region with the rest of the genome and highlight its complexities, including its functional polymorphisms beyond those determined by single nucleotide polymorphisms or single amino acid residues. One of the several issues with customary GWAS analysis is that it does not address this additional layer of polymorphisms unique to the MHC region. We highlight alternative approaches that may assist with the analysis of GWAS data from the MHC region and unravel associations with all functional polymorphisms beyond single SNPs. We suggest that despite already showing the highest number of disease associations, the true extent of the involvement of the MHC region in disease genetics may not have been uncovered.
Affiliation(s)
- A E Kennedy
  - Center for Research Strategy, National Cancer Institute, National Institutes of Health, Bethesda, MD, USA
- U Ozbek
  - Department of Population Health Science and Policy, Tisch Cancer Institute, Icahn School of Medicine at Mount Sinai, New York, NY, USA
  - Department of Population Health Science and Policy, Icahn School of Medicine at Mount Sinai, New York, NY, USA
- M T Dorak
  - Head of School of Life Sciences, Pharmacy and Chemistry, Kingston University London, Kingston-upon-Thames, UK
10
Bayesian and frequentist analysis of an Austrian genome-wide association study of colorectal cancer and advanced adenomas. Oncotarget 2017; 8:98623-98634. [PMID: 29228715 PMCID: PMC5716755 DOI: 10.18632/oncotarget.21697] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.6] [Received: 07/26/2017] [Accepted: 09/03/2017] [Indexed: 12/17/2022]
Abstract
Most genome-wide association studies (GWAS) were analyzed using single marker tests in combination with stringent correction procedures for multiple testing. Thus, a substantial proportion of associated single nucleotide polymorphisms (SNPs) remained undetected and may account for missing heritability in complex traits. Model selection procedures present a powerful alternative to identify associated SNPs in high-dimensional settings. In this GWAS including 1060 colorectal cancer cases, 689 cases of advanced colorectal adenomas and 4367 controls, we pursued a dual approach to investigate genome-wide associations with disease risk, applying both single marker analysis and model selection based on the modified Bayesian information criterion, mBIC2, implemented in the software package MOSGWA. For different case-control comparisons, we report models including between 1 and 14 candidate SNPs. A genome-wide significant association of rs17659990 (P = 5.43×10⁻⁹, DOCK3, chromosome 3p21.2) with colorectal cancer risk was observed. Furthermore, 56 SNPs known to influence susceptibility to colorectal cancer and advanced adenoma were tested in a hypothesis-driven approach and several of them were found to be relevant in our Austrian cohort. After correction for multiple testing (α = 8.9×10⁻⁴), the most significant associations were observed for SNPs rs10505477 (P = 6.08×10⁻⁴) and rs6983267 (P = 7.35×10⁻⁴) of CASC8, rs3802842 (P = 8.98×10⁻⁵, COLCA1,2), and rs12953717 (P = 4.64×10⁻⁴, SMAD7). All previously unreported SNPs demand replication in additional samples. Reanalysis of existing GWAS datasets using model selection as a tool to detect SNPs associated with a complex trait may present a promising resource to identify further genetic risk variants, not only for colorectal cancer.
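The multiple-testing threshold quoted in this abstract can be checked with a short calculation. The sketch below assumes a Bonferroni correction at a family-wise level of 0.05 over the 56 candidate SNPs; the abstract does not state the family-wise level, so 0.05 is an assumption, although 0.05/56 does agree with the quoted value of about 8.9e-4. The function name is hypothetical.

```python
def bonferroni_threshold(family_alpha, n_tests):
    """Per-test significance level under a Bonferroni correction."""
    if n_tests <= 0:
        raise ValueError("n_tests must be positive")
    return family_alpha / n_tests


# Assumption: family-wise level 0.05, divided over the 56 candidate SNPs
alpha = bonferroni_threshold(0.05, 56)

# P values as reported in the abstract for the hypothesis-driven analysis
reported = {
    "rs10505477": 6.08e-4,
    "rs6983267": 7.35e-4,
    "rs3802842": 8.98e-5,
    "rs12953717": 4.64e-4,
}

# SNPs whose reported P value falls below the per-test threshold
significant = sorted(snp for snp, p in reported.items() if p < alpha)
```

Under these assumptions all four reported SNPs fall below the per-test threshold, which is consistent with the abstract listing them as the most significant associations after correction.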
11
Szulc P, Bogdan M, Frommlet F, Tang H. Joint genotype- and ancestry-based genome-wide association studies in admixed populations. Genet Epidemiol 2017; 41:555-566. [PMID: 28657151 DOI: 10.1002/gepi.22056] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 07/25/2016] [Revised: 04/01/2017] [Accepted: 04/25/2017] [Indexed: 12/21/2022]
Abstract
In genome-wide association studies (GWAS) genetic loci that influence complex traits are localized by inspecting associations between genotypes of genetic markers and the values of the trait of interest. On the other hand, admixture mapping, which is performed in case of populations consisting of a recent mix of two ancestral groups, relies on the ancestry information at each locus (locus-specific ancestry). Recently it has been proposed to jointly model genotype and locus-specific ancestry within the framework of single marker tests. Here, we extend this approach for population-based GWAS in the direction of multimarker models. A modified version of the Bayesian information criterion is developed for building a multilocus model that accounts for the differential correlation structure due to linkage disequilibrium (LD) and admixture LD. Simulation studies and a real data example illustrate the advantages of this new approach compared to single-marker analysis or modern model selection strategies based on separately analyzing genotype and ancestry data, as well as to single-marker analysis combining genotypic and ancestry information. Depending on the signal strength, our procedure automatically chooses whether genotypic or locus-specific ancestry markers are added to the model. This results in a good compromise between the power to detect causal mutations and the precision of their localization. The proposed method has been implemented in R and is available at http://www.math.uni.wroc.pl/~mbogdan/admixtures/.
Affiliation(s)
- Piotr Szulc
  - Faculty of Mathematics, Wroclaw University of Technology, Wroclaw, Poland
- Malgorzata Bogdan
  - Faculty of Mathematics and Computer Science, University of Wroclaw, Wroclaw, Poland
- Florian Frommlet
  - Department of Medical Statistics, CEMSIIS, Medical University of Vienna, Vienna, Austria
- Hua Tang
  - Departments of Genetics and Statistics, Stanford University, Stanford, California, United States of America
12
Brzyski D, Peterson CB, Sobczyk P, Candès EJ, Bogdan M, Sabatti C. Controlling the Rate of GWAS False Discoveries. Genetics 2017; 205:61-75. [PMID: 27784720 PMCID: PMC5223524 DOI: 10.1534/genetics.116.193987] [Citation(s) in RCA: 72] [Impact Index Per Article: 10.3] [Received: 07/18/2016] [Accepted: 10/11/2016] [Indexed: 01/13/2023]
Abstract
With the rise of both the number and the complexity of traits of interest, control of the false discovery rate (FDR) in genetic association studies has become an increasingly appealing and accepted target for multiple comparison adjustment. While a number of robust FDR-controlling strategies exist, the nature of this error rate is intimately tied to the precise way in which discoveries are counted, and the performance of FDR-controlling procedures is satisfactory only if there is a one-to-one correspondence between what scientists describe as unique discoveries and the number of rejected hypotheses. The presence of linkage disequilibrium between markers in genome-wide association studies (GWAS) often leads researchers to consider the signal associated to multiple neighboring SNPs as indicating the existence of a single genomic locus with possible influence on the phenotype. This a posteriori aggregation of rejected hypotheses results in inflation of the relevant FDR. We propose a novel approach to FDR control that is based on prescreening to identify the level of resolution of distinct hypotheses. We show how FDR-controlling strategies can be adapted to account for this initial selection both with theoretical results and simulations that mimic the dependence structure to be expected in GWAS. We demonstrate that our approach is versatile and useful when the data are analyzed using both tests based on single markers and multiple regression. We provide an R package that allows practitioners to apply our procedure on standard GWAS format data, and illustrate its performance on lipid traits in the North Finland Birth Cohort 66 cohort study.
Affiliation(s)
- Damian Brzyski
  - Institute of Mathematics, Jagiellonian University, 30-348 Kraków, Poland
  - Department of Epidemiology and Biostatistics, Indiana University, Bloomington, Indiana 47405
- Christine B Peterson
  - Department of Biostatistics, University of Texas MD Anderson Cancer Center, Houston, Texas 77030
- Piotr Sobczyk
  - Faculty of Pure and Applied Mathematics, Wrocław University of Science and Technology, 50-370 Wroclaw, Poland
- Malgorzata Bogdan
  - Institute of Mathematics, University of Wrocław, 50-384 Wroclaw, Poland
- Chiara Sabatti
  - Department of Biomedical Data Science, Stanford University, California
13
Frommlet F, Nuel G. An Adaptive Ridge Procedure for L0 Regularization. PLoS One 2016; 11:e0148620. [PMID: 26849123 PMCID: PMC4743917 DOI: 10.1371/journal.pone.0148620] [Citation(s) in RCA: 31] [Impact Index Per Article: 3.9] [Received: 05/18/2015] [Accepted: 01/21/2016] [Indexed: 11/18/2022]
Abstract
Penalized selection criteria like AIC or BIC are among the most popular methods for variable selection. Their theoretical properties have been studied intensively and are well understood, but making use of them in the case of high-dimensional data is difficult due to the non-convex optimization problem induced by L0 penalties. In this paper we introduce an adaptive ridge procedure (AR), where iteratively weighted ridge problems are solved whose weights are updated in such a way that the procedure converges towards selection with L0 penalties. After introducing AR, its specific shrinkage properties are studied in the particular case of orthogonal linear regression. Based on extensive simulations for the non-orthogonal case as well as for Poisson regression, the performance of AR is studied and compared with SCAD and adaptive LASSO. Furthermore, an efficient implementation of AR in the context of least-squares segmentation is presented. The paper ends with an illustrative example of applying AR to analyze GWAS data.
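The iteration described in this abstract can be illustrated for the orthogonal case it mentions. The sketch below is a minimal illustration and not the authors' implementation: it assumes an orthonormal design (so each weighted ridge problem decouples per coefficient), weights of the form 1/(beta² + delta²), and a selection rule based on weight × beta²; these choices, the default values, and all names are assumptions made for illustration.

```python
def adaptive_ridge_orthogonal(z, lam, delta=1e-5, n_iter=200):
    """Adaptive ridge sketch for an orthonormal design (X^T X = I).

    z:     least-squares estimates X^T y, one per regressor
    lam:   penalty strength
    delta: small constant that keeps the weights finite (assumed value)
    """
    weights = [1.0] * len(z)  # first pass is an ordinary ridge problem
    beta = list(z)
    for _ in range(n_iter):
        # Weighted ridge solution; it decouples because X^T X = I
        beta = [zj / (1.0 + lam * wj) for zj, wj in zip(z, weights)]
        # Weight update: w * beta^2 approaches 1 for coefficients that
        # stay away from zero and 0 for those that are shrunk to zero,
        # so the weighted ridge penalty mimics an L0 count
        weights = [1.0 / (bj * bj + delta * delta) for bj in beta]
    selected = [
        j for j, (bj, wj) in enumerate(zip(beta, weights))
        if wj * bj * bj > 0.5
    ]
    return beta, selected


# Hypothetical input: two strong and two weak signals
beta, selected = adaptive_ridge_orthogonal([5.0, 0.5, -3.0, 1.0], lam=1.0)
```

In this toy setting the weak coefficients collapse to numerically zero while the strong ones converge to slightly shrunken values, which shows the thresholding behaviour the iteration is designed to produce.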
Affiliation(s)
- Florian Frommlet
  - Department of Medical Statistics (CEMSIIS), Medical University of Vienna, Spitalgasse 23, A-1090 Vienna, Austria
- Grégory Nuel
  - National Institute for Mathematical Sciences (INSMI), CNRS, Stochastics and Biology Group (PSB), LPMA UMR CNRS 7599, Université Pierre et Marie Curie, 4 place Jussieu, 75005 Paris, France
14
Cheng L, Wang X, Wong PK, Lee KY, Li L, Xu B, Wang D, Leung KS. ICN: a normalization method for gene expression data considering the over-expression of informative genes. Mol Biosyst 2016; 12:3057-66. [DOI: 10.1039/c6mb00386a] [Citation(s) in RCA: 19] [Impact Index Per Article: 2.4] [Indexed: 12/19/2022]
Abstract
The global increase of gene expression has been frequently established in cancer microarray studies.
Affiliation(s)
- Lixin Cheng
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, China
- Xuan Wang
  - College of Pharmacy, Harbin Medical University, Harbin, China
- Pak-Kan Wong
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, China
- Kwan-Yeung Lee
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, China
- Le Li
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, China
- Bin Xu
  - School of Internet of Things, Nanjing University of Posts and Telecommunications, Nanjing, China
- Dong Wang
  - College of Bioinformatics Science and Technology, Harbin Medical University, Harbin, China
- Kwong-Sak Leung
  - Department of Computer Science and Engineering, The Chinese University of Hong Kong, New Territories, China
15
Widmer C, Lippert C, Weissbrod O, Fusi N, Kadie C, Davidson R, Listgarten J, Heckerman D. Further improvements to linear mixed models for genome-wide association studies. Sci Rep 2014; 4:6874. [PMID: 25387525 PMCID: PMC4230738 DOI: 10.1038/srep06874] [Citation(s) in RCA: 45] [Impact Index Per Article: 4.5] [Received: 08/09/2013] [Accepted: 10/14/2014] [Indexed: 11/09/2022]
Abstract
We examine improvements to the linear mixed model (LMM) that better correct for population structure and family relatedness in genome-wide association studies (GWAS). LMMs rely on the estimation of a genetic similarity matrix (GSM), which encodes the pairwise similarity between every two individuals in a cohort. These similarities are estimated from single nucleotide polymorphisms (SNPs) or other genetic variants. Traditionally, all available SNPs are used to estimate the GSM. In empirical studies across a wide range of synthetic and real data, we find that modifications to this approach improve GWAS performance as measured by type I error control and power. Specifically, when only population structure is present, a GSM constructed from SNPs that well predict the phenotype in combination with principal components as covariates controls type I error and yields more power than the traditional LMM. In any setting, with or without population structure or family relatedness, a GSM consisting of a mixture of two component GSMs, one constructed from all SNPs and another constructed from SNPs that well predict the phenotype again controls type I error and yields more power than the traditional LMM. Software implementing these improvements and the experimental comparisons are available at http://microsoft.com/science.
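To make the notion of a genetic similarity matrix concrete, here is a minimal sketch of one common construction: each SNP column is standardized and the GSM is the average cross-product between individuals. This is an illustrative assumption and not necessarily the construction used in the paper or its software; the function name and the toy genotypes are hypothetical.

```python
def genetic_similarity_matrix(genotypes):
    """Build a GSM from allele counts (one common construction, assumed).

    genotypes: list of individuals, each a list of SNP allele counts (0/1/2)
    Returns an n x n matrix of pairwise similarities.
    """
    n = len(genotypes)
    m = len(genotypes[0])
    columns = []
    for j in range(m):
        col = [row[j] for row in genotypes]
        mean = sum(col) / n
        var = sum((g - mean) ** 2 for g in col) / n
        if var == 0.0:
            continue  # monomorphic SNPs carry no similarity information
        sd = var ** 0.5
        columns.append([(g - mean) / sd for g in col])
    if not columns:
        raise ValueError("no polymorphic SNPs available")
    k = len(columns)
    # Similarity of individuals a and b: mean product of standardized SNPs
    return [
        [sum(c[a] * c[b] for c in columns) / k for b in range(n)]
        for a in range(n)
    ]


# Toy data: 4 individuals, 4 SNPs (the last SNP is monomorphic and skipped)
toy = [[0, 1, 2, 0], [1, 1, 0, 0], [2, 0, 1, 0], [0, 2, 1, 0]]
gsm = genetic_similarity_matrix(toy)
```

Restricting the list of SNP columns to a chosen subset, for example SNPs that predict the phenotype well, would give the kind of selected-SNP GSM the abstract contrasts with the all-SNP version.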
Affiliation(s)
- Christian Widmer
  - eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- Christoph Lippert
  - eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- Omer Weissbrod
  - Computer Science Department, Technion - Israel Institute of Technology, Haifa 32000, Israel
- Nicolo Fusi
  - eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- Carl Kadie
  - eScience Group, Microsoft Research, One Microsoft Way, Redmond, WA, 98052, United States
- Robert Davidson
  - eScience Group, Microsoft Research, One Microsoft Way, Redmond, WA, 98052, United States
- Jennifer Listgarten
  - eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States
- David Heckerman
  - eScience Group, Microsoft Research, 1100 Glendon Avenue, Suite PH1, Los Angeles, CA, 90024, United States