1
|
Epistasis Detection via the Joint Cumulant. Statistics in Biosciences 2022. [DOI: 10.1007/s12561-022-09336-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/19/2022]
|
2
|
Grinberg NF, Orhobor OI, King RD. An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 2019; 109:251-277. [PMID: 32174648 PMCID: PMC7048706 DOI: 10.1007/s10994-019-05848-5] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 10/28/2015] [Revised: 09/17/2019] [Accepted: 09/19/2019] [Indexed: 11/01/2022]
Abstract
In phenotype prediction the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods, namely elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM), with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression, and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present. We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which method is likely to perform well on any given problem is elusive and non-trivial.
Affiliation(s)
- Nastasiya F. Grinberg
  - School of Computer Science, University of Manchester, Oxford Road, Manchester, M13 9PL UK
  - Present Address: Department of Medicine, Cambridge Institute of Therapeutic Immunology & Infectious Disease, Jeffrey Cheah Biomedical Centre, Cambridge Biomedical Campus, University of Cambridge, Cambridge, CB2 0AW UK
- Ross D. King
  - Department of Biology and Biological Engineering, Division of Systems and Synthetic Biology, Chalmers University of Technology, Kemivägen 10, SE-412 96 Gothenburg, Sweden
|
3
|
Romagnoni A, Jégou S, Van Steen K, Wainrib G, Hugot JP. Comparative performances of machine learning methods for classifying Crohn Disease patients using genome-wide genotyping data. Sci Rep 2019; 9:10351. [PMID: 31316157 PMCID: PMC6637191 DOI: 10.1038/s41598-019-46649-z] [Citation(s) in RCA: 53] [Impact Index Per Article: 10.6] [Received: 03/11/2019] [Accepted: 07/03/2019] [Indexed: 02/08/2023] Open
Abstract
Crohn Disease (CD) is a complex genetic disorder for which more than 140 genes have been identified using genome wide association studies (GWAS). However, the genetic architecture of the trait remains largely unknown. The recent development of machine learning (ML) approaches prompted us to apply them to classify healthy and diseased people according to their genomic information. The Immunochip dataset containing 18,227 CD patients and 34,050 healthy controls enrolled and genotyped by the international Inflammatory Bowel Disease genetic consortium (IIBDGC) has been re-analyzed using a set of ML methods: penalized logistic regression (LR), gradient boosted trees (GBT) and artificial neural networks (NN). The main score used to compare the methods was the Area Under the ROC Curve (AUC) statistic. An assessment of the impact of quality control (QC), imputation and coding methods on LR results showed that QC methods and imputation of missing genotypes may artificially increase the scores. In contrast, neither the patient/control ratio nor marker preselection or coding strategies significantly affected the results. LR methods, including Lasso, Ridge and ElasticNet, provided similar results with a maximum AUC of 0.80. GBT methods like XGBoost, LightGBM and CatBoost, together with dense NN with one or more hidden layers, provided similar AUC values, suggesting limited epistatic effects in the genetic architecture of the trait. ML methods detected nearly all the genetic variants previously identified by GWAS among the best predictors, plus additional predictors with lower effects. The robustness and complementarity of the different methods are also studied. Compared to LR, non-linear models such as GBT or NN may provide robust complementary approaches to identify and classify genetic markers.
Affiliation(s)
- Alberto Romagnoni
  - Centre de recherche sur l'inflammation UMR 1149, Inserm - Université Paris Diderot, 75018, Paris, France; Data Team, Département d'informatique de l'ENS, École normale supérieure, CNRS, PSL Research University, 75005, Paris, France
- Kristel Van Steen
  - WELBIO, GIGA-R Medical Genomics - BIO3, University of Liège, Liège, Belgium; Department of Human Genetics, University of Leuven, Leuven, Belgium
- Gilles Wainrib
  - Data Team, Département d'informatique de l'ENS, École normale supérieure, CNRS, PSL Research University, 75005, Paris, France; Owkin, 75011, Paris, France
- Jean-Pierre Hugot
  - Centre de recherche sur l'inflammation UMR 1149, Inserm - Université Paris Diderot, 75018, Paris, France; Hôpital Robert Debré, Assistance Publique-Hôpitaux de Paris, 75019, Paris, France
|
4
|
Boulesteix AL, Wright MN, Hoffmann S, König IR. Statistical learning approaches in the genetic epidemiology of complex diseases. Hum Genet 2019; 139:73-84. [PMID: 31049651 DOI: 10.1007/s00439-019-01996-9] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.8] [Received: 10/30/2018] [Accepted: 03/04/2019] [Indexed: 02/07/2023]
Abstract
In this paper, we give an overview of methodological issues related to the use of statistical learning approaches when analyzing high-dimensional genetic data. The focus is set on regression models and machine learning algorithms taking genetic variables as input and returning a classification or a prediction for the target variable of interest; for example, the present or future disease status, or the future course of a disease. After briefly explaining the basic motivation and principle of these methods, we review different procedures that can be used to evaluate the accuracy of the obtained models and discuss common flaws that may lead to over-optimistic conclusions with respect to their prediction performance and usefulness.
Affiliation(s)
- Anne-Laure Boulesteix
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University, Munich, Germany
- Marvin N Wright
  - Leibniz Institute for Prevention Research and Epidemiology-BIPS, Bremen, Germany; Section of Biostatistics, Department of Public Health, University of Copenhagen, Copenhagen, Denmark
- Sabine Hoffmann
  - Institute for Medical Information Processing, Biometry and Epidemiology, Ludwig-Maximilians-University, Munich, Germany
- Inke R König
  - Institute of Medical Biometry and Statistics, University of Lübeck, Lübeck, Germany
|
5
|
Dorani F, Hu T, Woods MO, Zhai G. Ensemble learning for detecting gene-gene interactions in colorectal cancer. PeerJ 2018; 6:e5854. [PMID: 30397551 PMCID: PMC6211269 DOI: 10.7717/peerj.5854] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.3] [Received: 06/05/2018] [Accepted: 09/28/2018] [Indexed: 11/20/2022] Open
Abstract
Colorectal cancer (CRC) has a high incidence rate in both men and women and affects millions of people every year. Genome-wide association studies (GWAS) on CRC have successfully revealed common single-nucleotide polymorphisms (SNPs) associated with CRC risk. However, they can only explain a very limited fraction of the disease heritability. One reason may be the common uni-variable analyses in GWAS where genetic variants are examined one at a time. Given the complexity of cancers, the non-additive interaction effects among multiple genetic variants have the potential to explain the missing heritability. In this study, we employed two powerful ensemble learning algorithms, random forests and gradient boosting machine (GBM), to search for SNPs that contribute to the disease risk through non-additive gene-gene interactions. We were able to find 44 possible susceptibility SNPs that were ranked most significant by both algorithms. Out of those 44 SNPs, 29 are in coding regions. The 29 genes include ARRDC5, DCC, ALK, and ITGA1, which have been found previously associated with CRC, and E2F3 and NID2, which are potentially related to CRC since they have known associations with other types of cancer. We performed pairwise and three-way interaction analysis on the 44 SNPs using information theoretical techniques and found 17 pairwise (p < 0.02) and 16 three-way (p ≤ 0.001) interactions among them. Moreover, functional enrichment analysis suggested 16 functional terms or biological pathways that may help us better understand the etiology of the disease.
Affiliation(s)
- Faramarz Dorani
  - Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Ting Hu
  - Department of Computer Science, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Michael O Woods
  - Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
- Guangju Zhai
  - Faculty of Medicine, Memorial University, St. John's, Newfoundland and Labrador, Canada
|
6
|
Predictors of surgical site infection after open lower extremity revascularization. J Vasc Surg 2017; 65:1769-1778.e3. [PMID: 28527931 DOI: 10.1016/j.jvs.2016.11.053] [Citation(s) in RCA: 47] [Impact Index Per Article: 6.7] [Received: 09/26/2016] [Accepted: 11/19/2016] [Indexed: 11/23/2022]
Abstract
OBJECTIVE: Surgical site infection (SSI) after open lower extremity bypass (LEB) is a serious complication leading to an increased rate of graft failure, hospital readmission, and health care costs. This study sought to identify predictors of SSI after LEB for arterial occlusive disease and also potential modifiable factors to improve outcomes. METHODS: Data from a statewide cardiovascular consortium of 35 hospitals were used to obtain demographic, procedural, and hospital risk factors for patients undergoing elective or urgent open LEB between January 2012 and June 2015. Bivariate comparisons and targeted maximum likelihood estimation were used to identify independent risk factors of SSI. Adjusted odds ratios (ORs) were calculated for patient demographics, comorbidities, operative details, and hospital-level factors. RESULTS: Our study population included 3033 patients who underwent 703 femoral-femoral bypasses, 1431 femoral-popliteal bypasses, and 899 femoral-distal vessel bypasses. An SSI was diagnosed in 320 patients (10.6%) ≤30 days after the index operation. Adjusted patient and procedural predictors of SSI included renal failure currently requiring dialysis (OR, 4.35; 95% confidence interval [CI], 3.45-5.47; P < .001), hypertension (OR, 4.29; 95% CI, 2.74-6.72; P < .001), body mass index ≥25 kg/m² (OR, 1.78; 95% CI, 1.23-2.57; P = .002), procedural time >240 minutes (OR, 2.95; 95% CI, 1.89-4.62; P < .001), and iodine-only skin preparation (OR, 1.73; 95% CI, 1.02-2.91; P = .04). Hospital factors associated with increased SSI included hospital size <500 beds (OR, 2.22; 95% CI, 1.09-4.55; P = .028) and major teaching hospital (OR, 1.66; 95% CI, 1.07-2.58; P = .024). SSI resulted in increased risk of major amputation and surgical reoperation (P < .01), but did not affect 30-day mortality. CONCLUSIONS: SSI after LEB is associated with an increase in rate of amputation and reoperation. Several patient, operative, and hospital-related risk factors that predict postoperative SSI were identified, suggesting that targeted improvements in perioperative care may decrease complications and improve vascular patient outcomes.
|
7
|
Wright MN, Ziegler A, König IR. Do little interactions get lost in dark random forests? BMC Bioinformatics 2016; 17:145. [PMID: 27029549 PMCID: PMC4815164 DOI: 10.1186/s12859-016-0995-8] [Citation(s) in RCA: 51] [Impact Index Per Article: 6.4] [Received: 01/12/2016] [Accepted: 03/21/2016] [Indexed: 12/16/2022] Open
Abstract
Background: Random forests have often been claimed to uncover interaction effects. However, if and how interaction effects can be differentiated from marginal effects remains unclear. In extensive simulation studies, we investigate whether random forest variable importance measures capture or detect gene-gene interactions. We define capturing an interaction as the ability to identify a variable that acts through an interaction with another one, while detection is the ability to identify an interaction effect as such. Results: Of the single importance measures, the Gini importance captured interaction effects in most of the simulated scenarios; however, they were masked by marginal effects in other variables. With the permutation importance, the proportion of captured interactions was lower in all cases. Pairwise importance measures performed about equally, with a slight advantage for the joint variable importance method. However, the overall fraction of detected interactions was low. In almost all scenarios the detection fraction in a model with only marginal effects was larger than in a model with an interaction effect only. Conclusions: Random forests are generally capable of capturing gene-gene interactions, but current variable importance measures are unable to detect them as interactions. In most of the cases, interactions are masked by marginal effects and interactions cannot be differentiated from marginal effects. Consequently, caution is warranted when claiming that random forests uncover interactions.
Affiliation(s)
- Marvin N Wright
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany
- Andreas Ziegler
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany; Zentrum für Klinische Studien, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany; School of Mathematics, Statistics and Computer Science, University of KwaZulu-Natal, Pietermaritzburg, South Africa
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, 23562, Germany
|
8
|
Grinberg NF, Lovatt A, Hegarty M, Lovatt A, Skøt KP, Kelly R, Blackmore T, Thorogood D, King RD, Armstead I, Powell W, Skøt L. Implementation of Genomic Prediction in Lolium perenne (L.) Breeding Populations. Frontiers in Plant Science 2016; 7:133. [PMID: 26904088 PMCID: PMC4751346 DOI: 10.3389/fpls.2016.00133] [Citation(s) in RCA: 36] [Impact Index Per Article: 4.5] [Received: 09/11/2015] [Accepted: 01/25/2016] [Indexed: 05/23/2023]
Abstract
Perennial ryegrass (Lolium perenne L.) is one of the most widely grown forage grasses in temperate agriculture. In order to maintain and increase its usage as forage in livestock agriculture, there is a continued need for improvement in biomass yield, quality, disease resistance, and seed yield. Genetic gain for traits such as biomass yield has been relatively modest. This has been attributed to its long breeding cycle, and the necessity to use population based breeding methods. Thanks to recent advances in genotyping techniques there is increasing interest in genomic selection (GS), from which genomically estimated breeding values are derived. In this paper we compare the classical RRBLUP model with state-of-the-art machine learning techniques that should lend themselves easily to use in GS and demonstrate their application to predicting quantitative traits in a breeding population of L. perenne. Prediction accuracies varied from 0 to 0.59 depending on trait, prediction model and composition of the training population. The BLUP model produced the highest prediction accuracies for most traits and training populations. Forage quality traits had the highest accuracies compared to yield related traits. There appeared to be no clear pattern to the effect of the training population composition on the prediction accuracies. The heritability of the forage quality traits was generally higher than for the yield related traits, and could partly explain the difference in accuracy. Some population structure was evident in the breeding populations, and probably contributed to the varying effects of training population on the predictions. The average linkage disequilibrium between adjacent markers ranged from 0.121 to 0.215. Higher marker density and a larger training population closely related to the test population are likely to improve the prediction accuracy.
Affiliation(s)
- Alan Lovatt
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Matt Hegarty
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Andi Lovatt
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Kirsten P. Skøt
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Rhys Kelly
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Tina Blackmore
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Danny Thorogood
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Ross D. King
  - Manchester Institute of Biotechnology, University of Manchester, Manchester, UK
- Ian Armstead
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
- Wayne Powell
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
  - CGIAR Consortium, CGIAR Consortium Office, Montpellier, France
- Leif Skøt
  - Institute of Biological, Environmental and Rural Sciences, Aberystwyth University, Aberystwyth, UK
|
9
|
König IR, Auerbach J, Gola D, Held E, Holzinger ER, Legault MA, Sun R, Tintle N, Yang HC. Machine learning and data mining in complex genomic data--a review on the lessons learned in Genetic Analysis Workshop 19. BMC Genet 2016; 17 Suppl 2:1. [PMID: 26866367 PMCID: PMC4895282 DOI: 10.1186/s12863-015-0315-8] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/03/2022] Open
Abstract
In the analysis of current genomic data, application of machine learning and data mining techniques has become more attractive given the rising complexity of the projects. As part of the Genetic Analysis Workshop 19, approaches from this domain were explored, mostly motivated by two starting points. First, assuming an underlying structure in the genomic data, data mining might identify this and thus improve downstream association analyses. Second, computational methods for machine learning need to be developed further to efficiently deal with the current wealth of data. In the course of discussing results and experiences from the machine learning and data mining approaches, six common messages were extracted. These depict the current state of these approaches in the application to complex genomic data. Although some challenges remain for future studies, important forward steps were taken in the integration of different data types and the evaluation of the evidence. Mining the data for underlying genetic or phenotypic structure and using this information in subsequent analyses proved to be extremely helpful and is likely to become of even greater use with more complex data sets.
Affiliation(s)
- Inke R König
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Jonathan Auerbach
  - Department of Statistics, Columbia University, New York, NY, 10027, USA
- Damian Gola
  - Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Lübeck, Germany
- Elizabeth Held
  - Department of Mathematics, Iowa State University, Ames, IA, 50011, USA
- Emily R Holzinger
  - Computational and Statistical Genomics Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD, 21224, USA
- Marc-André Legault
  - Université de Montréal, Faculty of Medicine, 2900 Chemin de la Tour, Montreal, QC, H3T 1N8, Canada
- Rui Sun
  - Division of Biostatistics, School of Public Health and Primary Care, the Chinese University of Hong Kong, Shatin, Hong Kong SAR
- Nathan Tintle
  - Department of Mathematics, Statistics and Computer Science, Dordt College, Sioux Center, IA, 51250, USA
- Hsin-Chou Yang
  - Institute of Statistical Science, Academia Sinica, Nankang 115, Taipei, Taiwan
|
10
|
Mitchell L, Sloan TM, Mewissen M, Ghazal P, Forster T, Piotrowski M, Trew A. Parallel classification and feature selection in microarray data using SPRINT. Concurrency and Computation: Practice & Experience 2014; 26:854-865. [PMID: 24883047 PMCID: PMC4038771 DOI: 10.1002/cpe.2928] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.5] [Indexed: 06/03/2023]
Abstract
The statistical language R is favoured by many biostatisticians for processing microarray data. In recent times, the quantity of data that can be obtained in experiments has risen significantly, making previously fast analyses time consuming or even not possible at all with the existing software infrastructure. High performance computing (HPC) systems offer a solution to these problems but at the expense of increased complexity for the end user. The Simple Parallel R Interface is a library for R that aims to reduce the complexity of using HPC systems by providing biostatisticians with drop-in parallelised replacements of existing R functions. In this paper we describe parallel implementations of two popular techniques: exploratory clustering analyses using the random forest classifier and feature selection through identification of differentially expressed genes using the rank product method.
Affiliation(s)
- Lawrence Mitchell
  - EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
- Terence M Sloan
  - EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
- Muriel Mewissen
  - Division of Pathway Medicine, University of Edinburgh, Medical School, 49 Little France Crescent, Edinburgh, EH16 4SB, UK
- Peter Ghazal
  - Division of Pathway Medicine, University of Edinburgh, Medical School, 49 Little France Crescent, Edinburgh, EH16 4SB, UK
- Thorsten Forster
  - Division of Pathway Medicine, University of Edinburgh, Medical School, 49 Little France Crescent, Edinburgh, EH16 4SB, UK
- Michal Piotrowski
  - EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
- Arthur Trew
  - EPCC, School of Physics and Astronomy, University of Edinburgh, Edinburgh, EH9 3JZ, UK
|
11
|
Jensen TM, Witte DR, Pieragostino D, McGuire JN, Schjerning ED, Nardi C, Urbani A, Kivimäki M, Brunner EJ, Tabàk AG, Vistisen D. Association between protein signals and type 2 diabetes incidence. Acta Diabetol 2013; 50:697-704. [PMID: 22310914 PMCID: PMC4181558 DOI: 10.1007/s00592-012-0376-3] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Received: 11/18/2011] [Accepted: 01/18/2012] [Indexed: 01/04/2023]
Abstract
Understanding early determinants of type 2 diabetes is essential for refining disease prevention strategies. Proteomic technology may provide a useful approach to identify novel protein patterns potentially related to pathophysiological changes that lead up to diabetes. In this study, we sought to identify protein signals that are associated with diabetes incidence in a middle-aged population. Serum samples from 519 participants in a nested case-control selection (167 cases and 352 age-, sex- and BMI-matched normoglycemic control subjects, median follow-up 14.0 years) within the Whitehall-II cohort were analyzed by linear matrix-assisted laser desorption/ionization time-of-flight mass spectrometry (MALDI-TOF-MS). Nine protein peaks were found to be associated with incident diabetes. Rate ratios for high peak intensity ranged between 0.4 (95% CI, 0.2-0.8) and 4.0 (95% CI, 1.7-9.2) and were robust to adjustment for main potential confounders, including obesity, lipids and C-reactive protein. The proteins associated with these peaks may reflect diabetes pathogenesis. Our study exemplifies the utility of an approach that combines proteomic and epidemiological data.
|
12
|
Rose S. Mortality risk score prediction in an elderly population using machine learning. Am J Epidemiol 2013; 177:443-52. [PMID: 23364879 DOI: 10.1093/aje/kws241] [Citation(s) in RCA: 105] [Impact Index Per Article: 9.5] [Indexed: 12/20/2022] Open
Abstract
Standard practice for prediction often relies on parametric regression methods. Interesting new methods from the machine learning literature have been introduced in epidemiologic studies, such as random forest and neural networks. However, a priori, an investigator will not know which algorithm to select and may wish to try several. Here I apply the super learner, an ensembling machine learning approach that combines multiple algorithms into a single algorithm and returns a prediction function with the best cross-validated mean squared error. Super learning is a generalization of stacking methods. I used super learning in the Study of Physical Performance and Age-Related Changes in Sonomans (SPPARCS) to predict death among 2,066 residents of Sonoma, California, aged 54 years or more during the period 1993-1999. The super learner for predicting death (risk score) improved upon all single algorithms in the collection of algorithms, although its performance was similar to that of several algorithms. Super learner outperformed the worst algorithm (neural networks) by 44% with respect to estimated cross-validated mean squared error and had an R² value of 0.201. The improvement of super learner over random forest with respect to R² was approximately 2-fold. Alternatives for risk score prediction include the super learner, which can provide improved performance.
Affiliation(s)
- Sherri Rose
  - Department of Biostatistics, Bloomberg School of Public Health, Johns Hopkins University, Baltimore, MD 21205, USA
|
13
|
Schunkert H, König IR, Erdmann J. Molecular Signatures of Cardiovascular Disease Risk. Mol Diagn Ther 2012; 12:281-7. [DOI: 10.1007/bf03256293] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Indexed: 12/21/2022]
|
14
|
Schwarz DF, König IR, Ziegler A. On safari to Random Jungle: a fast implementation of Random Forests for high-dimensional data. Bioinformatics 2010; 26:1752-8. [PMID: 20505004 DOI: 10.1093/bioinformatics/btq257] [Citation(s) in RCA: 184] [Impact Index Per Article: 13.1] [Indexed: 11/14/2022]
Abstract
MOTIVATION: Genome-wide association (GWA) studies have proven to be a successful approach for helping unravel the genetic basis of complex genetic diseases. However, the identified associations are not well suited for disease prediction, and only a modest portion of the heritability can be explained for most diseases, such as Type 2 diabetes or Crohn's disease. This may partly be due to the low power of standard statistical approaches to detect gene-gene and gene-environment interactions when small marginal effects are present. A promising alternative is Random Forests, which have already been successfully applied in candidate gene analyses. Important single nucleotide polymorphisms are detected by permutation importance measures. To this day, the application to GWA data was highly cumbersome with existing implementations because of the high computational burden. RESULTS: Here, we present the new freely available software package Random Jungle (RJ), which facilitates the rapid analysis of GWA data. The program yields valid results and computes up to 159 times faster than the fastest alternative implementation, while still maintaining all options of other programs. Specifically, it offers the different permutation importance measures available. It includes new options such as the backward elimination method. We illustrate the application of RJ to a GWA of Crohn's disease. The most important single nucleotide polymorphisms (SNPs) validate recent findings in the literature and reveal potential interactions. AVAILABILITY: The RJ software package is freely available at http://www.randomjungle.org
Collapse
Affiliation(s)
- Daniel F Schwarz
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Maria-Goeppert-Strasse 1, 23562 Lübeck, Germany
|
15
|
Ziegler A. Genome-wide association studies: quality control and population-based measures. Genet Epidemiol 2010; 33 Suppl 1:S45-50. [PMID: 19924716 DOI: 10.1002/gepi.20472] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.2] [Indexed: 11/12/2022]
Abstract
Genome-wide association studies, using hundreds of thousands of single-nucleotide polymorphism (SNP) markers, have become a standard approach for identifying disease susceptibility genes. The change in the technology poses substantial computational and statistical challenges that have been addressed in the quality control, imputation, and population-based measure groups of the Genetic Analysis Workshop 16. The computational challenges pertain to efficient memory management and computational speed of the statistical procedures, and we discuss an approach for efficient SNP storage. Accuracy and computational speed are relevant for genotype calling, and the results from a comparison of three calling algorithms are discussed. The first statistical challenge is related to statistical quality control, and we discuss two novel quality control procedures. These low-level analyses have an effect on subsequent preparatory steps for high-level analyses, e.g., the quality of genotype imputation approaches. After the conduct of a genome-wide association study with successful replication and/or validation, measures of diagnostic accuracy, including the area under the curve, are investigated. The area under the curve can be constructed from summary data in some situations. Finally, we discuss how the population-attributable risk of a genetic variant that is only measured in a reference data set can be determined.
Affiliation(s)
- Andreas Ziegler
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Germany.
|
16
|
Tang R, Sinnwell JP, Li J, Rider DN, de Andrade M, Biernacka JM. Identification of genes and haplotypes that predict rheumatoid arthritis using random forests. BMC Proc 2009; 3 Suppl 7:S68. [PMID: 20018062 PMCID: PMC2795969 DOI: 10.1186/1753-6561-3-s7-s68] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Indexed: 11/17/2022]
Abstract
Random forest (RF) analysis of genetic data does not require specification of the mode of inheritance, and provides measures of variable importance that incorporate interaction effects. In this paper we describe RF-based approaches for assessment of gene and haplotype importance, and apply these approaches to a subset of the North American Rheumatoid Arthritis Consortium case-control data provided by Genetic Analysis Workshop 16. The RF analyses of 37 genes identified many of the same genes as logistic regression, but also suggested importance of certain single-nucleotide polymorphisms and genes that were not ranked highly by logistic regression. A new permutation method did not reveal strong evidence of gene-gene interaction effects in these data. Although RFs are a promising approach for genetic data analysis, extensions beyond simple single-nucleotide polymorphism analyses and modifications to improve computational feasibility are needed.
Affiliation(s)
- Rui Tang
- Department of Health Sciences Research, 200 First Street Southwest, Mayo Clinic, Rochester, Minnesota 55905, USA.
|
17
|
Szymczak S, Biernacka JM, Cordell HJ, González-Recio O, König IR, Zhang H, Sun YV. Machine learning in genome-wide association studies. Genet Epidemiol 2009; 33 Suppl 1:S51-7. [PMID: 19924717 DOI: 10.1002/gepi.20473] [Citation(s) in RCA: 103] [Impact Index Per Article: 6.9] [Indexed: 01/28/2023]
Affiliation(s)
- Silke Szymczak
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Lübeck, Germany.
|
18
|
Ziegler A, König IR, Thompson JR. Biostatistical Aspects of Genome-Wide Association Studies. Biom J 2008; 50:8-28. [DOI: 10.1002/bimj.200710398] [Citation(s) in RCA: 113] [Impact Index Per Article: 7.1] [Indexed: 01/17/2023]
|
19
|
König I, Malley J, Pajevic S, Weimar C, Diener HC, Ziegler A. Patient-centered yes/no prognosis using learning machines. Int J Data Min Bioinform 2008; 2:289-341. [PMID: 19216340 PMCID: PMC2754835 DOI: 10.1504/ijdmb.2008.022149] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.5] [Indexed: 11/21/2022]
Abstract
In the last 15 years several machine learning approaches have been developed for classification and regression. In an intuitive manner we introduce the main ideas of classification and regression trees, support vector machines, bagging, boosting and random forests. We discuss differences in the use of machine learning in the biomedical community and the computer sciences. We propose methods for comparing machines on a sound statistical basis. Data from the German Stroke Study Collaboration is used for illustration. We compare the results from learning machines to those obtained by a published logistic regression and discuss similarities and differences.
Affiliation(s)
- I.R. König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany
- J.D. Malley
- Center for Information Technology, National Institutes of Health, Bethesda, MD, USA
- S. Pajevic
- Center for Information Technology, National Institutes of Health, Bethesda, MD, USA
- C. Weimar
- Klinik und Poliklinik für Neurologie, Universität Duisburg-Essen, Germany
- H-C. Diener
- Klinik und Poliklinik für Neurologie, Universität Duisburg-Essen, Germany
- A. Ziegler
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Ratzeburger Allee 160, 23538 Lübeck, Germany
|
20
|
Falk CT, Finch SJ, Kim W, Mukhopadhyay ND, Gong B, Hinrichs A, Li X, Liu X, Malhotra A, Mehta T, Page G, Rao S, Saccone N, Shete S, Yang Y, Yu R, Zhao JH, Zhou X. Data mining of RNA expression and DNA genotype data: presentation group 5 contributions to Genetic Analysis Workshop 15. Genet Epidemiol 2007; 31 Suppl 1:S43-50. [PMID: 18046764 DOI: 10.1002/gepi.20279] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/12/2022]
Abstract
The complexity of data available in human genetics continues to grow at an explosive rate. With that growth, the challenges to understanding the meaning of the underlying information also grow. A currently popular approach to dissecting such information falls under the broad category of data mining. This can apply to any approach that tries to extract relevant information from large amounts of data, but often refers to methods that deal, in a non-linear fashion, with very large numbers of variables that cannot be simultaneously handled by more conventional statistical methods. To explore the usefulness of some of these approaches, 13 groups applied a variety of strategies to the first dataset provided to GAW 15 participants. With the extensive microarray and SNP data provided for 14 CEPH families, these groups explored multistage analyses, machine learning methods, network construction, and other techniques to try to answer questions about gene-gene interaction, functional similarities, co-regulated gene expression and the mapping of gene expression determinants, among others. In general, the methods offered strategies to provide a better understanding of the complex pathways involved in gene expression and function. These are still "works in progress," often exploratory in nature, but they provide insights into ways in which the data might be interpreted. Despite the still preliminary nature of some of these methods and the diversity of the approaches, some common themes emerged. The collection of papers and methods offer a starting point for further exploration of complex interactions in human genetic data now readily available.
|
21
|
de Andrade M, Allen AS. Summary of contributions to GAW15 Group 13: candidate gene association studies. Genet Epidemiol 2007; 31 Suppl 1:S110-7. [DOI: 10.1002/gepi.20287] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 11/09/2022]
|