1. Evaluating the detection ability of a range of epistasis detection methods on simulated data for pure and impure epistatic models. PLoS One 2022; 17:e0263390. [PMID: 35180244 PMCID: PMC8856572 DOI: 10.1371/journal.pone.0263390] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 08/24/2021] [Accepted: 01/18/2022] [Indexed: 11/19/2022] [Open Access]
Abstract
Background: Numerous approaches have been proposed for the detection of epistatic interactions within GWAS datasets in order to better understand the drivers of disease and genetics. Methods: A selection of state-of-the-art approaches was assessed. These included the statistical tests fast-epistasis, BOOST, logistic regression and wtest; swarm intelligence methods, namely AntEpiSeeker, epiACO and CINOEDV; and data mining approaches, including MDR, GSS, SNPRuler and MPI3SNP. Data were simulated to provide randomly generated models with no individual main effects at different heritabilities (pure epistasis) as well as models based on penetrance tables with some main effects (impure epistasis). Detection of both two- and three-locus interactions was assessed across a total of 1,560 simulated datasets. The different methods were also applied to a section of the UK Biobank cohort for atrial fibrillation. Results: For pure two-locus interactions, PLINK’s implementation of BOOST recovered the highest number of correct interactions (53.9%), performing significantly better than the other methods (p = 4.52e-36). For impure two-locus interactions, MDR exhibited the best performance, recovering 62.2% of the most significant impure epistatic interactions (p = 6.31e-90 for all but one test). The assessment of three-locus interaction prediction revealed that wtest recovered the highest number (17.2%) of pure epistatic interactions (p = 8.49e-14). wtest also recovered the highest number of three-locus impure epistatic interactions (p = 6.76e-48), while AntEpiSeeker ranked as the most significant the highest number of such interactions (40.5%). Finally, the methods were applied to a real dataset for atrial fibrillation, most notably finding an interaction between SYNE2 and DTNB.
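The distinction this abstract draws between pure and impure epistasis can be illustrated with a small penetrance-table check. The sketch below is not the simulation procedure used in the study: the two tables, the genotype frequencies (Hardy-Weinberg proportions at a minor allele frequency of 0.5) and all function names are illustrative assumptions. A two-locus model is treated as pure when every single-locus marginal penetrance is identical, so neither locus shows a main effect on its own.

```python
# Illustrative sketch only: tables, frequencies and names are assumptions,
# not taken from the cited study.

def marginal_penetrances(penetrance, genotype_freqs):
    """Return (row_marginals, col_marginals) for a square two-locus table.

    penetrance[i][j] is P(disease | genotype i at locus A, genotype j at locus B).
    genotype_freqs[k] is the population frequency of genotype k at either locus.
    """
    n = len(genotype_freqs)
    rows = [sum(genotype_freqs[j] * penetrance[i][j] for j in range(n)) for i in range(n)]
    cols = [sum(genotype_freqs[i] * penetrance[i][j] for i in range(n)) for j in range(n)]
    return rows, cols


def has_no_main_effects(penetrance, genotype_freqs, tol=1e-12):
    """True when all single-locus marginal penetrances are equal (pure model)."""
    rows, cols = marginal_penetrances(penetrance, genotype_freqs)
    values = rows + cols
    return max(values) - min(values) <= tol


# Genotype frequencies for 0, 1 and 2 copies of the minor allele at MAF = 0.5.
HWE_FREQS = [0.25, 0.5, 0.25]

# Hypothetical pure model: risk depends only on the genotype combination.
PURE_TABLE = [
    [0.0, 0.0, 1.0],
    [0.0, 0.5, 0.0],
    [1.0, 0.0, 0.0],
]

# Hypothetical impure model: carrying minor alleles at either locus raises risk.
IMPURE_TABLE = [
    [0.0, 0.0, 0.0],
    [0.0, 0.5, 0.5],
    [0.0, 0.5, 1.0],
]

if __name__ == "__main__":
    print(has_no_main_effects(PURE_TABLE, HWE_FREQS))
    print(has_no_main_effects(IMPURE_TABLE, HWE_FREQS))
```

Under these assumed frequencies every marginal of the first table equals 0.25, which is why single-locus tests have nothing to detect in such a model, whereas the second table leaves a marginal signal at each locus.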
2. Nodzenski M, Shi M, Krahn JM, Wise AS, Li Y, Li L, Umbach DM, Weinberg CR. GADGETS: a genetic algorithm for detecting epistasis using nuclear families. Bioinformatics 2022; 38:1052-1058. [PMID: 34788792 PMCID: PMC10060691 DOI: 10.1093/bioinformatics/btab766] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/26/2021] [Revised: 10/08/2021] [Accepted: 11/03/2021] [Indexed: 02/03/2023] [Open Access]
Abstract
MOTIVATION Epistasis may play an etiologic role in complex diseases, but research has been hindered because identification of interactions among sets of single nucleotide polymorphisms (SNPs) requires exploration of immense search spaces. Current approaches using nuclear families accommodate at most several hundred candidate SNPs. RESULTS GADGETS detects epistatic SNP-sets by applying a genetic algorithm to case-parent or case-sibling data. To allow for multiple epistatic sets, island subpopulations of SNP-sets evolve separately under selection for evident joint relevance to disease risk. The software evaluates the identified SNP-sets via permutation testing and provides graphical visualization. GADGETS correctly identified epistatic SNP-sets in realistically simulated case-parent triads with 10 000 candidate SNPs, far more SNPs than competitors can handle, and it outperformed competitors in simulations with many fewer SNPs. Applying GADGETS to family-based oral-clefting data from dbGaP identified SNP-sets with possible epistatic effects on risk. AVAILABILITY AND IMPLEMENTATION GADGETS is part of the epistasisGA package at https://github.com/mnodzenski/epistasisGA. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Michael Nodzenski: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- Min Shi: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- Juno M Krahn: Genome Integrity and Structural Biology Laboratory, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- Alison S Wise: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- Yuanyuan Li: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- Leping Li: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- David M Umbach: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
- Clarice R Weinberg: Biostatistics and Computational Biology Branch, National Institute of Environmental Health Sciences, NIH, Research Triangle Park, NC 27709, USA
3. Li X, Shi W, Zhang R, Zhang S, Hou W, Wu Y, Lu R, Feng Y, Tian J, Sun L. Integrate Molecular Phenome and Polygenic Interaction to Detect the Genetic Risk of Ischemic Stroke. Front Cell Dev Biol 2020; 8:453. [PMID: 32671063 PMCID: PMC7326764 DOI: 10.3389/fcell.2020.00453] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 03/18/2020] [Accepted: 05/15/2020] [Indexed: 12/02/2022] [Open Access]
Abstract
Ischemic stroke (IS) is one of the leading causes of death, and its genetic risk is continually being estimated through association studies relating single nucleotide polymorphisms (SNPs) to phenotype. However, the systematic assessment of IS risk still requires the accumulation of molecular phenotype and function data at the omics level. In this study, we integrated the IS phenome, polygenic interaction, gene expression and molecular function to screen for risk genes and molecular functions. We then performed a case-control study including 507 cases and 503 controls to verify the genetic association between the candidate functional genes and the IS phenotype in a northern Chinese Han population. Mediation analysis revealed that blood pressure, high density lipoprotein (HDL) and glucose mediated the potential effect of SOCS1, CD137, ALOX5AP, RNLS, and KALRN in IS, both for the functional analysis and genetic association. SNP-SNP interaction analysis using the multifactor dimensionality reduction (MDR) approach also showed a combined effect on IS risk. Further interaction network and gene ontology (GO) enrichment analysis suggested that CD137 and KALRN, which function in inflammation, could play an expanded role during the pathogenesis and progression of IS. The present study opens a new avenue to evaluate the underlying mechanisms and biomarkers of IS through integrating multiple omics information.
Affiliation(s)
- Xiaoying Li: Department of Ultrasound, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Weilin Shi: Department of Physical Diagnosis, The Fourth Affiliated Hospital of Heilongjiang University of Traditional Chinese Medicine, Harbin, China
- Ruyou Zhang: Department of Neurosurgery, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Shuang Zhang: Department of Ultrasound, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Wenying Hou: Department of Ultrasound, Xuanwu Hospital Capital Medical University, Beijing, China
- Yingnan Wu: Department of Ultrasound, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Rui Lu: Department of Ultrasound, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Yanan Feng: Department of Ultrasound, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Jiawei Tian: Department of Ultrasound, The Second Affiliated Hospital of Harbin Medical University, Harbin, China
- Litao Sun: Department of Ultrasound, Shenzhen University General Hospital, Shenzhen, China
4. Ansarifar J, Wang L. New algorithms for detecting multi-effect and multi-way epistatic interactions. Bioinformatics 2019; 35:5078-5085. [DOI: 10.1093/bioinformatics/btz463] [Citation(s) in RCA: 12] [Impact Index Per Article: 2.4] [Received: 01/02/2019] [Revised: 04/14/2019] [Accepted: 05/31/2019] [Indexed: 11/14/2022] [Open Access]
Abstract
Motivation: Epistasis, which is the phenomenon of genetic interactions, plays a central role in many scientific discoveries. However, due to the combinatorial nature of the problem, it is extremely challenging to decipher the exact combinations of genes that trigger the epistatic effects. Many existing methods only focus on two-way interactions. Some of the most effective methods used machine learning techniques, but many were designed for special case-and-control studies or suffer from overfitting. We propose three new algorithms for multi-effect and multi-way epistases detection, with one guaranteeing global optimality and the other two being local optimization oriented heuristics. Results: The computational performance of the proposed heuristic algorithm was compared with several state-of-the-art methods using a yeast dataset. Results suggested that searching for the global optimal solution could be extremely time consuming, but the proposed heuristic algorithm was much more effective and efficient than others at finding a close-to-optimal solution. Moreover, it was able to provide biological insight on the exact configurations of epistases, besides achieving a higher prediction accuracy than the state-of-the-art methods. Availability and implementation: Data source was publicly available and details are provided in the text.
5. Vivian‐Griffiths T, Baker E, Schmidt KM, Bracher‐Smith M, Walters J, Artemiou A, Holmans P, O'Donovan MC, Owen MJ, Pocklington A, Escott‐Price V. Predictive modeling of schizophrenia from genomic data: Comparison of polygenic risk score with kernel support vector machines approach. Am J Med Genet B Neuropsychiatr Genet 2019; 180:80-85. [PMID: 30516002 PMCID: PMC6492016 DOI: 10.1002/ajmg.b.32705] [Citation(s) in RCA: 23] [Impact Index Per Article: 4.6] [Received: 07/20/2018] [Revised: 09/03/2018] [Accepted: 11/09/2018] [Indexed: 11/07/2022]
Abstract
A major controversy in psychiatric genetics is whether nonadditive genetic interaction effects contribute to the risk of highly polygenic disorders. We applied a support vector machines (SVMs) approach, which is capable of building linear and nonlinear models using kernel methods, to classify cases from controls in a large schizophrenia case-control sample of 11,853 subjects (5,554 cases and 6,299 controls) and compared its prediction accuracy with the polygenic risk score (PRS) approach. We also investigated whether SVMs are a suitable approach to detecting nonlinear genetic effects, that is, interactions. We found that PRS provided more accurate case/control classification than either linear or nonlinear SVMs, and give a tentative explanation why PRS outperforms both multivariate regression and linear kernel SVMs. In addition, we observe that nonlinear kernel SVMs showed higher classification accuracy than linear SVMs when a large number of SNPs are entered into the model. We conclude that SVMs are a potential tool for assessing the presence of interactions, prior to searching for them explicitly.
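For readers unfamiliar with the polygenic risk score compared against SVMs here, the basic quantity is an additive weighted sum of risk-allele counts. The following is a minimal sketch under stated assumptions, not the scoring pipeline used in the study: the function name and the example weights are hypothetical, and in practice the weights would typically be per-SNP effect estimates (for example log odds ratios) taken from an independent GWAS.

```python
# Minimal sketch of an additive polygenic risk score.
# Function name and example values are hypothetical.

def polygenic_risk_score(allele_counts, effect_sizes):
    """Sum of risk-allele counts (0, 1 or 2 per SNP) weighted by per-SNP effects."""
    if len(allele_counts) != len(effect_sizes):
        raise ValueError("allele_counts and effect_sizes must have the same length")
    return sum(count * weight for count, weight in zip(allele_counts, effect_sizes))


if __name__ == "__main__":
    # One individual genotyped at three SNPs, with made-up effect sizes.
    print(polygenic_risk_score([0, 1, 2], [0.5, -0.25, 1.0]))
```

Because the score is linear in the allele counts, it cannot by itself represent interaction terms, which is the contrast with nonlinear kernel SVMs that the abstract examines.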
Affiliation(s)
- Timothy Vivian‐Griffiths: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Emily Baker: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Karl M. Schmidt: School of Mathematics, Cardiff University, Cardiff, United Kingdom
- Matthew Bracher‐Smith: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- James Walters: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Peter Holmans: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Michael C. O'Donovan: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Michael J. Owen: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Andrew Pocklington: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
- Valentina Escott‐Price: Medical Research Council Centre for Neuropsychiatric Genetics and Genomics, Division of Psychological Medicine and Clinical Neurosciences, Cardiff University, Cardiff, United Kingdom
6. Uppu S, Krishna A. A deep hybrid model to detect multi-locus interacting SNPs in the presence of noise. Int J Med Inform 2018; 119:134-151. [DOI: 10.1016/j.ijmedinf.2018.09.003] [Citation(s) in RCA: 9] [Impact Index Per Article: 1.5] [Received: 10/26/2017] [Revised: 04/13/2018] [Accepted: 09/03/2018] [Indexed: 01/17/2023]
7. Choi S, Lee S, Kim Y, Hwang H, Park T. HisCoM-GGI: Hierarchical structural component analysis of gene-gene interactions. J Bioinform Comput Biol 2018; 16:1840026. [PMID: 30567476 DOI: 10.1142/s0219720018400267] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Indexed: 12/23/2022]
Abstract
Although genome-wide association studies (GWAS) have successfully identified thousands of single nucleotide polymorphisms (SNPs) associated with common diseases, these findings fall short of fully explaining the "missing heritability". Determining gene-gene interactions (GGI) is one possible avenue for addressing the missing heritability problem. While many statistical approaches have been proposed to detect GGI, most of these focus primarily on SNP-to-SNP interactions. Although gene-based GGI analyses have many advantages, such as reducing the burden of multiple-testing correction and increasing power by aggregating multiple causal signals across SNPs in specific genes, only a few methods are available. In this study, we proposed a new statistical approach for gene-based GGI analysis, "Hierarchical structural CoMponent analysis of Gene-Gene Interactions" (HisCoM-GGI). HisCoM-GGI is based on generalized structured component analysis, and can consider hierarchical structural relationships between genes and SNPs. For a pair of genes, HisCoM-GGI first effectively summarizes all possible pairwise SNP-SNP interactions into a latent variable, from which it then performs GGI analysis. HisCoM-GGI can evaluate both gene-level and SNP-level interactions. Through simulation studies, HisCoM-GGI demonstrated higher statistical power than existing gene-based GGI methods. In analyzing a GWAS of a Korean population to identify GGI associated with body mass index, HisCoM-GGI identified 14 potential GGI, two of which, (NCOR2 × SPOCK1) and (LINGO2 × ZNF385D), were successfully replicated in independent datasets. We conclude that the HisCoM-GGI method may be a valuable tool for identifying GGI that contribute to missing heritability, allowing us to better understand the biological genetic mechanisms of complex traits. An implementation of HisCoM-GGI can be downloaded from the website (http://statgen.snu.ac.kr/software/hiscom-ggi).
Affiliation(s)
- Sungkyoung Choi: Department of Pharmacology, Yonsei University College of Medicine, 50-1 Yonsei-ro Seodaemun-gu, Seoul 03722, Republic of Korea
- Sungyoung Lee: Center for Precision Medicine, Seoul National University Hospital, 71 Daehak-ro Jongno-gu, Seoul 03082, Republic of Korea
- Yongkang Kim: Department of Statistics, Seoul National University, 1 Gwanak-ro Gwanak-gu, Seoul 08826, Republic of Korea; Department of Psychology, McGill University, 2001 Avenue McGill College, Montreal, Quebec H3A 1G1, Canada
- Heungsun Hwang: Department of Psychology, McGill University, 2001 Avenue McGill College, Montreal, Quebec H3A 1G1, Canada
- Taesung Park: Department of Statistics, Seoul National University, 1 Gwanak-ro Gwanak-gu, Seoul 08826, Republic of Korea; Interdisciplinary Program in Bioinformatics, Seoul National University, 1 Gwanak-ro Gwanak-gu, Seoul 08826, Republic of Korea
8. Wang JH, Chen YH. Overlapping group screening for detection of gene-gene interactions: application to gene expression profiles with survival trait. BMC Bioinformatics 2018; 19:335. [PMID: 30241463 PMCID: PMC6150983 DOI: 10.1186/s12859-018-2372-2] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 03/29/2018] [Accepted: 09/12/2018] [Indexed: 01/29/2023] [Open Access]
Abstract
Background The development of a disease is a complex process that may result from joint effects of multiple genes. In this article, we propose the overlapping group screening (OGS) approach to determining active genes and gene-gene interactions incorporating prior pathway information. The OGS method is developed to overcome the challenges in genome-wide data analysis that the number of the genes and gene-gene interactions is far greater than the sample size, and the pathways generally overlap with one another. The OGS method is further proposed for patients’ survival prediction based on gene expression data. Results Simulation studies demonstrate that the performance of the OGS approach in identifying the true main and interaction effects is good and the survival prediction accuracy of OGS with the Lasso penalty is better than the ordinary Lasso method. In real data analysis, we identify several significant genes and/or epistasis interactions that are associated with clinical survival outcomes of diffuse large B-cell lymphoma (DLBCL) and non-small-cell lung cancer (NSCLC) by utilizing prior pathway information from the KEGG pathway and the GO biological process databases, respectively. Conclusions The OGS approach is useful for selecting important genes and epistasis interactions in the ultra-high dimensional feature space. The prediction ability of OGS with the Lasso penalty is better than existing methods. The OGS approach is generally applicable to various types of outcome data (quantitative, qualitative, censored event time data) and regression models (e.g. linear, logistic, and Cox’s regression models). Electronic supplementary material The online version of this article (10.1186/s12859-018-2372-2) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Jie-Huei Wang: Institute of Statistical Science, Academia Sinica, Nankang, Taipei, Taiwan
- Yi-Hau Chen: Institute of Statistical Science, Academia Sinica, Nankang, Taipei, Taiwan
9. Fang YH, Wang JH, Hsiung CA. TSGSIS: a high-dimensional grouped variable selection approach for detection of whole-genome SNP-SNP interactions. Bioinformatics 2018. [PMID: 28651334 DOI: 10.1093/bioinformatics/btx409] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 11/13/2022] [Open Access]
Abstract
Motivation: Identification of single nucleotide polymorphism (SNP) interactions is an important and challenging topic in genome-wide association studies (GWAS). Many approaches have been applied to detecting whole-genome interactions. However, these approaches to interaction analysis tend to miss causal interaction effects when the individual marginal effects are uncorrelated with the trait while their interaction effects are highly associated with it. Results: A grouped variable selection technique, called two-stage grouped sure independence screening (TS-GSIS), is developed to study interactions that may not have marginal effects. The proposed TS-GSIS is shown to be very helpful in identifying not only causal SNP effects that are uncorrelated with the trait but also their corresponding SNP-SNP interaction effects. The benefits of TS-GSIS are a gain in the detection of interaction effects, obtained by using the joint information among the SNPs, and the determination of the size of candidate sets in the model. Simulation studies under various scenarios are performed to compare the performance of TS-GSIS and current approaches. We also apply our approach to a real rheumatoid arthritis (RA) dataset. Both the simulation and real data studies show that TS-GSIS performs very well in detecting SNP-SNP interactions. Availability and implementation: The R package is delivered through CRAN and is available at: https://cran.r-project.org/web/packages/TSGSIS/index.html. Contact: hsiung@nhri.org.tw. Supplementary information: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Yao-Hwei Fang: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
- Jie-Huei Wang: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
- Chao A Hsiung: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Zhunan 35053, Taiwan
10. Uppu S, Krishna A, Gopalan RP. A Review on Methods for Detecting SNP Interactions in High-Dimensional Genomic Data. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:599-612. [PMID: 28060710 DOI: 10.1109/tcbb.2016.2635125] [Citation(s) in RCA: 20] [Impact Index Per Article: 3.3] [Indexed: 06/06/2023]
Abstract
In this era of genome-wide association studies (GWAS), the quest for understanding the genetic architecture of complex diseases is rapidly increasing more than ever before. The development of high throughput genotyping and next generation sequencing technologies enables genetic epidemiological analysis of large scale data. These advances have led to the identification of a number of single nucleotide polymorphisms (SNPs) responsible for disease susceptibility. The interactions between SNPs associated with complex diseases are increasingly being explored in the current literature. These interaction studies are mathematically challenging and computationally complex. These challenges have been addressed by a number of data mining and machine learning approaches. This paper reviews the current methods and the related software packages to detect the SNP interactions that contribute to diseases. The issues that need to be considered when developing these models are addressed in this review. The paper also reviews the achievements in data simulation to evaluate the performance of these models. Further, it discusses the future of SNP interaction analysis.
11. Abo Alchamlat S, Farnir F. KNN-MDR: a learning approach for improving interactions mapping performances in genome wide association studies. BMC Bioinformatics 2017; 18:184. [PMID: 28327091 PMCID: PMC5361736 DOI: 10.1186/s12859-017-1599-7] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 04/06/2016] [Accepted: 03/11/2017] [Indexed: 12/30/2022] [Open Access]
Abstract
Background: Finding epistatic interactions in large association studies such as genome-wide association studies (GWAS), given the large volume of genomic data now available, is a challenging and largely unsolved issue. Few previous studies could handle genome-wide data because of the intractable difficulties met in searching a combinatorially explosive search space and statistically evaluating epistatic interactions given a limited number of samples. Our work is a contribution to this field. We propose a novel approach combining K-Nearest Neighbors (KNN) and Multifactor Dimensionality Reduction (MDR) methods for detecting gene-gene interactions as a possible alternative to existing algorithms, especially in situations where the number of involved determinants is high. After describing the approach, we compare our method (KNN-MDR) with a set of the best-performing other methods (i.e., MDR, BOOST, BHIT, MegaSNPHunter and AntEpiSeeker) for detecting interactions, using simulated data as well as real genome-wide data. Results: Experimental results on both simulated data and real genome-wide data show that KNN-MDR has interesting properties in terms of accuracy and power, and that, in many cases, it significantly outperforms its recent competitors. Conclusions: The presented methodology (KNN-MDR) is valuable in the context of loci and interactions mapping and can be seen as an interesting addition to the arsenal used in complex traits analyses. Electronic supplementary material: The online version of this article (doi:10.1186/s12859-017-1599-7) contains supplementary material, which is available to authorized users.
Affiliation(s)
- Sinan Abo Alchamlat: Department of Biostatistics, Faculty of Veterinary Medicine, FARAH, University of Liège, Sart Tilman B43, 4000, Liege, Belgium
- Frédéric Farnir: Department of Biostatistics, Faculty of Veterinary Medicine, FARAH, University of Liège, Sart Tilman B43, 4000, Liege, Belgium
13. Software for detecting gene-gene interactions in genome wide association studies. Biotechnol Bioproc E 2015. [DOI: 10.1007/s12257-015-0064-6] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/23/2022]
14. Gola D, Mahachie John JM, van Steen K, König IR. A roadmap to multifactor dimensionality reduction methods. Brief Bioinform 2015; 17:293-308. [PMID: 26108231 PMCID: PMC4793893 DOI: 10.1093/bib/bbv038] [Citation(s) in RCA: 56] [Impact Index Per Article: 6.2] [Received: 03/12/2015] [Indexed: 02/02/2023] [Open Access]
Abstract
Complex diseases are defined to be determined by multiple genetic and environmental factors alone as well as in interactions. To analyze interactions in genetic data, many statistical methods have been suggested, with most of them relying on statistical regression models. Given the known limitations of classical methods, approaches from the machine-learning community have also become attractive. From this latter family, a fast-growing collection of methods emerged that are based on the Multifactor Dimensionality Reduction (MDR) approach. Since its first introduction, MDR has enjoyed great popularity in applications and has been extended and modified multiple times. Based on a literature search, we here provide a systematic and comprehensive overview of these suggested methods. The methods are described in detail, and the availability of implementations is listed. Most recent approaches offer to deal with large-scale data sets and rare variants, which is why we expect these methods to even gain in popularity.
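The core idea behind the MDR family reviewed here, pooling multi-locus genotype combinations into "high-risk" and "low-risk" groups, can be sketched in a few lines. This is a simplified illustration and not any published implementation: it handles a single SNP pair, labels a genotype cell high-risk when its case:control ratio exceeds the overall ratio, and reports in-sample balanced accuracy. Full MDR procedures also search over SNP combinations and use cross-validation, both omitted here. All function names and the toy data are assumptions.

```python
from collections import Counter

# Simplified sketch of the MDR pooling step for one SNP pair.
# Names and toy data are hypothetical; no cross-validation is performed.

def mdr_high_risk_cells(snp_a, snp_b, labels):
    """Return the set of (genotype_a, genotype_b) cells labelled high-risk.

    labels use 1 for cases and 0 for controls.
    """
    cases, controls = Counter(), Counter()
    for a, b, y in zip(snp_a, snp_b, labels):
        if y == 1:
            cases[(a, b)] += 1
        else:
            controls[(a, b)] += 1
    n_cases, n_controls = sum(cases.values()), sum(controls.values())
    high_risk = set()
    for cell in set(cases) | set(controls):
        # Cross-multiplied form of cases/controls > n_cases/n_controls,
        # which avoids dividing by zero for cells without controls.
        if cases[cell] * n_controls > controls[cell] * n_cases:
            high_risk.add(cell)
    return high_risk


def mdr_balanced_accuracy(snp_a, snp_b, labels):
    """In-sample balanced accuracy of the high/low-risk rule.

    Assumes the data contain at least one case and one control.
    """
    high_risk = mdr_high_risk_cells(snp_a, snp_b, labels)
    tp = fn = tn = fp = 0
    for a, b, y in zip(snp_a, snp_b, labels):
        predicted_case = (a, b) in high_risk
        if y == 1 and predicted_case:
            tp += 1
        elif y == 1:
            fn += 1
        elif predicted_case:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return (sensitivity + specificity) / 2


if __name__ == "__main__":
    # Toy XOR-like pattern: disease only when exactly one SNP carries the variant.
    snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
    snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
    labels = [0, 1, 1, 0, 0, 1, 1, 0]
    print(mdr_high_risk_cells(snp_a, snp_b, labels))
    print(mdr_balanced_accuracy(snp_a, snp_b, labels))
```

Collapsing the genotype table into a single binary attribute is what reduces the dimensionality; the extensions surveyed in the review mainly vary how cells are labelled and how the resulting classifier is evaluated.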
15. A review for detecting gene-gene interactions using machine learning methods in genetic epidemiology. BioMed Research International 2013; 2013:432375. [PMID: 24228248 PMCID: PMC3818807 DOI: 10.1155/2013/432375] [Citation(s) in RCA: 44] [Impact Index Per Article: 4.0] [Received: 07/17/2013] [Revised: 08/26/2013] [Accepted: 08/27/2013] [Indexed: 01/04/2023]
Abstract
Recently, the greatest statistical computational challenge in genetic epidemiology has been to identify and characterize the genes that interact with other genes and environmental factors to influence complex multifactorial disease. These gene-gene interactions are also referred to as epistasis, a phenomenon that cannot be resolved by traditional statistical methods because of the high dimensionality of the data and the occurrence of multiple polymorphisms. Hence, several machine learning methods, namely neural networks (NNs), support vector machines (SVMs), and random forests (RFs), have been used to address such problems by identifying susceptibility genes in common, multifactorial diseases. This paper gives an overview of machine learning methods, describing the methodology of each method and its application in detecting gene-gene and gene-environment interactions. Lastly, this paper discusses each machine learning method and presents its strengths and weaknesses in detecting gene-gene interactions in complex human disease.
16. Fang YH, Chiu YF. A novel support vector machine-based approach for rare variant detection. PLoS One 2013; 8:e71114. [PMID: 23940698 PMCID: PMC3737136 DOI: 10.1371/journal.pone.0071114] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.3] [Received: 04/11/2013] [Accepted: 06/24/2013] [Indexed: 01/06/2023] [Open Access]
Abstract
Advances in next-generation sequencing technologies have enabled the identification of multiple rare single nucleotide polymorphisms involved in diseases or traits. Several strategies for identifying rare variants that contribute to disease susceptibility have recently been proposed. An important feature of many of these statistical methods is the pooling or collapsing of multiple rare single nucleotide variants to achieve a reasonably high frequency and effect. However, if the pooled rare variants are associated with the trait in different directions, then the pooling may weaken the signal, thereby reducing its statistical power. In the present paper, we propose a backward support vector machine (BSVM)-based variant selection procedure to identify informative disease-associated rare variants. In the selection procedure, the rare variants are weighted and collapsed according to their positive or negative associations with the disease, which may be associated with common variants and rare variants with protective, deleterious, or neutral effects. This nonparametric variant selection procedure is able to account for confounding factors and can also be adopted in other regression frameworks. The results of a simulation study and a data example show that the proposed BSVM approach is more powerful than four other approaches under the considered scenarios, while maintaining valid type I errors.
Affiliation(s)
- Yao-Hwei Fang: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC
- Yen-Feng Chiu: Division of Biostatistics and Bioinformatics, Institute of Population Health Sciences, National Health Research Institutes, Miaoli County, Taiwan, ROC