1
|
Mielniczuk J. Information Theoretic Methods for Variable Selection-A Review. ENTROPY (BASEL, SWITZERLAND) 2022; 24:1079. [PMID: 36010742 PMCID: PMC9407310 DOI: 10.3390/e24081079] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Subscribe] [Scholar Register] [Received: 06/24/2022] [Revised: 08/02/2022] [Accepted: 08/02/2022] [Indexed: 02/05/2023]
Abstract
We review the principal information theoretic tools and their use for feature selection, with the main emphasis on classification problems with discrete features. Since it is known that empirical versions of conditional mutual information perform poorly for high-dimensional problems, we focus on various ways of constructing its counterparts and the properties and limitations of such methods. We present a unified way of constructing such measures based on truncation, or truncation and weighing, for the Möbius expansion of conditional mutual information. We also discuss the main approaches to feature selection which apply the introduced measures of conditional dependence, together with the ways of assessing the quality of the obtained vector of predictors. This involves discussion of recent results on asymptotic distributions of empirical counterparts of criteria, as well as advances in resampling.
Collapse
Affiliation(s)
- Jan Mielniczuk
- Institute of Computer Science, Polish Academy of Sciences, Jana Kazimierza 5, 01-248 Warsaw, Poland;
- Faculty of Mathematics and Information Science, Warsaw University of Technology, Koszykowa 75, 00-662 Warsaw, Poland
| |
Collapse
|
2
|
Kunert-Graf JM, Sakhanenko NA, Galas DJ. Optimized permutation testing for information theoretic measures of multi-gene interactions. BMC Bioinformatics 2021; 22:180. [PMID: 33827420 PMCID: PMC8028212 DOI: 10.1186/s12859-021-04107-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/01/2020] [Accepted: 03/29/2021] [Indexed: 11/17/2022] Open
Abstract
Background Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 103 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples making it suitable for datasets with large number of samples. Conclusions The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
Collapse
Affiliation(s)
- James M Kunert-Graf
- Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA.
| | | | - David J Galas
- Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
| |
Collapse
|
3
|
Kubkowski M, Mielniczuk J. Asymptotic Distributions of Empirical Interaction Information. Methodol Comput Appl Probab 2021. [DOI: 10.1007/s11009-020-09783-0] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
AbstractInteraction Information is one of the most promising interaction strength measures with many desirable properties. However, its use for interaction detection was hindered by the fact that apart from the simple case of overall independence, asymptotic distribution of its estimate has not been known. In the paper we provide asymptotic distributions of its empirical versions which are needed for formal testing of interactions. We prove that for three-dimensional nominal vector normalized empirical interaction information converges to the normal law unless the distribution coincides with its Kirkwood approximation. In the opposite case the convergence is to the distribution of weighted centred chi square random variables. This case is of special importance as it roughly corresponds to interaction information being zero and the asymptotic distribution can be used for construction of formal tests for interaction detection. The result generalizes result in Han (Inf Control 46(1):26–45 1980) for the case when all coordinate random variables are independent. The derivation relies on studying structure of covariance matrix of asymptotic distribution and its eigenvalues. For the case of 3 × 3 × 2 contingency table corresponding to study of two interacting Single Nucleotide Polymorphisms (SNPs) for prediction of binary outcome, we provide complete description of the asymptotic law and construct approximate critical regions for testing of interactions when two SNPs are possibly dependent. We show in numerical experiments that the test based on the derived asymptotic distribution is easy to implement and yields actual significance levels consistently closer to the nominal ones than the test based on chi square reference distribution.
Collapse
|
4
|
Zhou X, Chan KCC, Huang Z, Wang J. Determining dependency and redundancy for identifying gene-gene interaction associated with complex disease. J Bioinform Comput Biol 2020; 18:2050035. [PMID: 33064052 DOI: 10.1142/s0219720020500353] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022]
Abstract
As interactions among genetic variants in different genes can be an important factor for predicting complex diseases, many computational methods have been proposed to detect if a particular set of genes has interaction with a particular complex disease. However, even though many such methods have been shown to be useful, they can be made more effective if the properties of gene-gene interactions can be better understood. Towards this goal, we have attempted to uncover patterns in gene-gene interactions and the patterns reveal an interesting property that can be reflected in an inequality that describes the relationship between two genotype variables and a disease-status variable. We show, in this paper, that this inequality can be generalized to [Formula: see text] genotype variables. Based on this inequality, we establish a conditional independence and redundancy (CIR)-based definition of gene-gene interaction and the concept of an interaction group. From these new definitions, a novel measure of gene-gene interaction is then derived. We discuss the properties of these concepts and explain how they can be used in a novel algorithm to detect high-order gene-gene interactions. Experimental results using both simulated and real datasets show that the proposed method can be very promising.
Collapse
Affiliation(s)
- Xiangdong Zhou
- College of Mathematics and Computer Science, Fuzhou University Fuzhou, Fujian 350108, P. R. China
| | - Keith C C Chan
- Department of Computing, The Hong Kong Polytechnic University, Kowloon, Hong Kong, P. R. China
| | - Zhihua Huang
- College of Mathematics and Computer Science, Fuzhou University Fuzhou, Fujian 350108, P. R. China
| | - Jingbin Wang
- College of Mathematics and Computer Science, Fuzhou University Fuzhou, Fujian 350108, P. R. China
| |
Collapse
|
5
|
Chanda P, Costa E, Hu J, Sukumar S, Van Hemert J, Walia R. Information Theory in Computational Biology: Where We Stand Today. ENTROPY (BASEL, SWITZERLAND) 2020; 22:E627. [PMID: 33286399 PMCID: PMC7517167 DOI: 10.3390/e22060627] [Citation(s) in RCA: 22] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 04/30/2020] [Revised: 05/31/2020] [Accepted: 06/03/2020] [Indexed: 12/30/2022]
Abstract
"A Mathematical Theory of Communication" was published in 1948 by Claude Shannon to address the problems in the field of data compression and communication over (noisy) communication channels. Since then, the concepts and ideas developed in Shannon's work have formed the basis of information theory, a cornerstone of statistical learning and inference, and has been playing a key role in disciplines such as physics and thermodynamics, probability and statistics, computational sciences and biological sciences. In this article we review the basic information theory based concepts and describe their key applications in multiple major areas of research in computational biology-gene expression and transcriptomics, alignment-free sequence comparison, sequencing and error correction, genome-wide disease-gene association mapping, metabolic networks and metabolomics, and protein sequence, structure and interaction analysis.
Collapse
Affiliation(s)
- Pritam Chanda
- Corteva Agriscience™, Indianapolis, IN 46268, USA
- Computer and Information Science, Indiana University-Purdue University, Indianapolis, IN 46202, USA
| | - Eduardo Costa
- Corteva Agriscience™, Mogi Mirim, Sao Paulo 13801-540, Brazil
| | - Jie Hu
- Corteva Agriscience™, Indianapolis, IN 46268, USA
| | | | | | - Rasna Walia
- Corteva Agriscience™, Johnston, IA 50131, USA
| |
Collapse
|
6
|
Testing the Significance of Interactions in Genetic Studies Using Interaction Information and Resampling Technique. LECTURE NOTES IN COMPUTER SCIENCE 2020. [PMCID: PMC7304020 DOI: 10.1007/978-3-030-50420-5_38] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/01/2022]
Abstract
Interaction information is a model-free, non-parametric measure used for detection of interaction among variables. It frequently finds interactions which remain undetected by standard model-based methods. However in the previous studies application of interaction information was limited by lack of appropriate statistical tests. We study a challenging problem of testing the positiveness of interaction information which allows to confirm the statistical significance of the investigated interactions. It turns out that commonly used chi-squared test detects too many spurious interactions when the dependence between the variables (e.g. between two genetic markers) is strong. To overcome this problem we consider permutation test and also propose a novel HYBRID method that combines permutation and chi-squared tests and takes into account dependence between studied variables. We show in numerical experiments that, in contrast to chi-squared based test, the proposed method controls well the actual significance level and in many situations detects interactions which are undetected by standard methods. Moreover HYBRID method outperforms permutation test with respect to power and computational efficiency. The method is applied to find interactions among Single Nucleotide Polymorphisms as well as among gene expression levels of human immune cells.
Collapse
|
7
|
Abstract
Genome-wide association studies are moving to genome-wide interaction studies, as the genetic background of many diseases appears to be more complex than previously supposed. Thus, many statistical approaches have been proposed to detect gene-gene (GxG) interactions, among them numerous information theory-based methods, inspired by the concept of entropy. These are suggested as particularly powerful and, because of their nonlinearity, as better able to capture nonlinear relationships between genetic variants and/or variables. However, the introduced entropy-based estimators differ to a surprising extent in their construction and even with respect to the basic definition of interactions. Also, not every entropy-based measure for interaction is accompanied by a proper statistical test. To shed light on this, a systematic review of the literature is presented answering the following questions: (1) How are GxG interactions defined within the framework of information theory? (2) Which entropy-based test statistics are available? (3) Which underlying distribution do the test statistics follow? (4) What are the given strengths and limitations of these test statistics?
Collapse
Affiliation(s)
| | - Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, Germany
- Corresponding author. Inke R. Konig, Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany. Tel.: ++49 451 500 50610; Fax: ++49 451 500 50604; E-Mail:
| |
Collapse
|
8
|
Mielniczuk J, Teisseyre P. A deeper look at two concepts of measuring gene-gene interactions: logistic regression and interaction information revisited. Genet Epidemiol 2017; 42:187-200. [PMID: 29265411 DOI: 10.1002/gepi.22108] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/11/2017] [Revised: 10/23/2017] [Accepted: 11/15/2017] [Indexed: 11/09/2022]
Abstract
Detection of gene-gene interactions is one of the most important challenges in genome-wide case-control studies. Besides traditional logistic regression analysis, recently the entropy-based methods attracted a significant attention. Among entropy-based methods, interaction information is one of the most promising measures having many desirable properties. Although both logistic regression and interaction information have been used in several genome-wide association studies, the relationship between them has not been thoroughly investigated theoretically. The present paper attempts to fill this gap. We show that although certain connections between the two methods exist, in general they refer two different concepts of dependence and looking for interactions in those two senses leads to different approaches to interaction detection. We introduce ordering between interaction measures and specify conditions for independent and dependent genes under which interaction information is more discriminative measure than logistic regression. Moreover, we show that for so-called perfect distributions those measures are equivalent. The numerical experiments illustrate the theoretical findings indicating that interaction information and its modified version are more universal tools for detecting various types of interaction than logistic regression and linkage disequilibrium measures.
Collapse
Affiliation(s)
- Jan Mielniczuk
- Institute of Computer Science, Polish Academy of Sciences, Poland.,Faculty of Mathematics and Information Science, Warsaw University of Technology, Poland
| | - Paweł Teisseyre
- Institute of Computer Science, Polish Academy of Sciences, Poland
| |
Collapse
|
9
|
Sherwin WB, Chao A, Jost L, Smouse PE. Information Theory Broadens the Spectrum of Molecular Ecology and Evolution. Trends Ecol Evol 2017; 32:948-963. [PMID: 29126564 DOI: 10.1016/j.tree.2017.09.012] [Citation(s) in RCA: 54] [Impact Index Per Article: 6.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/06/2017] [Revised: 09/22/2017] [Accepted: 09/26/2017] [Indexed: 01/18/2023]
Abstract
Information or entropy analysis of diversity is used extensively in community ecology, and has recently been exploited for prediction and analysis in molecular ecology and evolution. Information measures belong to a spectrum (or q profile) of measures whose contrasting properties provide a rich summary of diversity, including allelic richness (q=0), Shannon information (q=1), and heterozygosity (q=2). We present the merits of information measures for describing and forecasting molecular variation within and among groups, comparing forecasts with data, and evaluating underlying processes such as dispersal. Importantly, information measures directly link causal processes and divergence outcomes, have straightforward relationship to allele frequency differences (including monotonicity that q=2 lacks), and show additivity across hierarchical layers such as ecology, behaviour, cellular processes, and nongenetic inheritance.
Collapse
Affiliation(s)
- W B Sherwin
- Evolution and Ecology Research Centre, School of Biological Earth and Environmental Science, University of New South Wales, Sydney, NSW 2052, Australia; Murdoch University Cetacean Research Unit, Murdoch University, South Road, Murdoch, WA 6150, Australia.
| | - A Chao
- Institute of Statistics, National Tsing Hua University, Hsin-Chu 30043, Taiwan
| | - L Jost
- EcoMinga Foundation, Via a Runtun, Baños, Tungurahua, Ecuador
| | - P E Smouse
- Department of Ecology, Evolution and Natural Resources, School of Environmental and Biological Sciences, Rutgers University, New Brunswick, NJ 08901-8551, USA
| |
Collapse
|
10
|
Use of Information Measures and Their Approximations to Detect Predictive Gene-Gene Interaction. ENTROPY 2017. [DOI: 10.3390/e19010023] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/11/2022]
|
11
|
CINOEDV: a co-information based method for detecting and visualizing n-order epistatic interactions. BMC Bioinformatics 2016; 17:214. [PMID: 27184783 PMCID: PMC4869388 DOI: 10.1186/s12859-016-1076-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/26/2015] [Accepted: 05/07/2016] [Indexed: 12/13/2022] Open
Abstract
BACKGROUND Detecting and visualizing nonlinear interaction effects of single nucleotide polymorphisms (SNPs) or epistatic interactions are important topics in bioinformatics since they play an important role in unraveling the mystery of "missing heritability". However, related studies are almost limited to pairwise epistatic interactions due to their methodological and computational challenges. RESULTS We develop CINOEDV (Co-Information based N-Order Epistasis Detector and Visualizer) for the detection and visualization of epistatic interactions of their orders from 1 to n (n ≥ 2). CINOEDV is composed of two stages, namely, detecting stage and visualizing stage. In detecting stage, co-information based measures are employed to quantify association effects of n-order SNP combinations to the phenotype, and two types of search strategies are introduced to identify n-order epistatic interactions: an exhaustive search and a particle swarm optimization based search. In visualizing stage, all detected n-order epistatic interactions are used to construct a hypergraph, where a real vertex represents the main effect of a SNP and a virtual vertex denotes the interaction effect of an n-order epistatic interaction. By deeply analyzing the constructed hypergraph, some hidden clues for better understanding the underlying genetic architecture of complex diseases could be revealed. CONCLUSIONS Experiments of CINOEDV and its comparison with existing state-of-the-art methods are performed on both simulation data sets and a real data set of age-related macular degeneration. Results demonstrate that CINOEDV is promising in detecting and visualizing n-order epistatic interactions. CINOEDV is implemented in R and is freely available from R CRAN: http://cran.r-project.org and https://sourceforge.net/projects/cinoedv/files/ .
Collapse
|
12
|
An Improved Opposition-Based Learning Particle Swarm Optimization for the Detection of SNP-SNP Interactions. BIOMED RESEARCH INTERNATIONAL 2015; 2015:524821. [PMID: 26236727 PMCID: PMC4509494 DOI: 10.1155/2015/524821] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 09/25/2014] [Revised: 12/30/2014] [Accepted: 01/02/2015] [Indexed: 12/22/2022]
Abstract
SNP-SNP interactions have been receiving increasing attention in understanding the mechanism underlying susceptibility to complex diseases. Though many works have been done for the detection of SNP-SNP interactions, the algorithmic development is still ongoing. In this study, an improved opposition-based learning particle swarm optimization (IOBLPSO) is proposed for the detection of SNP-SNP interactions. Highlights of IOBLPSO are the introduction of three strategies, namely, opposition-based learning, dynamic inertia weight, and a postprocedure. Opposition-based learning not only enhances the global explorative ability, but also avoids premature convergence. Dynamic inertia weight allows particles to cover a wider search space when the considered SNP is likely to be a random one and converges on promising regions of the search space while capturing a highly suspected SNP. The postprocedure is used to carry out a deep search in highly suspected SNP sets. Experiments of IOBLPSO are performed on both simulation data sets and a real data set of age-related macular degeneration, results of which demonstrate that IOBLPSO is promising in detecting SNP-SNP interactions. IOBLPSO might be an alternative to existing methods for detecting SNP-SNP interactions.
Collapse
|
13
|
Gusareva ES, Van Steen K. Practical aspects of genome-wide association interaction analysis. Hum Genet 2014; 133:1343-58. [DOI: 10.1007/s00439-014-1480-y] [Citation(s) in RCA: 26] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/21/2014] [Accepted: 08/18/2014] [Indexed: 12/31/2022]
|
14
|
Discovering pair-wise genetic interactions: an information theory-based approach. PLoS One 2014; 9:e92310. [PMID: 24670935 PMCID: PMC3966778 DOI: 10.1371/journal.pone.0092310] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/24/2013] [Accepted: 02/20/2014] [Indexed: 11/19/2022] Open
Abstract
Phenotypic variation, including that which underlies health and disease in humans, results in part from multiple interactions among both genetic variation and environmental factors. While diseases or phenotypes caused by single gene variants can be identified by established association methods and family-based approaches, complex phenotypic traits resulting from multi-gene interactions remain very difficult to characterize. Here we describe a new method based on information theory, and demonstrate how it improves on previous approaches to identifying genetic interactions, including both synthetic and modifier kinds of interactions. We apply our measure, called interaction distance, to previously analyzed data sets of yeast sporulation efficiency, lipid related mouse data and several human disease models to characterize the method. We show how the interaction distance can reveal novel gene interaction candidates in experimental and simulated data sets, and outperforms other measures in several circumstances. The method also allows us to optimize case/control sample composition for clinical studies.
Collapse
|
15
|
SYMPHONY, an information-theoretic method for gene-gene and gene-environment interaction analysis of disease syndromes. Heredity (Edinb) 2013; 110:548-59. [PMID: 23423149 DOI: 10.1038/hdy.2012.123] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/15/2022] Open
Abstract
We develop an information-theoretic method for gene-gene (GGI) and gene-environmental interactions (GEI) analysis of syndromes, defined as a phenotype vector comprising multiple quantitative traits (QTs). The K-way interaction information (KWII), an information-theoretic metric, was derived for multivariate normal distributed phenotype vectors. The utility of the method was challenged with three simulated data sets, the Genetic Association Workshop-15 (GAW15) rheumatoid arthritis data set, a high-density lipoprotein (HDL) and atherosclerosis data set from a mouse QT locus study, and the 1000 Genomes data. The dependence of the KWII on effect size, minor allele frequency, linkage disequilibrium, population stratification/admixture, as well as the power and computational time requirements of the novel method was systematically assessed in simulation studies. In these studies, phenotype vectors containing two and three constituent multivariate normally distributed QTs were used and the KWII was found to be effective at detecting GEI associated with the phenotype. High KWII values were observed for variables and variable combinations associated with the syndrome phenotype compared with uninformative variables not associated with the phenotype. The KWII values for the phenotype-associated combinations increased monotonically with increasing effect size values. The KWII also exhibited utility in simulations with non-linear dependence between the constituent QTs. Analysis of the HDL and atherosclerosis data set indicated that the simultaneous analysis of both phenotypes identified interactions not detected in the analysis of the individual traits. The information-theoretic approach may be useful for non-parametric analysis of GGI and GEI of complex syndromes.
Collapse
|
16
|
Hu T, Chen Y, Kiralis JW, Collins RL, Wejse C, Sirugo G, Williams SM, Moore JH. An information-gain approach to detecting three-way epistatic interactions in genetic association studies. J Am Med Inform Assoc 2013; 20:630-6. [PMID: 23396514 PMCID: PMC3721169 DOI: 10.1136/amiajnl-2012-001525] [Citation(s) in RCA: 51] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/01/2023] Open
Abstract
Background Epistasis has been historically used to describe the phenomenon that the effect of a given gene on a phenotype can be dependent on one or more other genes, and is an essential element for understanding the association between genetic and phenotypic variations. Quantifying epistasis of orders higher than two is very challenging due to both the computational complexity of enumerating all possible combinations in genome-wide data and the lack of efficient and effective methodologies. Objectives In this study, we propose a fast, non-parametric, and model-free measure for three-way epistasis. Methods Such a measure is based on information gain, and is able to separate all lower order effects from pure three-way epistasis. Results Our method was verified on synthetic data and applied to real data from a candidate-gene study of tuberculosis in a West African population. In the tuberculosis data, we found a statistically significant pure three-way epistatic interaction effect that was stronger than any lower-order associations. Conclusion Our study provides a methodological basis for detecting and characterizing high-order gene-gene interactions in genetic association studies.
Collapse
Affiliation(s)
- Ting Hu
- Computational Genetics Laboratory, Geisel School of Medicine, Dartmouth College, Hanover, New Hampshire, USA
| | | | | | | | | | | | | | | |
Collapse
|
17
|
Vertical Integration of Pharmacogenetics in Population PK/PD Modeling: A Novel Information Theoretic Method. CPT-PHARMACOMETRICS & SYSTEMS PHARMACOLOGY 2013. [PMCID: PMC3600754 DOI: 10.1038/psp.2012.25] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 11/09/2022]
Abstract
To critically evaluate an information-theoretic method for identifying gene–environmental interactions (GEI) associated with pharmacokinetic (PK), pharmacodynamic (PD), and clinical outcomes from genome-wide pharmacogenetic data. Our approach, which is built on the K-way interaction information (KWII) metric, was challenged with simulated data and clinical PK/PD data sets from the International Warfarin Pharmacogenetics Consortium (IWPC) and a gemcitabine clinical trial. The KWII efficiently identified both novel and known interactions for warfarin and gemcitabine. Interactions between herbal supplementation and VKORC1 genotype were associated with warfarin response. For gemcitabine-associated neutropenia, combination treatment with carboplatin and cytidine deaminase (CDA) 208G→A genotypes were identified as risk factors. Gemcitabine disposition was associated with drug metabolism–transporter interactions between deoxycytidine kinase (DCK) and the equilibrative nucleoside transporter (ENT). This novel approach is effective for detecting GEI involved in drug exposure and response and could enable integration of genome-wide pharmacogenetic data into the population PK/PD analysis paradigm.
Collapse
|
18
|
Aschard H, Lutz S, Maus B, Duell EJ, Fingerlin TE, Chatterjee N, Kraft P, Van Steen K. Challenges and opportunities in genome-wide environmental interaction (GWEI) studies. Hum Genet 2012; 131:1591-613. [PMID: 22760307 DOI: 10.1007/s00439-012-1192-0] [Citation(s) in RCA: 107] [Impact Index Per Article: 8.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2012] [Accepted: 06/11/2012] [Indexed: 02/03/2023]
Abstract
The interest in performing gene-environment interaction studies has seen a significant increase with the increase of advanced molecular genetics techniques. Practically, it became possible to investigate the role of environmental factors in disease risk and hence to investigate their role as genetic effect modifiers. The understanding that genetics is important in the uptake and metabolism of toxic substances is an example of how genetic profiles can modify important environmental risk factors to disease. Several rationales exist to set up gene-environment interaction studies and the technical challenges related to these studies-when the number of environmental or genetic risk factors is relatively small-has been described before. In the post-genomic era, it is now possible to study thousands of genes and their interaction with the environment. This brings along a whole range of new challenges and opportunities. Despite a continuing effort in developing efficient methods and optimal bioinformatics infrastructures to deal with the available wealth of data, the challenge remains how to best present and analyze genome-wide environmental interaction (GWEI) studies involving multiple genetic and environmental factors. Since GWEIs are performed at the intersection of statistical genetics, bioinformatics and epidemiology, usually similar problems need to be dealt with as for genome-wide association gene-gene interaction studies. However, additional complexities need to be considered which are typical for large-scale epidemiological studies, but are also related to "joining" two heterogeneous types of data in explaining complex disease trait variation or for prediction purposes.
Collapse
Affiliation(s)
- Hugues Aschard
- Department of Epidemiology, Harvard School of Public Health, Boston, MA, USA.
| | | | | | | | | | | | | | | |
Collapse
|
19
|
Wang Y, Liu G, Feng M, Wong L. An empirical comparison of several recent epistatic interaction detection methods. Bioinformatics 2011; 27:2936-43. [DOI: 10.1093/bioinformatics/btr512] [Citation(s) in RCA: 51] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/12/2022] Open
|
20
|
Van Steen K. Perspectives on genome-wide multi-stage family-based association studies. Stat Med 2011; 30:2201-21. [DOI: 10.1002/sim.4259] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/16/2010] [Accepted: 03/07/2011] [Indexed: 01/03/2023]
|
21
|
Abstract
Over the last few years, main effect genetic association analysis has proven to be a successful tool to unravel genetic risk components to a variety of complex diseases. In the quest for disease susceptibility factors and the search for the 'missing heritability', supplementary and complementary efforts have been undertaken. These include the inclusion of several genetic inheritance assumptions in model development, the consideration of different sources of information, and the acknowledgement of disease underlying pathways of networks. The search for epistasis or gene-gene interaction effects on traits of interest is marked by an exponential growth, not only in terms of methodological development, but also in terms of practical applications, translation of statistical epistasis to biological epistasis and integration of omics information sources. The current popularity of the field, as well as its attraction to interdisciplinary teams, each making valuable contributions with sometimes rather unique viewpoints, renders it impossible to give an exhaustive review of to-date available approaches for epistasis screening. The purpose of this work is to give a perspective view on a selection of currently active analysis strategies and concerns in the context of epistasis detection, and to provide an eye to the future of gene-gene interaction analysis.
Collapse
Affiliation(s)
- Kristel Van Steen
- Department of Electrical Engineering and Computer Science (Montefiore Institute), Grande Traverse, Bioinformatique 4000 Liège 1, Belgium.
| |
Collapse
|
22
|
Modeling of environmental and genetic interactions with AMBROSIA, an information-theoretic model synthesis method. Heredity (Edinb) 2011; 107:320-7. [PMID: 21427755 DOI: 10.1038/hdy.2011.18] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/08/2022] Open
Abstract
To develop a model synthesis method for parsimoniously modeling gene-environmental interactions (GEI) associated with clinical outcomes and phenotypes. The AMBROSIA model synthesis approach utilizes the k-way interaction information (KWII), an information-theoretic metric capable of identifying variable combinations associated with GEI. For model synthesis, AMBROSIA considers relevance of combinations to the phenotype, it precludes entry of combinations with redundant information, and penalizes for unjustifiable complexity; each step is KWII based. The performance and power of AMBROSIA were evaluated with simulations and Genetic Association Workshop 15 (GAW15) data sets of rheumatoid arthritis (RA). AMBROSIA identified parsimonious models in data sets containing multiple interactions with linkage disequilibrium present. For the GAW15 data set containing 9187 single-nucleotide polymorphisms, the parsimonious AMBROSIA model identified nine RA-associated combinations with power >90%. AMBROSIA was compared with multifactor dimensionality reduction across several diverse models and had satisfactory power. Software source code is available from http://www.cse.buffalo.edu/DBGROUP/bioinformatics/resources.html. AMBROSIA is a promising method for GEI model synthesis.
Collapse
|
23
|
Roshan U, Chikkagoudar S, Wei Z, Wang K, Hakonarson H. Ranking causal variants and associated regions in genome-wide association studies by the support vector machine and random forest. Nucleic Acids Res 2011; 39:e62. [PMID: 21317188 PMCID: PMC3089490 DOI: 10.1093/nar/gkr064] [Citation(s) in RCA: 33] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/20/2022] Open
Abstract
We study the number of causal variants and associated regions identified by top SNPs in rankings given by the popular 1 df chi-squared statistic, support vector machine (SVM) and the random forest (RF) on simulated and real data. If we apply the SVM and RF to the top 2r chi-square-ranked SNPs, where r is the number of SNPs with P-values within the Bonferroni correction, we find that both improve the ranks of causal variants and associated regions and achieve higher power on simulated data. These improvements, however, as well as stability of the SVM and RF rankings, progressively decrease as the cutoff increases to 5r and 10r. As applications we compare the ranks of previously replicated SNPs in real data, associated regions in type 1 diabetes, as provided by the Type 1 Diabetes Consortium, and disease risk prediction accuracies as given by top ranked SNPs by the three methods. Software and webserver are available at http://svmsnps.njit.edu.
Collapse
Affiliation(s)
- Usman Roshan
- Department of Computer Science, New Jersey Institute of Technology, Newark, NJ, USA.
| | | | | | | | | |
Collapse
|
24
|
Comparison of information-theoretic to statistical methods for gene-gene interactions in the presence of genetic heterogeneity. BMC Genomics 2010; 11:487. [PMID: 20815886 PMCID: PMC2996983 DOI: 10.1186/1471-2164-11-487] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/09/2010] [Accepted: 09/03/2010] [Indexed: 11/10/2022] Open
Abstract
Background Multifactorial diseases such as cancer and cardiovascular diseases are caused by the complex interplay between genes and environment. The detection of these interactions remains challenging due to computational limitations. Information theoretic approaches use computationally efficient directed search strategies and thus provide a feasible solution to this problem. However, the power of information theoretic methods for interaction analysis has not been systematically evaluated. In this work, we compare power and Type I error of an information-theoretic approach to existing interaction analysis methods. Methods The k-way interaction information (KWII) metric for identifying variable combinations involved in gene-gene interactions (GGI) was assessed using several simulated data sets under models of genetic heterogeneity driven by susceptibility increasing loci with varying allele frequency, penetrance values and heritability. The power and proportion of false positives of the KWII was compared to multifactor dimensionality reduction (MDR), restricted partitioning method (RPM) and logistic regression. Results The power of the KWII was considerably greater than MDR on all six simulation models examined. For a given disease prevalence at high values of heritability, the power of both RPM and KWII was greater than 95%. For models with low heritability and/or genetic heterogeneity, the power of the KWII was consistently greater than RPM; the improvements in power for the KWII over RPM ranged from 4.7% to 14.2% at for α = 0.001 in the three models at the lowest heritability values examined. KWII performed similar to logistic regression. Conclusions Information theoretic models are flexible and have excellent power to detect GGI under a variety of conditions that characterize complex diseases.
Collapse
|
25
|
Entropy and Information Approaches to Genetic Diversity and its Expression: Genomic Geography. ENTROPY 2010. [DOI: 10.3390/e12071765] [Citation(s) in RCA: 49] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 01/03/2023]
|
26
|
Abstract
Because little is currently known about how genes interact with environmental factors in human diseases, and because of the large number of possible interactions between and within genetic and environmental factors, it is difficult to simulate samples for a disease caused by multiple interacting genetic and environmental factors. A recent article by Amato and colleagues in BMC Bioinformatics describes a mathematical model to characterize gene-environment interactions and a computer program that simulates them using biologically meaningful inputs. Here, I evaluate the advantages and limitations of the authors' approach in terms of its usefulness for simulating genetic samples for real-world studies of gene-environment interactions in complex human diseases.
Collapse
Affiliation(s)
- Bo Peng
- Department of Epidemiology, The University of Texas MD Anderson Cancer Center, Houston, TX 77030, USA.
| |
Collapse
|
27
|
Chanda P, Zhang A, Sucheston L, Ramanathan M. A two-stage search strategy for detecting multiple loci associated with rheumatoid arthritis. BMC Proc 2009; 3 Suppl 7:S72. [PMID: 20018067 PMCID: PMC2795974 DOI: 10.1186/1753-6561-3-s7-s72] [Citation(s) in RCA: 8] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
Gene x gene interactions play important roles in the etiology of complex multi-factorial diseases like rheumatoid arthritis (RA). In this paper, we describe our use of a two-stage search strategy consisting of information theoretic methods and logistic regression to detect gene x gene interactions associated with RA using the data in Problem 1 of Genetic Analysis Workshop 16. Our method detected interactions of several SNPs (single-SNP and SNP x SNP) that are located on chromosomal regions linked to RA and related diseases in previous studies.
Collapse
Affiliation(s)
- Pritam Chanda
- Departments of Computer Science and Engineering, State University of New York, Buffalo, New York 14260, USA.
| | | | | | | |
Collapse
|
28
|
Chanda P, Sucheston L, Liu S, Zhang A, Ramanathan M. Information-theoretic gene-gene and gene-environment interaction analysis of quantitative traits. BMC Genomics 2009; 10:509. [PMID: 19889230 PMCID: PMC2779196 DOI: 10.1186/1471-2164-10-509] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/08/2008] [Accepted: 11/04/2009] [Indexed: 12/30/2022] Open
Abstract
Background The purpose of this research was to develop a novel information theoretic method and an efficient algorithm for analyzing the gene-gene (GGI) and gene-environmental interactions (GEI) associated with quantitative traits (QT). The method is built on two information-theoretic metrics, the k-way interaction information (KWII) and phenotype-associated information (PAI). The PAI is a novel information theoretic metric that is obtained from the total information correlation (TCI) information theoretic metric by removing the contributions for inter-variable dependencies (resulting from factors such as linkage disequilibrium and common sources of environmental pollutants). Results The KWII and the PAI were critically evaluated and incorporated within an algorithm called CHORUS for analyzing QT. The combinations with the highest values of KWII and PAI identified each known GEI associated with the QT in the simulated data sets. The CHORUS algorithm was tested using the simulated GAW15 data set and two real GGI data sets from QTL mapping studies of high-density lipoprotein levels/atherosclerotic lesion size and ultra-violet light-induced immunosuppression. The KWII and PAI were found to have excellent sensitivity for identifying the key GEI simulated to affect the two quantitative trait variables in the GAW15 data set. In addition, both metrics showed strong concordance with the results of the two different QTL mapping data sets. Conclusion The KWII and PAI are promising metrics for analyzing the GEI of QT.
Collapse
Affiliation(s)
- Pritam Chanda
- Department of Pharmaceutical Sciences, State University of New York, Buffalo, NY, USA.
| | | | | | | | | |
Collapse
|
29
|
Chanda P, Sucheston L, Zhang A, Ramanathan M. The interaction index, a novel information-theoretic metric for prioritizing interacting genetic variations and environmental factors. Eur J Hum Genet 2009; 17:1274-86. [PMID: 19293841 PMCID: PMC2952438 DOI: 10.1038/ejhg.2009.38] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/30/2008] [Revised: 01/22/2009] [Accepted: 02/12/2009] [Indexed: 11/09/2022] Open
Abstract
We developed an information-theoretic metric called the Interaction Index for prioritizing genetic variations and environmental variables for follow-up in detailed sequencing studies. The Interaction Index was found to be effective for prioritizing the genetic and environmental variables involved in GEI for a diverse range of simulated data sets. The metric was also evaluated for a 103-SNP Crohn's disease dataset and a simulated data set containing 9187 SNPs and multiple covariates that was modeled on a rheumatoid arthritis data set. Our results demonstrate that the Interaction Index algorithm is effective and efficient for prioritizing interacting variables for a diverse range of epidemiologic data sets containing complex combinations of direct effects, multiple GGI and GEI.
Collapse
Affiliation(s)
- Pritam Chanda
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
| | - Lara Sucheston
- Department of Biostatistics, State University of New York, Buffalo, NY, USA
| | - Aidong Zhang
- Department of Computer Science and Engineering, State University of New York, Buffalo, NY, USA
| | - Murali Ramanathan
- Department of Pharmaceutical Sciences, State University of New York, Buffalo, NY, USA
| |
Collapse
|