1
Banman A, Sakhanenko NA, Kunert-Graf J, Galas DJ. ApoE Modifier Alleles for Alzheimer's Disease Discovered by Information Theory Dependency Measures: MIST Software Package. J Comput Biol 2023; 30:323-336. [PMID: 36322888 PMCID: PMC9993164 DOI: 10.1089/cmb.2022.0185] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 03/09/2023] Open
Abstract
Information theory-based measures of variable dependency (previously published) have been implemented into a software package, MIST. The design of the software and its potential uses are described, and a demonstration is presented in the discovery of modifier alleles of the ApoE gene in affecting Alzheimer's disease (AD) by analyzing the UK Biobank dataset. The modifier genes uncovered overlap strongly with genes found to be associated with AD. Others include many known to influence AD. We discuss a range of uses of the dependency calculations using MIST that can uncover additional genetic effects in similar complex datasets, like higher degrees of interaction and phenotypic pleiotropy.
Affiliation(s)
- Andrew Banman
- Pacific Northwest Research Institute, Seattle, Washington, USA
- David J Galas
- Pacific Northwest Research Institute, Seattle, Washington, USA
2
Environmental neuroscience linking exposome to brain structure and function underlying cognition and behavior. Mol Psychiatry 2023; 28:17-27. [PMID: 35790874 DOI: 10.1038/s41380-022-01669-6] [Citation(s) in RCA: 15] [Impact Index Per Article: 7.5] [Received: 11/07/2021] [Revised: 06/02/2022] [Accepted: 06/09/2022] [Indexed: 01/07/2023]
Abstract
Individual differences in human brain structure, function, and behavior can be attributed to genetic variations, environmental exposures, and their interactions. Although genome-wide association studies have identified many genetic variants associated with brain imaging phenotypes, environmental exposures associated with these phenotypes remain largely unknown. Here, we propose that environmental neuroscience should pay more attention to exploring the associations between lifetime environmental exposures (exposome) and brain imaging phenotypes and identifying both cumulative environmental effects and their vulnerable age windows during the life course. Exposome-neuroimaging association studies face several challenges including the accurate measurement of the totality of environmental exposures that vary in space and time, the highly correlated structure of the exposome, and the lack of standardized approaches for exposome-wide association studies. By agnostically scanning the effects of environmental exposures on brain imaging phenotypes and their interactions with genomic variations, exposome-neuroimaging association analyses will improve our understanding of causal factors associated with individual differences in brain structure and function as well as their relations with cognitive abilities and neuropsychiatric disorders.
3
Walakira A, Ocira J, Duroux D, Fouladi R, Moškon M, Rozman D, Van Steen K. Detecting gene-gene interactions from GWAS using diffusion kernel principal components. BMC Bioinformatics 2022; 23:57. [PMID: 35105309 PMCID: PMC8805268 DOI: 10.1186/s12859-022-04580-7] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.3] [Received: 06/30/2021] [Accepted: 01/18/2022] [Indexed: 11/10/2022] Open
Abstract
Genes and gene products do not function in isolation but as components of complex networks of macromolecules through physical or biochemical interactions. Dependencies of gene mutations on genetic background (i.e., epistasis) are believed to play a role in understanding molecular underpinnings of complex diseases such as inflammatory bowel disease (IBD). However, the process of identifying such interactions is complex due to, for instance, the curse of high dimensionality, dependencies in the data, and non-linearity. Here, we propose a novel approach for robust and computationally efficient epistasis detection. We do so by first reducing dimensionality per gene via diffusion kernel principal components (kpc). Subsequently, kpc gene summaries are used for downstream analysis including the construction of a gene-based epistasis network. We show that our approach is not only able to recover known IBD associated genes but also additional genes of interest linked to this difficult gastrointestinal disease.
Affiliation(s)
- Andrew Walakira
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Junior Ocira
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- Diane Duroux
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- Ramouna Fouladi
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- Miha Moškon
- Faculty of Computer and Information Science, University of Ljubljana, Ljubljana, Slovenia
- Damjana Rozman
- Centre for Functional Genomics and Bio-Chips, Institute for Biochemistry and Molecular Genetics, Faculty of Medicine, University of Ljubljana, Ljubljana, Slovenia
- Kristel Van Steen
- BIO3 - Laboratory for Systems Genetics, GIGA-R Medical Genomics, University of Liège, Liège, Belgium
- BIO3 - Laboratory for Systems Medicine, Department of Human Genetics, KU Leuven, Leuven, Belgium
4
Diaz-Gallo LM, Brynedal B, Westerlind H, Sandberg R, Ramsköld D. Understanding interactions between risk factors, and assessing the utility of the additive and multiplicative models through simulations. PLoS One 2021; 16:e0250282. [PMID: 33901204 PMCID: PMC8075235 DOI: 10.1371/journal.pone.0250282] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 08/07/2020] [Accepted: 04/02/2021] [Indexed: 01/04/2023] Open
Abstract
Understanding the genetic background of complex diseases requires the expansion of studies beyond univariate associations. Therefore, it is important to use interaction assessments of risk factors in order to discover whether, and how, genetic risk variants act together on disease development. The principle of interaction analysis is to explore the magnitude of the combined effect of risk factors on disease causation. In this study, we use simulations to investigate different scenarios of causation to show how the magnitudes of the effects of two risk factors interact. We mainly focus on the two most commonly used interaction models, the additive and multiplicative risk scales, since there is often confusion regarding their use and interpretation. Our results show that the combined effect is multiplicative when two risk factors are involved in the same chain of events, an interaction called synergism. Synergism is often described as a deviation from additivity, which is a broader term. Our results also confirm that it is often relevant to estimate additive effect relationships, because they correspond to independent risk factors at low disease prevalence. Importantly, we evaluate the threshold of more than two required risk factors for disease causation, called the multifactorial threshold model. We found a simple mathematical relationship (square root) between the threshold and an additive-to-multiplicative linear effect scale (AMLES), where 0 corresponds to an additive effect and 1 to a multiplicative effect. We propose AMLES as a metric that could be used to test different effects relationships at the same time, given that it can simultaneously reveal additive, multiplicative and intermediate risk effects relationships. Finally, the utility of our simulation study was demonstrated using real data by analyzing and interpreting gene-gene interaction odds ratios from a rheumatoid arthritis case-control cohort.
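The distinction between the additive and multiplicative risk scales in this abstract can be illustrated with a small sketch. It uses the standard textbook expectations for a joint relative risk (excess risks add, or relative risks multiply) and the relative excess risk due to interaction (RERI). The function `scale_position` is a hypothetical linear interpolation between the two expectations, written here only for illustration; it is an assumption in the spirit of a 0-to-1 scale and is not claimed to be the paper's AMLES definition.

```python
def expected_joint_rr(rr_a, rr_b):
    """Joint relative risk expected under the additive and multiplicative models."""
    additive = rr_a + rr_b - 1.0      # excess risks (RR - 1) add up
    multiplicative = rr_a * rr_b      # relative risks multiply
    return additive, multiplicative


def reri(rr_a, rr_b, rr_ab):
    """Relative excess risk due to interaction; 0 means exactly additive."""
    return rr_ab - rr_a - rr_b + 1.0


def scale_position(rr_a, rr_b, rr_ab):
    """Hypothetical helper (not from the paper): where the observed joint RR
    falls between the additive (0.0) and multiplicative (1.0) expectations."""
    additive, multiplicative = expected_joint_rr(rr_a, rr_b)
    if multiplicative == additive:
        # Happens when either single-factor RR equals 1: the two models coincide.
        raise ValueError("additive and multiplicative expectations coincide")
    return (rr_ab - additive) / (multiplicative - additive)


if __name__ == "__main__":
    # Single-factor relative risks of 2 and 3: additive expects 4, multiplicative 6.
    for joint in (4.0, 5.0, 6.0):
        print(joint, reri(2.0, 3.0, joint), scale_position(2.0, 3.0, joint))
```

With single-factor relative risks of 2 and 3, a joint relative risk of 4 sits at the additive end, 6 at the multiplicative end, and 5 halfway between.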
Affiliation(s)
- Lina-Marcela Diaz-Gallo
- Division of Rheumatology, Department of Medicine Solna, Center for Molecular Medicine, Karolinska Institutet, Karolinska University Hospital, Stockholm, Sweden
- Boel Brynedal
- Institute of Environmental Medicine, Karolinska Institutet, Stockholm, Sweden
- Helga Westerlind
- Clinical Epidemiology Division, Department of Medicine Solna, Karolinska Institutet, Stockholm, Sweden
- Rickard Sandberg
- Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden
- Daniel Ramsköld
- Department of Cell and Molecular Biology, Karolinska Institutet, Stockholm, Sweden
5
Kunert-Graf JM, Sakhanenko NA, Galas DJ. Optimized permutation testing for information theoretic measures of multi-gene interactions. BMC Bioinformatics 2021; 22:180. [PMID: 33827420 PMCID: PMC8028212 DOI: 10.1186/s12859-021-04107-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 1.0] [Received: 10/01/2020] [Accepted: 03/29/2021] [Indexed: 11/17/2022] Open
Abstract
Background Permutation testing is often considered the “gold standard” for multi-test significance analysis, as it is an exact test requiring few assumptions about the distribution being computed. However, it can be computationally very expensive, particularly in its naive form in which the full analysis pipeline is re-run after permuting the phenotype labels. This can become intractable in multi-locus genome-wide association studies (GWAS), in which the number of potential interactions to be tested is combinatorially large. Results In this paper, we develop an approach for permutation testing in multi-locus GWAS, specifically focusing on SNP–SNP-phenotype interactions using multivariable measures that can be computed from frequency count tables, such as those based in Information Theory. We find that the computational bottleneck in this process is the construction of the count tables themselves, and that this step can be eliminated at each iteration of the permutation testing by transforming the count tables directly. This leads to a speed-up by a factor of over 10^3 for a typical permutation test compared to the naive approach. Additionally, this approach is insensitive to the number of samples, making it suitable for datasets with a large number of samples. Conclusions The proliferation of large-scale datasets with genotype data for hundreds of thousands of individuals enables new and more powerful approaches for the detection of multi-locus genotype-phenotype interactions. Our approach significantly improves the computational tractability of permutation testing for these studies. Moreover, our approach is insensitive to the large number of samples in these modern datasets. The code for performing these computations and replicating the figures in this paper is freely available at https://github.com/kunert/permute-counts.
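The idea of permuting at the level of count tables, rather than re-shuffling per-sample phenotype labels and rebuilding the table each time, can be sketched as follows. This is an illustrative reading of the abstract and not the authors' implementation (their code is at the URL above). The sketch assumes a binary phenotype and a flattened list of genotype-combination cells; under a random relabeling of the phenotype, the number of cases landing in each cell follows a multivariate hypergeometric distribution, so permuted tables can be drawn directly with NumPy's `Generator.multivariate_hypergeometric`. The function names and the choice of mutual information as the test statistic are assumptions made for this example.

```python
import numpy as np


def mutual_information(table):
    """Mutual information in bits of a 2-D contingency table of counts."""
    p = table / table.sum()
    px = p.sum(axis=1, keepdims=True)   # row marginals, shape (rows, 1)
    py = p.sum(axis=0, keepdims=True)   # column marginals, shape (1, cols)
    nz = p > 0                          # skip empty cells (0 * log 0 = 0)
    return float((p[nz] * np.log2(p[nz] / (px @ py)[nz])).sum())


def permutation_pvalue(cases, controls, n_perm=1000, seed=0):
    """Permutation p-value computed from counts alone.

    cases[i] and controls[i] are the counts of cases and controls in
    genotype cell i. Each permuted table is sampled directly instead of
    shuffling individual phenotype labels.
    """
    cases = np.asarray(cases)
    controls = np.asarray(controls)
    totals = cases + controls
    observed = mutual_information(np.vstack([cases, controls]))

    rng = np.random.default_rng(seed)
    # One row per permutation: how many of the cases fall in each cell.
    perm_cases = rng.multivariate_hypergeometric(
        totals, int(cases.sum()), size=n_perm
    )
    exceed = 0
    for pc in perm_cases:
        if mutual_information(np.vstack([pc, totals - pc])) >= observed:
            exceed += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return observed, (1 + exceed) / (1 + n_perm)


if __name__ == "__main__":
    mi, p = permutation_pvalue([50, 5, 5], [5, 50, 50], n_perm=1000)
    print(f"MI = {mi:.3f} bits, permutation p = {p:.4f}")
```

The cost of each permutation here depends on the number of table cells, not on the number of samples, which is the property the abstract emphasizes.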
Affiliation(s)
- James M Kunert-Graf
- Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA.
- David J Galas
- Pacific Northwest Research Institute, 720 Broadway, Seattle, WA, 98122, USA
6
MINERVA, A Platform for the Exploration of Disease Maps. SYSTEMS MEDICINE 2021. [DOI: 10.1016/b978-0-12-801238-3.11685-x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/23/2022] Open
7
Abstract
Genome-wide association studies are moving to genome-wide interaction studies, as the genetic background of many diseases appears to be more complex than previously supposed. Thus, many statistical approaches have been proposed to detect gene-gene (GxG) interactions, among them numerous information theory-based methods, inspired by the concept of entropy. These are suggested as particularly powerful and, because of their nonlinearity, as better able to capture nonlinear relationships between genetic variants and/or variables. However, the introduced entropy-based estimators differ to a surprising extent in their construction and even with respect to the basic definition of interactions. Also, not every entropy-based measure for interaction is accompanied by a proper statistical test. To shed light on this, a systematic review of the literature is presented answering the following questions: (1) How are GxG interactions defined within the framework of information theory? (2) Which entropy-based test statistics are available? (3) Which underlying distribution do the test statistics follow? (4) What are the given strengths and limitations of these test statistics?
Affiliation(s)
- Inke R König
- Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, Lübeck, Germany
- Corresponding author. Inke R. König, Institut für Medizinische Biometrie und Statistik, Universität zu Lübeck, Universitätsklinikum Schleswig-Holstein, Campus Lübeck, Ratzeburger Allee 160, 23562 Lübeck, Germany. Tel.: ++49 451 500 50610; Fax: ++49 451 500 50604
8
Timme NM, Lapish C. A Tutorial for Information Theory in Neuroscience. eNeuro 2018; 5:ENEURO.0052-18.2018. [PMID: 30211307 PMCID: PMC6131830 DOI: 10.1523/eneuro.0052-18.2018] [Citation(s) in RCA: 112] [Impact Index Per Article: 16.0] [Received: 01/19/2018] [Revised: 04/10/2018] [Accepted: 05/30/2018] [Indexed: 11/21/2022] Open
Abstract
Understanding how neural systems integrate, encode, and compute information is central to understanding brain function. Frequently, data from neuroscience experiments are multivariate, the interactions between the variables are nonlinear, and the landscape of hypothesized or possible interactions between variables is extremely broad. Information theory is well suited to address these types of data, as it possesses multivariate analysis tools, it can be applied to many different types of data, it can capture nonlinear interactions, and it does not require assumptions about the structure of the underlying data (i.e., it is model independent). In this article, we walk through the mathematics of information theory along with common logistical problems associated with data type, data binning, data quantity requirements, bias, and significance testing. Next, we analyze models inspired by canonical neuroscience experiments to improve understanding and demonstrate the strengths of information theory analyses. To facilitate the use of information theory analyses, and an understanding of how these analyses are implemented, we also provide a free MATLAB software package that can be applied to a wide range of data from neuroscience experiments, as well as from other fields of study.
Affiliation(s)
- Nicholas M Timme
- Department of Psychology, Indiana University - Purdue University Indianapolis, 402 N. Blackford St, Indianapolis, IN 46202
- Christopher Lapish
- Department of Psychology, Indiana University - Purdue University Indianapolis, 402 N. Blackford St, Indianapolis, IN 46202
9
Xu EL, Qian X, Yu Q, Zhang H, Cui S. Feature selection with interactions in logistic regression models using multivariate synergies for a GWAS application. BMC Genomics 2018; 19:170. [PMID: 29589561 PMCID: PMC5872388 DOI: 10.1186/s12864-018-4552-x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.6] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND Genotype-phenotype association has been one of the long-standing problems in bioinformatics. Identifying both the marginal and epistatic effects among genetic markers, such as Single Nucleotide Polymorphisms (SNPs), has been extensively integrated in Genome-Wide Association Studies (GWAS) to help derive "causal" genetic risk factors and their interactions, which play critical roles in life and disease systems. Identifying "synergistic" interactions with respect to the outcome of interest can help accurate phenotypic prediction and understand the underlying mechanism of system behavior. Many statistical measures for estimating synergistic interactions have been proposed in the literature for such a purpose. However, except for empirical performance, there is still no theoretical analysis on the power and limitation of these synergistic interaction measures. RESULTS In this paper, it is shown that the existing information-theoretic multivariate synergy depends on a small subset of the interaction parameters in the model, sometimes on only one interaction parameter. In addition, an adjusted version of multivariate synergy is proposed as a new measure to estimate the interactive effects, with experiments conducted over both simulated data sets and a real-world GWAS data set to show the effectiveness. CONCLUSIONS We provide rigorous theoretical analysis and empirical evidence on why the information-theoretic multivariate synergy helps with identifying genetic risk factors via synergistic interactions. We further establish the rigorous sample complexity analysis on detecting interactive effects, confirmed by both simulated and real-world data sets.
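For readers unfamiliar with information-theoretic synergy, a minimal sketch may help. It assumes the commonly used pairwise definition, Syn(X1, X2; Y) = I(X1, X2; Y) - I(X1; Y) - I(X2; Y), estimated with plug-in entropies from discrete samples. This is an assumption for illustration: it is not the adjusted measure the paper proposes, and the function names are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(*columns):
    """Plug-in joint entropy (bits) of one or more equal-length discrete columns."""
    rows = list(zip(*columns))
    n = len(rows)
    return -sum(c / n * log2(c / n) for c in Counter(rows).values())


def mutual_information(xs, y):
    """I(X; Y) in bits, where xs is a list of columns treated jointly as X."""
    return entropy(*xs) + entropy(y) - entropy(*xs, y)


def synergy(x1, x2, y):
    """Information the pair carries about y beyond the two single-variable terms."""
    return (
        mutual_information([x1, x2], y)
        - mutual_information([x1], y)
        - mutual_information([x2], y)
    )


if __name__ == "__main__":
    x1 = [0, 0, 1, 1]
    x2 = [0, 1, 0, 1]
    xor = [a ^ b for a, b in zip(x1, x2)]
    # XOR: neither variable alone is informative, the pair is fully informative.
    print("XOR synergy:", synergy(x1, x2, xor))
    # Fully redundant copies: the measure turns negative.
    print("redundant synergy:", synergy(x1, x1, x1))
```

The XOR case shows why marginal tests miss purely epistatic effects: each single-variable mutual information is zero while the joint term is one bit.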
Affiliation(s)
- Easton Li Xu
- Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor, 48109 MI USA
- School of Science and Engineering, Chinese University of Hong Kong, Shenzhen, Guangdong, 518172 China
- Xiaoning Qian
- Department of Electrical and Computer Engineering, Texas A&M University, College Station, 77843 TX USA
- Qilian Yu
- Department of Electrical and Computer Engineering, University of California, Davis, 95616 CA USA
- Han Zhang
- Department of Electrical and Computer Engineering, University of California, Davis, 95616 CA USA
- Shuguang Cui
- Department of Electrical and Computer Engineering, University of California, Davis, 95616 CA USA
10
Sakhanenko NA, Kunert-Graf J, Galas DJ. The Information Content of Discrete Functions and Their Application in Genetic Data Analysis. J Comput Biol 2017; 24:1153-1178. [PMID: 29028175 PMCID: PMC5729883 DOI: 10.1089/cmb.2017.0143] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Indexed: 11/24/2022] Open
Abstract
The complex of central problems in data analysis consists of three components: (1) detecting the dependence of variables using quantitative measures, (2) defining the significance of these dependence measures, and (3) inferring the functional relationships among dependent variables. We have argued previously that an information theory approach allows separation of the detection problem from the inference of functional form problem. We approach here the third component of inferring functional forms based on information encoded in the functions. We present here a direct method for classifying the functional forms of discrete functions of three variables represented in data sets. Discrete variables are frequently encountered in data analysis, both as the result of inherently categorical variables and from the binning of continuous numerical variables into discrete alphabets of values. The fundamental question of how much information is contained in a given function is answered for these discrete functions, and their surprisingly complex relationships are illustrated. The all-important effect of noise on the inference of function classes is found to be highly heterogeneous and reveals some unexpected patterns. We apply this classification approach to an important area of biological data analysis—that of inference of genetic interactions. Genetic analysis provides a rich source of real and complex biological data analysis problems, and our general methods provide an analytical basis and tools for characterizing genetic problems and for analyzing genetic data. We illustrate the functional description and the classes of a number of common genetic interaction modes and also show how different modes vary widely in their sensitivity to noise.
Affiliation(s)
- David J Galas
- Pacific Northwest Research Institute, Seattle, Washington
11
12
Chen Y, Cao D, Gao J, Yuan Z. Discovering Pair-wise Synergies in Microarray Data. Sci Rep 2016; 6:30672. [PMID: 27470995 PMCID: PMC4965793 DOI: 10.1038/srep30672] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/04/2016] [Accepted: 07/07/2016] [Indexed: 01/01/2023] Open
Abstract
Informative gene selection can have important implications for the improvement of cancer diagnosis and the identification of new drug targets. Individual-gene-ranking methods ignore interactions between genes. Furthermore, popular pair-wise gene evaluation methods, e.g. TSP and TSG, are helpless for discovering pair-wise interactions. Several efforts to discover pair-wise synergy have been made based on the information approach, such as EMBP and FeatKNN. However, the methods which are employed to estimate mutual information, e.g. binarization, histogram-based and KNN estimators, depend on known data or domain characteristics. Recently, Reshef et al. proposed a novel maximal information coefficient (MIC) measure to capture a wide range of associations between two variables that has the property of generality. An extension from MIC(X; Y) to MIC(X1; X2; Y) is therefore desired. We developed an approximation algorithm for estimating MIC(X1; X2; Y) where Y is a discrete variable. MIC(X1; X2; Y) is employed to detect pair-wise synergy in simulation and cancer microarray data. The results indicate that MIC(X1; X2; Y) also has the property of generality. It can discover synergic genes that are undetectable by reference feature selection methods such as MIC(X; Y) and TSG. Synergic genes can distinguish different phenotypes. Finally, the biological relevance of these synergic genes is validated with GO annotation and OUgene database.
Affiliation(s)
- Yuan Chen
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, Hunan, 410128, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, Hunan, 410128, China
- Dan Cao
- Orient Science & Technology College of Hunan Agricultural University, Changsha, Hunan, 410128, China
- Jun Gao
- College of Resources & Environment, Hunan Agricultural University, Changsha, Hunan, 410128, China; Department of Biochemistry and Molecular Biology, University of Arkansas for Medical Sciences, Little Rock, Arkansas, 72205, USA
- Zheming Yuan
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Hunan Agricultural University, Changsha, Hunan, 410128, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, Hunan, 410128, China
13
Chen Y, Wang L, Li L, Zhang H, Yuan Z. Informative gene selection and the direct classification of tumors based on relative simplicity. BMC Bioinformatics 2016; 17:44. [PMID: 26792270 PMCID: PMC4721022 DOI: 10.1186/s12859-016-0893-0] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 09/10/2015] [Accepted: 01/19/2016] [Indexed: 12/15/2022] Open
Abstract
BACKGROUND Selecting a parsimonious set of informative genes to build a classifier with high generalization performance is the most important task in the analysis of tumor microarray expression data. Many existing gene pair evaluation methods cannot highlight diverse patterns of gene pairs because they use only one of two strategies, vertical comparison or horizontal comparison, while individual-gene-ranking methods ignore redundancy and synergy among genes. RESULTS Here we propose a novel score measure named relative simplicity (RS). We evaluated gene pairs by integrating vertical comparison with horizontal comparison, and finally built an RS-based direct classifier (RS-based DC) based on a set of informative genes capable of binary discrimination with a paired votes strategy. Nine multi-class gene expression datasets involving human cancers were used to validate the performance of the new method. Compared with the nine reference models, RS-based DC achieved the highest average independent test accuracy (91.40%), the best generalization performance and the smallest average number of informative genes (20.56). Compared with the four reference feature selection methods, RS also achieved the highest average test accuracy with three classifiers (Naïve Bayes, k-Nearest Neighbor and Support Vector Machine), and only RS improved the performance of SVM. CONCLUSIONS Diverse patterns of gene pairs can be highlighted more fully when vertical comparison is integrated with horizontal comparison. The DC core classifier can effectively control over-fitting. The RS-based feature selection method combined with the DC classifier can lead to more robust selection of informative genes and classification accuracy.
Affiliation(s)
- Yuan Chen
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China
- Lifeng Wang
- Biotechnology Research Center, Hunan Academy of Agricultural Sciences, Changsha, China
- Lanzhi Li
- Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China
- Hongyan Zhang
- Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China
- Zheming Yuan
- Hunan Provincial Key Laboratory for Biology and Control of Plant Diseases and Insect Pests, Changsha, China; Hunan Provincial Key Laboratory for Germplasm Innovation and Utilization of Crop, Hunan Agricultural University, Changsha, China
14
Sakhanenko NA, Galas DJ. Biological data analysis as an information theory problem: multivariable dependence measures and the shadows algorithm. J Comput Biol 2015; 22:1005-24. [PMID: 26335709 PMCID: PMC4642827 DOI: 10.1089/cmb.2015.0051] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.3] [Indexed: 01/17/2023] Open
Abstract
Information theory is valuable in multiple-variable analysis for being model-free and nonparametric, and for the modest sensitivity to undersampling. We previously introduced a general approach to finding multiple dependencies that provides accurate measures of levels of dependency for subsets of variables in a data set, which is significantly nonzero only if the subset of variables is collectively dependent. This is useful, however, only if we can avoid a combinatorial explosion of calculations for increasing numbers of variables. The proposed dependence measure for a subset of variables, τ, differential interaction information, Δ(τ), has the property that for subsets of τ some of the factors of Δ(τ) are significantly nonzero, when the full dependence includes more variables. We use this property to suppress the combinatorial explosion by following the “shadows” of multivariable dependency on smaller subsets. Rather than calculating the marginal entropies of all subsets at each degree level, we need to consider only calculations for subsets of variables with appropriate “shadows.” The number of calculations for n variables at a degree level of d therefore grows at a much smaller rate than the binomial coefficient (n, d), but depends on the parameters of the “shadows” calculation. This approach, avoiding a combinatorial explosion, enables the use of our multivariable measures on very large data sets. We demonstrate this method on simulated data sets, and characterize the effects of noise and sample numbers. In addition, we analyze a data set of a few thousand mutant yeast strains interacting with a few thousand chemical compounds.
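To give a sense of the combinatorial explosion the abstract refers to, the sketch below counts how many variable subsets an exhaustive search would have to score at each degree level, which is the binomial coefficient C(n, d). It only illustrates the baseline cost being avoided; it does not implement the shadows algorithm, and the function name is hypothetical.

```python
from math import comb


def exhaustive_subset_counts(n_variables, max_degree):
    """Map each degree d in 2..max_degree to C(n_variables, d),
    the number of d-variable subsets an exhaustive scan must evaluate."""
    return {d: comb(n_variables, d) for d in range(2, max_degree + 1)}


if __name__ == "__main__":
    for degree, n_subsets in exhaustive_subset_counts(1000, 4).items():
        print(f"degree {degree}: {n_subsets:,} subsets")
```

For 1000 variables the count already rises from roughly half a million pairs to over 166 million triples, which is why pruning the search by following lower-degree signals matters.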
Affiliation(s)
- David J Galas
- Pacific Northwest Diabetes Research Institute, Seattle, Washington; Luxembourg Centre for Systems Biomedicine, Université de Luxembourg, Luxembourg, Luxembourg