1
|
Robust spatial estimates of biomass carbon on farms. THE SCIENCE OF THE TOTAL ENVIRONMENT 2023; 861:160618. [PMID: 36460106 DOI: 10.1016/j.scitotenv.2022.160618] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Received: 08/10/2022] [Revised: 11/27/2022] [Accepted: 11/27/2022] [Indexed: 06/17/2023]
Abstract
The drive for farm businesses to move towards net zero greenhouse gas emissions means that there is a need to develop robust methods to quantify the amount of biomass carbon (C) on farms. Direct measurements can be destructive and time-consuming and some prediction methods provide no assessment of uncertainty. This study describes the development, validation, and use of an integrated spatial approach, including the use of lidar data, and Bayesian Belief Networks (BBNs) to quantify total biomass carbon stocks (Ctotal) of i) land cover and ii) landscape features such as hedges and lone trees for five case study sites in lowland England. The results demonstrated that it was possible to develop and use a remote integrated approach to estimate biomass carbon at a farm scale. The highest achievable prediction accuracy was attained from models using the variables AGBC, BGBC, DOMC, age, height, species and land cover, derived from measured information and from literature review. The two BBN models successfully predicted the test values of the total biomass carbon with propagated error rates of 6.7 % and 4.3 % for the land cover and landscape features respectively. These error rates were lower than in other studies indicating that the seven predictors are strong determinants of biomass carbon. The lidar data also enabled the spatial presentation and calculation of the variable C stocks along the length of hedges and within woodlands.
Collapse
|
2
|
A Novel Multitasking Ant Colony Optimization Method for Detecting Multiorder SNP Interactions. Interdiscip Sci 2022; 14:814-832. [PMID: 35788965 DOI: 10.1007/s12539-022-00530-2] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/10/2021] [Revised: 05/29/2022] [Accepted: 06/01/2022] [Indexed: 06/15/2023]
Abstract
MOTIVATION Linear or nonlinear interactions of multiple single-nucleotide polymorphisms (SNPs) play an important role in understanding the genetic basis of complex human diseases. However, combinatorial analytics in high-dimensional space makes it extremely challenging to detect multiorder SNP interactions. Most classic approaches can only perform one task (for detecting k-order SNP interactions) in each run. Since prior knowledge of a complex disease is usually not available, it is difficult to determine the value of k for detecting k-order SNP interactions. METHODS A novel multitasking ant colony optimization algorithm (named MTACO-DMSI) is proposed to detect multiorder SNP interactions, and it is divided into two stages: searching and testing. In the searching stage, multiple multiorder SNP interaction detection tasks (from 2nd-order to kth-order) are executed in parallel, and two subpopulations that separately adopt the Bayesian network-based K2-score and Jensen-Shannon divergence (JS-score) as evaluation criteria are generated for each task to improve the global search capability and the discrimination ability for various disease models. In the testing stage, the G test statistical test is adopted to further verify the authenticity of candidate solutions to reduce the error rate. RESULT Three multiorder simulated disease models with different interaction effects and three real age-related macular degeneration (AMD), rheumatoid arthritis (RA) and type 1 diabetes (T1D) datasets were used to investigate the performance of the proposed MTACO-DMSI. The experimental results show that the MTACO-DMSI has a faster search speed and higher discriminatory power for diverse simulation disease models than traditional single-task algorithms. The results on real AMD data and RA and T1D datasets indicate that MTACO-DMSI has the ability to detect multiorder SNP interactions at a genome-wide scale. Availability and implementation: https://github.com/shouhengtuo/MTACO-DMSI/.
Collapse
|
3
|
A causal learning framework for the analysis and interpretation of COVID-19 clinical data. PLoS One 2022; 17:e0268327. [PMID: 35588440 PMCID: PMC9119448 DOI: 10.1371/journal.pone.0268327] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/12/2021] [Accepted: 04/27/2022] [Indexed: 11/29/2022] Open
Abstract
We present a workflow for clinical data analysis that relies on Bayesian Structure Learning (BSL), an unsupervised learning approach, robust to noise and biases, that allows to incorporate prior medical knowledge into the learning process and that provides explainable results in the form of a graph showing the causal connections among the analyzed features. The workflow consists in a multi-step approach that goes from identifying the main causes of patient’s outcome through BSL, to the realization of a tool suitable for clinical practice, based on a Binary Decision Tree (BDT), to recognize patients at high-risk with information available already at hospital admission time. We evaluate our approach on a feature-rich dataset of Coronavirus disease (COVID-19), showing that the proposed framework provides a schematic overview of the multi-factorial processes that jointly contribute to the outcome. We compare our findings with current literature on COVID-19, showing that this approach allows to re-discover established cause-effect relationships about the disease. Further, our approach yields to a highly interpretable tool correctly predicting the outcome of 85% of subjects based exclusively on 3 features: age, a previous history of chronic obstructive pulmonary disease and the PaO2/FiO2 ratio at the time of arrival to the hospital. The inclusion of additional information from 4 routine blood tests (Creatinine, Glucose, pO2 and Sodium) increases predictive accuracy to 94.5%.
Collapse
|
4
|
Multi-Objective Artificial Bee Colony Algorithm Based on Scale-Free Network for Epistasis Detection. Genes (Basel) 2022; 13:genes13050871. [PMID: 35627256 PMCID: PMC9140669 DOI: 10.3390/genes13050871] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/26/2022] [Revised: 04/30/2022] [Accepted: 05/10/2022] [Indexed: 12/04/2022] Open
Abstract
In genome-wide association studies, epistasis detection is of great significance for the occurrence and diagnosis of complex human diseases, but it also faces challenges such as high dimensionality and a small data sample size. In order to cope with these challenges, several swarm intelligence methods have been introduced to identify epistasis in recent years. However, the existing methods still have some limitations, such as high-consumption and premature convergence. In this study, we proposed a multi-objective artificial bee colony (ABC) algorithm based on the scale-free network (SFMOABC). The SFMOABC incorporates the scale-free network into the ABC algorithm to guide the update and selection of solutions. In addition, the SFMOABC uses mutual information and the K2-Score of the Bayesian network as objective functions, and the opposition-based learning strategy is used to improve the search ability. Experiments were performed on both simulation datasets and a real dataset of age-related macular degeneration (AMD). The results of the simulation experiments showed that the SFMOABC has better detection power and efficiency than seven other epistasis detection methods. In the real AMD data experiment, most of the single nucleotide polymorphism combinations detected by the SFMOABC have been shown to be associated with AMD disease. Therefore, SFMOABC is a promising method for epistasis detection.
Collapse
|
5
|
Automated Cyber and Privacy Risk Management Toolkit. SENSORS (BASEL, SWITZERLAND) 2021; 21:5493. [PMID: 34450935 PMCID: PMC8399421 DOI: 10.3390/s21165493] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 07/05/2021] [Revised: 08/11/2021] [Accepted: 08/12/2021] [Indexed: 11/17/2022]
Abstract
Addressing cyber and privacy risks has never been more critical for organisations. While a number of risk assessment methodologies and software tools are available, it is most often the case that one must, at least, integrate them into a holistic approach that combines several appropriate risk sources as input to risk mitigation tools. In addition, cyber risk assessment primarily investigates cyber risks as the consequence of vulnerabilities and threats that threaten assets of the investigated infrastructure. In fact, cyber risk assessment is decoupled from privacy impact assessment, which aims to detect privacy-specific threats and assess the degree of compliance with data protection legislation. Furthermore, a Privacy Impact Assessment (PIA) is conducted in a proactive manner during the design phase of a system, combining processing activities and their inter-dependencies with assets, vulnerabilities, real-time threats and Personally Identifiable Information (PII) that may occur during the dynamic life-cycle of systems. In this paper, we propose a cyber and privacy risk management toolkit, called AMBIENT (Automated Cyber and Privacy Risk Management Toolkit) that addresses the above challenges by implementing and integrating three distinct software tools. AMBIENT not only assesses cyber and privacy risks in a thorough and automated manner but it also offers decision-support capabilities, to recommend optimal safeguards using the well-known repository of the Center for Internet Security (CIS) Controls. To the best of our knowledge, AMBIENT is the first toolkit in the academic literature that brings together the aforementioned capabilities. To demonstrate its use, we have created a case scenario based on information about cyber attacks we have received from a healthcare organisation, as a reference sector that faces critical cyber and privacy threats.
Collapse
|
6
|
A framework for modeling epistatic interaction. Bioinformatics 2021; 37:1708-1716. [PMID: 33252645 DOI: 10.1093/bioinformatics/btaa990] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/07/2020] [Revised: 10/21/2020] [Accepted: 11/16/2020] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Recently, various tools for detecting single nucleotide polymorphisms (SNPs) involved in epistasis have been developed. However, no studies evaluate the employed statistical epistasis models such as the χ2-test or quadratic regression independently of the tools that use them. Such an independent evaluation is crucial for developing improved epistasis detection tools, for it allows to decide if a tool's performance should be attributed to the epistasis model or to the optimization strategy run on top of it. RESULTS We present a protocol for evaluating epistasis models independently of the tools they are used in and generalize existing models designed for dichotomous phenotypes to the categorical and quantitative case. In addition, we propose a new model which scores candidate SNP sets by computing maximum likelihood distributions for the observed phenotypes in the cells of their penetrance tables. Extensive experiments show that the proposed maximum likelihood model outperforms three widely used epistasis models in most cases. The experiments also provide valuable insights into the properties of existing models, for instance, that quadratic regression perform particularly well on instances with quantitative phenotypes. AVAILABILITY AND IMPLEMENTATION The evaluation protocol and all compared models are implemented in C++ and are supported under Linux and macOS. They are available at https://github.com/baumbachlab/genepiseeker/, along with test datasets and scripts to reproduce the experiments. SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
|
7
|
Brief Survey on Machine Learning in Epistasis. Methods Mol Biol 2021; 2212:169-179. [PMID: 33733356 DOI: 10.1007/978-1-0716-0947-7_11] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 03/09/2023]
Abstract
In biology, the term "epistasis" indicates the effect of the interaction of a gene with another gene. A gene can interact with an independently sorted gene, located far away on the chromosome or on an entirely different chromosome, and this interaction can have a strong effect on the function of the two genes. These changes then can alter the consequences of the biological processes, influencing the organism's phenotype. Machine learning is an area of computer science that develops statistical methods able to recognize patterns from data. A typical machine learning algorithm consists of a training phase, where the model learns to recognize specific trends in the data, and a test phase, where the trained model applies its learned intelligence to recognize trends in external data. Scientists have applied machine learning to epistasis problems multiple times, especially to identify gene-gene interactions from genome-wide association study (GWAS) data. In this brief survey, we report and describe the main scientific articles published in data mining and epistasis. Our article confirms the effectiveness of machine learning in this genetics subfield.
Collapse
|
8
|
Protocol for Epistasis Detection with Machine Learning Using GenEpi Package. Methods Mol Biol 2021. [PMID: 33733363 DOI: 10.1007/978-1-0716-0947-7_18] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register]
Abstract
To develop medical treatments and prevention, the association between disease and genetic variants needs to be identified. The main goal of genome-wide association study (GWAS) is to discover the underlying reason for vulnerability to disease and utilize this knowledge for the development of prevention and treatment against these diseases. Given the methods available to address the scientific problems involved in the search for epistasis, there is not any standard for detecting epistasis, and this remains a problem due to limited statistical power. The GenEpi package is a Python package that uses a two-level workflow machine learning model to detect within-gene and cross-gene epistasis. This protocol chapter shows the usage of GenEpi with example data. The package uses a three-step procedure to reduce dimensionality, select the within-gene epistasis, and select the cross-gene epistasis. The package also provides a medium to build prediction models with the combination of genetic features and environmental influences.
Collapse
|
9
|
A differential evolution based feature combination selection algorithm for high-dimensional data. Inf Sci (N Y) 2021. [DOI: 10.1016/j.ins.2020.08.081] [Citation(s) in RCA: 12] [Impact Index Per Article: 4.0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/19/2022]
|
10
|
The Application of Artificial Intelligence in the Genetic Study of Alzheimer's Disease. Aging Dis 2020; 11:1567-1584. [PMID: 33269107 PMCID: PMC7673858 DOI: 10.14336/ad.2020.0312] [Citation(s) in RCA: 13] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/02/2020] [Accepted: 03/12/2020] [Indexed: 12/13/2022] Open
Abstract
Alzheimer's disease (AD) is a neurodegenerative disease in which genetic factors contribute approximately 70% of etiological effects. Studies have found many significant genetic and environmental factors, but the pathogenesis of AD is still unclear. With the application of microarray and next-generation sequencing technologies, research using genetic data has shown explosive growth. In addition to conventional statistical methods for the processing of these data, artificial intelligence (AI) technology shows obvious advantages in analyzing such complex projects. This article first briefly reviews the application of AI technology in medicine and the current status of genetic research in AD. Then, a comprehensive review is focused on the application of AI in the genetic research of AD, including the diagnosis and prognosis of AD based on genetic data, the analysis of genetic variation, gene expression profile, gene-gene interaction in AD, and genetic analysis of AD based on a knowledge base. Although many studies have yielded some meaningful results, they are still in a preliminary stage. The main shortcomings include the limitations of the databases, failing to take advantage of AI to conduct a systematic biology analysis of multilevel databases, and lack of a theoretical framework for the analysis results. Finally, we outlook the direction of future development. It is crucial to develop high quality, comprehensive, large sample size, data sharing resources; a multi-level system biology AI analysis strategy is one of the development directions, and computational creativity may play a role in theory model building, verification, and designing new intervention protocols for AD.
Collapse
|
11
|
Nature-Inspired Multiobjective Epistasis Elucidation from Genome-Wide Association Studies. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2020; 17:226-237. [PMID: 29994485 DOI: 10.1109/tcbb.2018.2849759] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 06/08/2023]
Abstract
In recent years, the detection of epistatic interactions of multiple genetic variants on the causes of complex diseases brings a significant challenge in genome-wide association studies (GWAS). However, most of the existing methods still suffer from algorithmic limitations such as single-objective optimization, intensive computational requirement, and premature convergence. In this paper, we propose and formulate an epistatic interaction multi-objective artificial bee colony algorithm based on decomposition (EIMOABC/D) to address those problems for genetic interaction detection in genome-wide association studies. First, to direct the genetic interaction detection, two objective functions are formulated to characterize various epistatic models; rank probability model is proposed to sort each population into different nondomination levels based on the fast nondominated sorting approach. After that, the mutual information based local search algorithm is proposed to guide the population search for disease model evaluations in an unbiased manner. To validate the effectiveness of EIMOABC/D, we compare EIMOABC/D against seven state-of-the-art methods on 77 epistatic models including eight small-scale epistatic models with marginal effects, eight large-scale epistatic models with marginal effects, 60 large-scale epistatic models without any marginal effect, and one case study. The experimental results indicate that our proposed algorithm EIMOABC/D outperforms seven state-of-the-art methods on those epistatic models. Furthermore, time complexity analysis and parameter analysis are conducted to demonstrate various properties of our proposed algorithm.
Collapse
|
12
|
Epi-GTBN: an approach of epistasis mining based on genetic Tabu algorithm and Bayesian network. BMC Bioinformatics 2019; 20:444. [PMID: 31455207 PMCID: PMC6712799 DOI: 10.1186/s12859-019-3022-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/04/2018] [Accepted: 08/07/2019] [Indexed: 12/31/2022] Open
Abstract
Background Mining epistatic loci which affects specific phenotypic traits is an important research issue in the field of biology. Bayesian network (BN) is a graphical model which can express the relationship between genetic loci and phenotype. Until now, it has been widely used into epistasis mining in many research work. However, this method has two disadvantages: low learning efficiency and easy to fall into local optimum. Genetic algorithm has the excellence of rapid global search and avoiding falling into local optimum. It is scalable and easy to integrate with other algorithms. This work proposes an epistasis mining approach based on genetic tabu algorithm and Bayesian network (Epi-GTBN). It uses genetic algorithm into the heuristic search strategy of Bayesian network. The individual structure can be evolved through the genetic operations of selection, crossover and mutation. It can help to find the optimal network structure, and then further to mine the epistasis loci effectively. In order to enhance the diversity of the population and obtain a more effective global optimal solution, we use the tabu search strategy into the operations of crossover and mutation in genetic algorithm. It can help to accelerate the convergence of the algorithm. Results We compared Epi-GTBN with other recent algorithms using both simulated and real datasets. The experimental results demonstrate that our method has much better epistasis detection accuracy in the case of not affecting the efficiency for different datasets. Conclusions The presented methodology (Epi-GTBN) is an effective method for epistasis detection, and it can be seen as an interesting addition to the arsenal used in complex traits analyses. Electronic supplementary material The online version of this article (10.1186/s12859-019-3022-z) contains supplementary material, which is available to authorized users.
Collapse
|
13
|
Bayesian Network Construction and Genotype-Phenotype Inference Using GWAS Statistics. IEEE/ACM TRANSACTIONS ON COMPUTATIONAL BIOLOGY AND BIOINFORMATICS 2019; 16:475-489. [PMID: 29990020 PMCID: PMC6499495 DOI: 10.1109/tcbb.2017.2779498] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 05/15/2023]
Abstract
Genome-wide association studies (GWASs) have received increasing attention to understand how genetic variation affects different human traits. In this paper, we study whether and to what extent exploiting the GWAS statistics can be used for inferring private information about a human individual. We first provide a method to construct a three-layered Bayesian network explicitly revealing the conditional dependency between single-nucleotide polymorphisms (SNPs) and traits from public GWAS catalog. The key challenge in building a Bayesian network from GWAS statistics is the specification of the conditional probability table of a variable with multiple parent variables. We employ the models of independence of causal influences which assume that the causal mechanism of each parent variable is mutually independent. We then formulate three inference problems based on the dependency relationship captured in the Bayesian network, namely trait inference given SNP genotype, genotype inference given trait, and trait inference given known traits, and develop efficient formulas and algorithms. Different from previous work, the possible target of these inference problems we study may be any individual, not limited to GWAS participants. Empirical evaluations show the effectiveness of our proposed methods. In summary, our work implies that meaningful information can be inferred from modeling GWAS statistics, and appropriate privacy protection mechanisms need to be developed to protect genetic privacy not only of GWAS participants but also regular individuals.
Collapse
|
14
|
Self-Adjusting Ant Colony Optimization Based on Information Entropy for Detecting Epistatic Interactions. Genes (Basel) 2019; 10:genes10020114. [PMID: 30717303 PMCID: PMC6409693 DOI: 10.3390/genes10020114] [Citation(s) in RCA: 11] [Impact Index Per Article: 2.2] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/14/2018] [Revised: 01/21/2019] [Accepted: 01/28/2019] [Indexed: 12/15/2022] Open
Abstract
The epistatic interactions of single nucleotide polymorphisms (SNPs) are considered to be an important factor in determining the susceptibility of individuals to complex diseases. Although many methods have been proposed to detect such interactions, the development of detection algorithm is still ongoing due to the computational burden in large-scale association studies. In this paper, to deal with the intensive computing problem of detecting epistatic interactions in large-scale datasets, a self-adjusting ant colony optimization based on information entropy (IEACO) is proposed. The algorithm can automatically self-adjust the path selection strategy according to the real-time information entropy. The performance of IEACO is compared with that of ant colony optimization (ACO), AntEpiSeeker, AntMiner, and epiACO on a set of simulated datasets and a real genome-wide dataset. The results of extensive experiments show that the proposed method is superior to the other methods.
Collapse
|
15
|
Ant colony optimization with an automatic adjustment mechanism for detecting epistatic interactions. Comput Biol Chem 2018; 77:354-362. [DOI: 10.1016/j.compbiolchem.2018.11.001] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/26/2018] [Revised: 10/01/2018] [Accepted: 11/05/2018] [Indexed: 12/13/2022]
|
16
|
A fast and exhaustive method for heterogeneity and epistasis analysis based on multi-objective optimization. Bioinformatics 2018; 33:2829-2836. [PMID: 28541468 DOI: 10.1093/bioinformatics/btx339] [Citation(s) in RCA: 29] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/02/2017] [Accepted: 05/20/2017] [Indexed: 12/29/2022] Open
Abstract
Motivation The existing epistasis analysis approaches have been criticized mainly for their: (i) ignoring heterogeneity during epistasis analysis; (ii) high computational costs; and (iii) volatility of performances and results. Therefore, they will not perform well in general, leading to lack of reproducibility and low power in complex disease association studies. In this work, a fast scheme is proposed to accelerate exhaustive searching based on multi-objective optimization named ESMO for concurrently analyzing heterogeneity and epistasis phenomena. In ESMO, mutual entropy and Bayesian network approaches are combined for evaluating epistatic SNP combinations. In order to be compatible with heterogeneity of complex diseases, we designed an adaptive framework based on non-dominant sort and top k selection algorithm with improved time complexity O(k*M*N) . Moreover, ESMO is accelerated by strategies such as trading space for time, calculation sharing and parallel computing. Finally, ESMO is nonparametric and model-free. Results We compared ESMO with other recent or classic methods using different evaluating measures. The experimental results show that our method not only can quickly handle epistasis, but also can effectively detect heterogeneity of complex population structures. Availability and implementation https://github.com/XiongLi2016/ESMO/tree/master/ESMO-common-master . Contact lx_hncs@163.com.
Collapse
|
17
|
Heterogeneity Analysis and Diagnosis of Complex Diseases Based on Deep Learning Method. Sci Rep 2018; 8:6155. [PMID: 29670206 PMCID: PMC5906634 DOI: 10.1038/s41598-018-24588-5] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/14/2018] [Accepted: 04/05/2018] [Indexed: 12/26/2022] Open
Abstract
Understanding genetic mechanism of complex diseases is a serious challenge. Existing methods often neglect the heterogeneity phenomenon of complex diseases, resulting in lack of power or low reproducibility. Addressing heterogeneity when detecting epistatic single nucleotide polymorphisms (SNPs) can enhance the power of association studies and improve prediction performance of complex diseases diagnosis. In this study, we propose a three-stage framework including epistasis detection, clustering and prediction to address both epistasis and heterogeneity of complex diseases based on deep learning method. The epistasis detection stage applies a multi-objective optimization method to find several candidate sets of epistatic SNPs which contribute to different subtypes of complex diseases. Then, a K-means clustering algorithm is used to define subtypes of the case group. Finally, a deep learning model has been trained for disease prediction based on graphics processing unit (GPU). Experimental results on pure and heterogeneous datasets show that our method has potential practicality and can serve as a possible alternative to other methods. Therefore, when epistasis and heterogeneity exist at the same time, our method is especially suitable for diagnosis of complex diseases.
Collapse
|
18
|
Abstract
BACKGROUND Recently, measuring phenotype similarity began to play an important role in disease diagnosis. Researchers have begun to pay attention to develop phenotype similarity measurement. However, existing methods ignore the interactions between phenotype-associated proteins, which may lead to inaccurate phenotype similarity. RESULTS We proposed a network-based method PhenoNet to calculate the similarity between phenotypes. We localized phenotypes in the network and calculated the similarity between phenotype-associated modules by modeling both the inter- and intra-similarity. CONCLUSIONS PhenoNet was evaluated on two independent evaluation datasets: gene ontology and gene expression data. The result shows that PhenoNet performs better than the state-of-art methods on all evaluation tests.
Collapse
|
19
|
epiACO - a method for identifying epistasis based on ant Colony optimization algorithm. BioData Min 2017; 10:23. [PMID: 28694848 PMCID: PMC5500974 DOI: 10.1186/s13040-017-0143-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/27/2016] [Accepted: 06/29/2017] [Indexed: 11/23/2022] Open
Abstract
Background Identifying epistasis or epistatic interactions, which refer to nonlinear interaction effects of single nucleotide polymorphisms (SNPs), is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Though many works have been done for identifying epistatic interactions, due to their methodological and computational challenges, the algorithmic development is still ongoing. Results In this study, a method epiACO is proposed to identify epistatic interactions, which based on ant colony optimization algorithm. Highlights of epiACO are the introduced fitness function Svalue, path selection strategies, and a memory based strategy. The Svalue leverages the advantages of both mutual information and Bayesian network to effectively and efficiently measure associations between SNP combinations and the phenotype. Two path selection strategies, i.e., probabilistic path selection strategy and stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. The memory based strategy is designed to retain candidate solutions found in the previous iterations, and compare them to solutions of the current iteration to generate new candidate solutions, yielding a more accurate way for identifying epistasis. Conclusions Experiments of epiACO and its comparison with other recent methods epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO are performed on both simulation data sets and a real data set of age-related macular degeneration. Results show that epiACO is promising in identifying epistasis and might be an alternative to existing methods.
Collapse
|
20
|
|
21
|
Improving risk management for violence in mental health services: a multimethods approach. PROGRAMME GRANTS FOR APPLIED RESEARCH 2016. [DOI: 10.3310/pgfar04160] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
BackgroundMental health professionals increasingly carry out risk assessments to prevent future violence by their patients. However, there are problems with accuracy and these assessments do not always translate into successful risk management.ObjectivesOur aim was to improve the accuracy of assessment and identify risk factors that are causal to be targeted by clinicians to ensure good risk management. Our objectives were to investigate key risks at the population level, construct new static and dynamic instruments, test validity and construct new models of risk management using Bayesian networks.Methods and resultsWe utilised existing data sets from two national and commissioned a survey to identify risk factors at the population level. We confirmed that certain mental health factors previously thought to convey risk were important in future assessments and excluded others from subsequent parts of the study. Using a first-episode psychosis cohort, we constructed a risk assessment instrument for men and women and showed important sex differences in pathways to violence. We included a 1-year follow-up of patients discharged from medium secure services and validated a previously developed risk assessment guide, the Medium Security Recidivism Assessment Guide (MSRAG). We found that it is essential to combine ratings from static instruments such as the MSRAG with dynamic risk factors. Static levels of risk have important modifying effects on dynamic risk factors for their effects on violence and we further demonstrated this using a sample of released prisoners to construct risk assessment instruments for violence, robbery, drugs and acquisitive convictions. We constructed a preliminary instrument including dynamic risk measures and validated this in a second large data set of released prisoners. Finally, we incorporated findings from the follow-up of psychiatric patients discharged from medium secure services and two samples of released prisoners to construct Bayesian models to guide clinicians in risk management.ConclusionsRisk factors for violence identified at the population level, including paranoid delusions and anxiety disorder, should be integrated in risk assessments together with established high-risk psychiatric morbidity such as substance misuse and antisocial personality disorder. The incorporation of dynamic factors resulted in improved accuracy, especially when combined in assessments using actuarial measures to obtain levels of risk using static factors. It is important to continue developing dynamic risk and protective measures with the aim of identifying factors that are causally related to violence. Only causal factors should be targeted in violence prevention interventions. Bayesian networks show considerable promise in developing software for clinicians to identify targets for intervention in the field. The Bayesian models developed in this programme are at the prototypical stage and require further programmer development into applications for use on tablets. These should be further tested in the field and then compared with structured professional judgement in a randomised controlled trial in terms of their effectiveness in preventing future violence.FundingThe National Institute for Health Research Programme Grants for Applied Research programme.
Collapse
|
22
|
Discovering causal interactions using Bayesian network scoring and information gain. BMC Bioinformatics 2016; 17:221. [PMID: 27230078 PMCID: PMC4880828 DOI: 10.1186/s12859-016-1084-8] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/27/2015] [Accepted: 05/14/2016] [Indexed: 11/17/2022] Open
Abstract
BACKGROUND The problem of learning causal influences from data has recently attracted much attention. Standard statistical methods can have difficulty learning discrete causes, which interacting to affect a target, because the assumptions in these methods often do not model discrete causal relationships well. An important task then is to learn such interactions from data. Motivated by the problem of learning epistatic interactions from datasets developed in genome-wide association studies (GWAS), researchers conceived new methods for learning discrete interactions. However, many of these methods do not differentiate a model representing a true interaction from a model representing non-interacting causes with strong individual affects. The recent algorithm MBS-IGain addresses this difficulty by using Bayesian network learning and information gain to discover interactions from high-dimensional datasets. However, MBS-IGain requires marginal effects to detect interactions containing more than two causes. If the dataset is not high-dimensional, we can avoid this shortcoming by doing an exhaustive search. RESULTS We develop Exhaustive-IGain, which is like MBS-IGain but does an exhaustive search. We compare the performance of Exhaustive-IGain to MBS-IGain using low-dimensional simulated datasets based on interactions with marginal effects and ones based on interactions without marginal effects. Their performance is similar on the datasets based on marginal effects. However, Exhaustive-IGain compellingly outperforms MBS-IGain on the datasets based on 3 and 4-cause interactions without marginal effects. We apply Exhaustive-IGain to investigate how clinical variables interact to affect breast cancer survival, and obtain results that agree with judgements of a breast cancer oncologist. CONCLUSIONS We conclude that the combined use of information gain and Bayesian network scoring enables us to discover higher order interactions with no marginal effects if we perform an exhaustive search. We further conclude that Exhaustive-IGain can be effective when applied to real data.
Collapse
|
23
|
FHSA-SED: Two-Locus Model Detection for Genome-Wide Association Study with Harmony Search Algorithm. PLoS One 2016; 11:e0150669. [PMID: 27014873 PMCID: PMC4807955 DOI: 10.1371/journal.pone.0150669] [Citation(s) in RCA: 29] [Impact Index Per Article: 3.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/01/2015] [Accepted: 02/16/2016] [Indexed: 12/24/2022] Open
Abstract
Motivation Two-locus model is a typical significant disease model to be identified in genome-wide association study (GWAS). Due to intensive computational burden and diversity of disease models, existing methods have drawbacks on low detection power, high computation cost, and preference for some types of disease models. Method In this study, two scoring functions (Bayesian network based K2-score and Gini-score) are used for characterizing two SNP locus as a candidate model, the two criteria are adopted simultaneously for improving identification power and tackling the preference problem to disease models. Harmony search algorithm (HSA) is improved for quickly finding the most likely candidate models among all two-locus models, in which a local search algorithm with two-dimensional tabu table is presented to avoid repeatedly evaluating some disease models that have strong marginal effect. Finally G-test statistic is used to further test the candidate models. Results We investigate our method named FHSA-SED on 82 simulated datasets and a real AMD dataset, and compare it with two typical methods (MACOED and CSE) which have been developed recently based on swarm intelligent search algorithm. The results of simulation experiments indicate that our method outperforms the two compared algorithms in terms of detection power, computation time, evaluation times, sensitivity (TPR), specificity (SPC), positive predictive value (PPV) and accuracy (ACC). Our method has identified two SNPs (rs3775652 and rs10511467) that may be also associated with disease in AMD dataset.
Collapse
|
24
|
Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences. BMC Bioinformatics 2016; 17:116. [PMID: 26940649 PMCID: PMC4778322 DOI: 10.1186/s12859-016-0959-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/20/2015] [Accepted: 02/19/2016] [Indexed: 11/10/2022] Open
Abstract
Background Ubiquitination is a very important process in protein post-translational modification, which has been widely investigated by biology scientists and researchers. Different experimental and computational methods have been developed to identify the ubiquitination sites in protein sequences. This paper aims at exploring computational machine learning methods for the prediction of ubiquitination sites using the physicochemical properties (PCPs) of amino acids in the protein sequences. Results We first establish six different ubiquitination data sets, whose records contain both ubiquitination sites and non-ubiquitination sites in variant numbers of protein sequence segments. In particular, to establish such data sets, protein sequence segments are extracted from the original protein sequences used in four published papers on ubiquitination, while 531 PCP features of each extracted protein sequence segment are calculated based on PCP values from AAindex (Amino Acid index database) by averaging PCP values of all amino acids on each segment. Various computational machine-learning methods, including four Bayesian network methods (i.e., Naïve Bayes (NB), Feature Selection NB (FSNB), Model Averaged NB (MANB), and Efficient Bayesian Multivariate Classifier (EBMC)) and three regression methods (i.e., Support Vector Machine (SVM), Logistic Regression (LR), and Least Absolute Shrinkage and Selection Operator (LASSO)), are then applied to the six established segment-PCP data sets. Five-fold cross-validation and the Area Under Receiver Operating Characteristic Curve (AUROC) are employed to evaluate the ubiquitination prediction performance of each method. Results demonstrate that the PCP data of protein sequences contain information that could be mined by machine learning methods for ubiquitination site prediction. The comparative results show that EBMC, SVM and LR perform better than other methods, and EBMC is the only method that can get AUCs greater than or equal to 0.6 for the six established data sets. Results also show EBMC tends to perform better for larger data. Conclusions Machine learning methods have been employed for the ubiquitination site prediction based on physicochemical properties of amino acids on protein sequences. Results demonstrate the effectiveness of using machine learning methodology to mine information from PCP data concerning protein sequences, as well as the superiority of EBMC, SVM and LR (especially EBMC) for the ubiquitination prediction compared to other methods. Electronic supplementary material The online version of this article (doi:10.1186/s12859-016-0959-z) contains supplementary material, which is available to authorized users.
Collapse
|
25
|
Learning Predictive Interactions Using Information Gain and Bayesian Network Scoring. PLoS One 2015; 10:e0143247. [PMID: 26624895 PMCID: PMC4666609 DOI: 10.1371/journal.pone.0143247] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/21/2015] [Accepted: 11/02/2015] [Indexed: 12/28/2022] Open
Abstract
BACKGROUND The problems of correlation and classification are long-standing in the fields of statistics and machine learning, and techniques have been developed to address these problems. We are now in the era of high-dimensional data, which is data that can concern billions of variables. These data present new challenges. In particular, it is difficult to discover predictive variables, when each variable has little marginal effect. An example concerns Genome-wide Association Studies (GWAS) datasets, which involve millions of single nucleotide polymorphism (SNPs), where some of the SNPs interact epistatically to affect disease status. Towards determining these interacting SNPs, researchers developed techniques that addressed this specific problem. However, the problem is more general, and so these techniques are applicable to other problems concerning interactions. A difficulty with many of these techniques is that they do not distinguish whether a learned interaction is actually an interaction or whether it involves several variables with strong marginal effects. METHODOLOGY/FINDINGS We address this problem using information gain and Bayesian network scoring. First, we identify candidate interactions by determining whether together variables provide more information than they do separately. Then we use Bayesian network scoring to see if a candidate interaction really is a likely model. Our strategy is called MBS-IGain. Using 100 simulated datasets and a real GWAS Alzheimer's dataset, we investigated the performance of MBS-IGain. CONCLUSIONS/SIGNIFICANCE When analyzing the simulated datasets, MBS-IGain substantially out-performed nine previous methods at locating interacting predictors, and at identifying interactions exactly. When analyzing the real Alzheimer's dataset, we obtained new results and results that substantiated previous findings. We conclude that MBS-IGain is highly effective at finding interactions in high-dimensional datasets. This result is significant because we have increasingly abundant high-dimensional data in many domains, and to learn causes and perform prediction/classification using these data, we often must first identify interactions.
Collapse
|
26
|
Evaluation of a two-stage framework for prediction using big genomic data. Brief Bioinform 2015; 16:912-21. [PMID: 25788325 PMCID: PMC4652616 DOI: 10.1093/bib/bbv010] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.7] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2014] [Revised: 02/05/2015] [Indexed: 01/13/2023] Open
Abstract
We are in the era of abundant 'big' or 'high-dimensional' data. These data afford us the opportunity to discover predictors of an event of interest, and to estimate occurrence of the event based on values of these predictors. For example, 'genome-wide association studies' examine millions of single-nucleotide polymorphisms (SNPs), along with disease status. We can learn SNPs that affect disease status from these data sets, and use the knowledge learned to predict disease likelihood. Owing to the large number of features, it is difficult for many prediction methods to use all the features directly. The ReliefF algorithm ranks a set of features in terms of how well they predict a target. It can be used to identify good predictors, which can then be provided to a prediction method. We compared the performance of eight prediction methods when predicting binary outcomes using high-dimensional discrete data sets. We performed two-stage prediction, where ReliefF is used in the first stage to identify good predictors. Bayesian network (BN)-based methods performed best overall. Furthermore, ReliefF did not improve their performance. The BN-based methods use the Bayesian Dirichlet Equivalent Uniform score to evaluate candidate models, and use BN inference algorithms to perform prediction. This score and these algorithms were developed for discrete variables. This perhaps explains why they perform better in this domain. Many prediction methods are available, and researchers have little reason for choosing one over the other in the domain of binary prediction using high-dimensional data sets. Our results indicate that the best choices overall are BN-based methods.
Collapse
|
27
|
The metabolomics of asthma control: a promising link between genetics and disease. IMMUNITY INFLAMMATION AND DISEASE 2015; 3:224-38. [PMID: 26421150 PMCID: PMC4578522 DOI: 10.1002/iid3.61] [Citation(s) in RCA: 65] [Impact Index Per Article: 7.2] [Reference Citation Analysis] [Abstract] [Key Words] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 03/02/2015] [Revised: 03/26/2015] [Accepted: 03/27/2015] [Indexed: 12/12/2022]
Abstract
Short-acting β agonists (e.g., albuterol) are the most commonly used medications for asthma, a disease that affects over 300 million people in the world. Metabolomic profiling of asthmatics taking β agonists presents a new and promising resource for identifying the molecular determinants of asthma control. The objective is to identify novel genetic and biochemical predictors of asthma control using an integrative "omics" approach. We generated lipidomic data by liquid chromatography tandem mass spectrometry (LC-MS), - using plasma samples from 20 individuals with asthma. The outcome of interest was a binary indicator of asthma control defined by the use of albuterol inhalers in the preceding week. We integrated metabolomic data with genome-wide genotype, gene expression, and methylation data of this cohort to identify genomic and molecular indicators of asthma control. A Conditional Gaussian Bayesian Network (CGBN) was generated using the strongest predictors from each of these analyses. Integrative and metabolic pathway over-representation analyses (ORA) identified enrichment of known biological pathways within the strongest molecular determinants. Of the 64 metabolites measured, 32 had known identities. The CGBN model based on four SNPs (rs9522789, rs7147228, rs2701423, rs759582) and two metabolites-monoHETE_0863 and sphingosine-1-phosphate (S1P) could predict asthma control with an AUC of 95%. Integrative ORA identified 17 significantly enriched pathways related to cellular immune response, interferon signaling, and cytokine-related signaling, for which arachidonic acid, PGE2 and S1P, in addition to six genes (CHN1, PRKCE, GNA12, OASL, OAS1, and IFIT3) appeared to drive the pathway results. Of these predictors, S1P, GNA12, and PRKCE were enriched in the results from integrative and metabolic ORAs. Through an integrative analysis of metabolomic, genomic, and methylation data from a small cohort of asthmatics, we implicate altered metabolic pathways, related to sphingolipid metabolism, in asthma control. These results provide insight into the pathophysiology of asthma control.
Collapse
|
28
|
LEAP: biomarker inference through learning and evaluating association patterns. Genet Epidemiol 2015; 39:173-84. [PMID: 25677188 PMCID: PMC4366363 DOI: 10.1002/gepi.21889] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/16/2014] [Revised: 12/16/2014] [Accepted: 01/06/2015] [Indexed: 01/22/2023]
Abstract
Single nucleotide polymorphism (SNP) high-dimensional datasets are available from Genome Wide Association Studies (GWAS). Such data provide researchers opportunities to investigate the complex genetic basis of diseases. Much of genetic risk might be due to undiscovered epistatic interactions, which are interactions in which combination of several genes affect disease. Research aimed at discovering interacting SNPs from GWAS datasets proceeded in two directions. First, tools were developed to evaluate candidate interactions. Second, algorithms were developed to search over the space of candidate interactions. Another problem when learning interacting SNPs, which has not received much attention, is evaluating how likely it is that the learned SNPs are associated with the disease. A complete system should provide this information as well. We develop such a system. Our system, called LEAP, includes a new heuristic search algorithm for learning interacting SNPs, and a Bayesian network based algorithm for computing the probability of their association. We evaluated the performance of LEAP using 100 1,000-SNP simulated datasets, each of which contains 15 SNPs involved in interactions. When learning interacting SNPs from these datasets, LEAP outperformed seven others methods. Furthermore, only SNPs involved in interactions were found to be probable. We also used LEAP to analyze real Alzheimer's disease and breast cancer GWAS datasets. We obtained interesting and new results from the Alzheimer's dataset, but limited results from the breast cancer dataset. We conclude that our results support that LEAP is a useful tool for extracting candidate interacting SNPs from high-dimensional datasets and determining their probability.
Collapse
|
29
|
Abstract
One of the challenges of understanding the genetic basis of complex phenotypes is explaining variability not attributable to individual genes. While most existing methods that investigate variant mutations or differential gene expression focus on individual effects, a complex system of gene interactions (epistasis) and pathways is likely needed to explain phenotypic variation. Herein, we examine methods for treating the interactions in these biological data sets as edges in a network model of the phenotype and review relevant network theory methods for analyzing network structure and identifying important genes. In particular, we review methods for detecting community structure, describing the statistical properties of networks, and computing network centrality of genes that may reveal insights missed by individual genetic effects. We also discuss available tools to facilitate the construction and visualization of epistasis networks of GWAS data.
Collapse
|
30
|
Identifying genetic interactions associated with late-onset Alzheimer's disease. BioData Min 2014; 7:35. [PMID: 25649863 PMCID: PMC4300162 DOI: 10.1186/s13040-014-0035-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/30/2014] [Accepted: 12/06/2014] [Indexed: 01/23/2023] Open
Abstract
Background Identifying genetic interactions in data obtained from genome-wide association studies (GWASs) can help in understanding the genetic basis of complex diseases. The large number of single nucleotide polymorphisms (SNPs) in GWASs however makes the identification of genetic interactions computationally challenging. We developed the Bayesian Combinatorial Method (BCM) that can identify pairs of SNPs that in combination have high statistical association with disease. Results We applied BCM to two late-onset Alzheimer’s disease (LOAD) GWAS datasets to identify SNPs that interact with known Alzheimer associated SNPs. We also compared BCM with logistic regression that is implemented in PLINK. Gene Ontology analysis of genes from the top 200 dataset SNPs for both GWAS datasets showed overrepresentation of LOAD-related terms. Four genes were common to both datasets: APOE and APOC1, which have well established associations with LOAD, and CAMK1D and FBXL13, not previously linked to LOAD but having evidence of involvement in LOAD. Supporting evidence was also found for additional genes from the top 30 dataset SNPs. Conclusion BCM performed well in identifying several SNPs having evidence of involvement in the pathogenesis of LOAD that would not have been identified by univariate analysis due to small main effect. These results provide support for applying BCM to identify potential genetic variants such as SNPs from high dimensional GWAS datasets. Electronic supplementary material The online version of this article (doi:10.1186/s13040-014-0035-z) contains supplementary material, which is available to authorized users.
Collapse
|
31
|
Bayesian neural networks for detecting epistasis in genetic association studies. BMC Bioinformatics 2014; 15:368. [PMID: 25413600 PMCID: PMC4256933 DOI: 10.1186/s12859-014-0368-0] [Citation(s) in RCA: 20] [Impact Index Per Article: 2.0] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/25/2014] [Accepted: 10/30/2014] [Indexed: 12/02/2022] Open
Abstract
Background Discovering causal genetic variants from large genetic association studies poses many difficult challenges. Assessing which genetic markers are involved in determining trait status is a computationally demanding task, especially in the presence of gene-gene interactions. Results A non-parametric Bayesian approach in the form of a Bayesian neural network is proposed for use in analyzing genetic association studies. Demonstrations on synthetic and real data reveal they are able to efficiently and accurately determine which variants are involved in determining case-control status. By using graphics processing units (GPUs) the time needed to build these models is decreased by several orders of magnitude. In comparison with commonly used approaches for detecting interactions, Bayesian neural networks perform very well across a broad spectrum of possible genetic relationships. Conclusions The proposed framework is shown to be a powerful method for detecting causal SNPs while being computationally efficient enough to handle large datasets. Electronic supplementary material The online version of this article (doi:10.1186/s12859-014-0368-0) contains supplementary material, which is available to authorized users.
Collapse
|
32
|
MACOED: a multi-objective ant colony optimization algorithm for SNP epistasis detection in genome-wide association studies. Bioinformatics 2014; 31:634-41. [PMID: 25338719 DOI: 10.1093/bioinformatics/btu702] [Citation(s) in RCA: 55] [Impact Index Per Article: 5.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023] Open
Abstract
MOTIVATION The existing methods for genetic-interaction detection in genome-wide association studies are designed from different paradigms, and their performances vary considerably for different disease models. One important reason for this variability is that their construction is based on a single-correlation model between SNPs and disease. Due to potential model preference and disease complexity, a single-objective method will therefore not work well in general, resulting in low power and a high false-positive rate. METHOD In this work, we present a multi-objective heuristic optimization methodology named MACOED for detecting genetic interactions. In MACOED, we combine both logistical regression and Bayesian network methods, which are from opposing schools of statistics. The combination of these two evaluation objectives proved to be complementary, resulting in higher power with a lower false-positive rate than observed for optimizing either objective independently. To solve the space and time complexity for high-dimension problems, a memory-based multi-objective ant colony optimization algorithm is designed in MACOED that is able to retain non-dominated solutions found in past iterations. RESULTS We compared MACOED with other recent algorithms using both simulated and real datasets. The experimental results demonstrate that our method outperforms others in both detection power and computational feasibility for large datasets. AVAILABILITY AND IMPLEMENTATION Codes and datasets are available at: www.csbio.sjtu.edu.cn/bioinf/MACOED/.
Collapse
|
33
|
Abstract
Bayesian networks are a type of probabilistic graphical models lie at the intersection between statistics and machine learning. They have been shown to be powerful tools to encode dependence relationships among the variables of a domain under uncertainty. Thanks to their generality, Bayesian networks can accommodate continuous and discrete variables, as well as temporal processes. In this paper we review Bayesian networks and how they can be learned automatically from data by means of structure learning algorithms. Also, we examine how a user can take advantage of these networks for reasoning by exact or approximate inference algorithms that propagate the given evidence through the graphical structure. Despite their applicability in many fields, they have been little used in neuroscience, where they have focused on specific problems, like functional connectivity analysis from neuroimaging data. Here we survey key research in neuroscience where Bayesian networks have been used with different aims: discover associations between variables, perform probabilistic reasoning over the model, and classify new observations with and without supervision. The networks are learned from data of any kind-morphological, electrophysiological, -omics and neuroimaging-, thereby broadening the scope-molecular, cellular, structural, functional, cognitive and medical- of the brain aspects to be studied.
Collapse
|
34
|
Inferring Aberrant Signal Transduction Pathways in Ovarian Cancer from TCGA Data. Cancer Inform 2014; 13:29-36. [PMID: 25392681 PMCID: PMC4216062 DOI: 10.4137/cin.s13881] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/25/2014] [Revised: 03/10/2014] [Accepted: 03/10/2014] [Indexed: 12/12/2022] Open
Abstract
This paper concerns a new method for identifying aberrant signal transduction pathways (STPs) in cancer using case/control gene expression-level datasets, and applying that method and an existing method to an ovarian carcinoma dataset. Both methods identify STPs that are plausibly linked to all cancers based on current knowledge. Thus, the paper is most appropriate for the cancer informatics community. Our hypothesis is that STPs that are altered in tumorous tissue can be identified by applying a new Bayesian network (BN)-based method (causal analysis of STP aberration (CASA)) and an existing method (signaling pathway impact analysis (SPIA)) to the cancer genome atlas (TCGA) gene expression-level datasets. To test this hypothesis, we analyzed 20 cancer-related STPs and 6 randomly chosen STPs using the 591 cases in the TCGA ovarian carcinoma dataset, and the 102 controls in all 5 TCGA cancer datasets. We identified all the genes related to each of the 26 pathways, and developed separate gene expression datasets for each pathway. The results of the two methods were highly correlated. Furthermore, many of the STPs that ranked highest according to both methods are plausibly linked to all cancers based on current knowledge. Finally, CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047), but SPIA did not (P = 0.083).
Collapse
|
35
|
A comparative analysis of methods for predicting clinical outcomes using high-dimensional genomic datasets. J Am Med Inform Assoc 2014; 21:e312-9. [PMID: 24737607 PMCID: PMC4173174 DOI: 10.1136/amiajnl-2013-002358] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 09/14/2013] [Revised: 02/20/2014] [Accepted: 03/14/2014] [Indexed: 11/03/2022] Open
Abstract
OBJECTIVE The objective of this investigation is to evaluate binary prediction methods for predicting disease status using high-dimensional genomic data. The central hypothesis is that the Bayesian network (BN)-based method called efficient Bayesian multivariate classifier (EBMC) will do well at this task because EBMC builds on BN-based methods that have performed well at learning epistatic interactions. METHOD We evaluate how well eight methods perform binary prediction using high-dimensional discrete genomic datasets containing epistatic interactions. The methods are as follows: naive Bayes (NB), model averaging NB (MANB), feature selection NB (FSNB), EBMC, logistic regression (LR), support vector machines (SVM), Lasso, and extreme learning machines (ELM). We use a hundred 1000-single nucleotide polymorphism (SNP) simulated datasets, ten 10,000-SNP datasets, six semi-synthetic sets, and two real genome-wide association studies (GWAS) datasets in our evaluation. RESULTS In fivefold cross-validation studies, the SVM performed best on the 1000-SNP dataset, while the BN-based methods performed best on the other datasets, with EBMC exhibiting the best overall performance. In-sample testing indicates that LR, SVM, Lasso, ELM, and NB tend to overfit the data. DISCUSSION EBMC performed better than NB when there are several strong predictors, whereas NB performed better when there are many weak predictors. Furthermore, for all BN-based methods, prediction capability did not degrade as the dimension increased. CONCLUSIONS Our results support the hypothesis that EBMC performs well at binary outcome prediction using high-dimensional discrete datasets containing epistatic-like interactions. Future research using more GWAS datasets is needed to further investigate the potential of EBMC.
Collapse
|
36
|
Modeling the altered expression levels of genes on signaling pathways in tumors as causal bayesian networks. Cancer Inform 2014; 13:77-84. [PMID: 24932098 PMCID: PMC4051800 DOI: 10.4137/cin.s13578] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/05/2013] [Revised: 11/25/2013] [Accepted: 11/25/2013] [Indexed: 01/05/2023] Open
Abstract
This paper concerns a study indicating that the expression levels of genes in signaling pathways can be modeled using a causal Bayesian network (BN) that is altered in tumorous tissue. These results open up promising areas of future research that can help identify driver genes and therapeutic targets. So, it is most appropriate for the cancer informatics community. Our central hypothesis is that the expression levels of genes that code for proteins on a signal transduction network (STP) are causally related and that this causal structure is altered when the STP is involved in cancer. To test this hypothesis, we analyzed 5 STPs associated with breast cancer, 7 STPs associated with other cancers, and 10 randomly chosen pathways, using a breast cancer gene expression level dataset containing 529 cases and 61 controls. We identified all the genes related to each of the 22 pathways and developed separate gene expression datasets for each pathway. We obtained significant results indicating that the causal structure of the expression levels of genes coding for proteins on STPs, which are believed to be implicated in both breast cancer and in all cancers, is more altered in the cases relative to the controls than the causal structure of the randomly chosen pathways.
Collapse
|
37
|
A novel artificial neural network method for biomedical prediction based on matrix pseudo-inversion. J Biomed Inform 2013; 48:114-21. [PMID: 24361387 DOI: 10.1016/j.jbi.2013.12.009] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/09/2013] [Revised: 12/06/2013] [Accepted: 12/11/2013] [Indexed: 12/13/2022]
Abstract
Biomedical prediction based on clinical and genome-wide data has become increasingly important in disease diagnosis and classification. To solve the prediction problem in an effective manner for the improvement of clinical care, we develop a novel Artificial Neural Network (ANN) method based on Matrix Pseudo-Inversion (MPI) for use in biomedical applications. The MPI-ANN is constructed as a three-layer (i.e., input, hidden, and output layers) feed-forward neural network, and the weights connecting the hidden and output layers are directly determined based on MPI without a lengthy learning iteration. The LASSO (Least Absolute Shrinkage and Selection Operator) method is also presented for comparative purposes. Single Nucleotide Polymorphism (SNP) simulated data and real breast cancer data are employed to validate the performance of the MPI-ANN method via 5-fold cross validation. Experimental results demonstrate the efficacy of the developed MPI-ANN for disease classification and prediction, in view of the significantly superior accuracy (i.e., the rate of correct predictions), as compared with LASSO. The results based on the real breast cancer data also show that the MPI-ANN has better performance than other machine learning methods (including support vector machine (SVM), logistic regression (LR), and an iterative ANN). In addition, experiments demonstrate that our MPI-ANN could be used for bio-marker selection as well.
Collapse
|
38
|
Synthetic sickness or lethality points at candidate combination therapy targets in glioblastoma. Int J Cancer 2013; 133:2123-32. [PMID: 23629686 DOI: 10.1002/ijc.28235] [Citation(s) in RCA: 27] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/13/2012] [Accepted: 04/11/2013] [Indexed: 12/30/2022]
Abstract
Synthetic lethal interactions in cancer hold the potential for successful combined therapies, which would avoid the difficulties of single molecule-targeted treatment. Identification of interactions that are specific for human tumors is an open problem in cancer research. This work aims at deciphering synthetic sick or lethal interactions directly from somatic alteration, expression and survival data of cancer patients. To this end, we look for pairs of genes and their alterations or expression levels that are "avoided" by tumors and "beneficial" for patients. Thus, candidates for synthetic sickness or lethality (SSL) interaction are identified as such gene pairs whose combination of states is under-represented in the data. Our main methodological contribution is a quantitative score that allows ranking of the candidate SSL interactions according to evidence found in patient survival. Applying this analysis to glioblastoma data, we collect 1,956 synthetic sick or lethal partners for 85 abundantly altered genes, most of which show extensive copy number variation across the patient cohort. We rediscover and interpret known interaction between TP53 and PLK1, as well as provide insight into the mechanism behind EGFR interacting with AKT2, but not AKT1 nor AKT3. Cox model analysis determines 274 of identified interactions as having significant impact on overall survival in glioblastoma, which is more informative than a standard survival predictor based on patient's age.
Collapse
|
39
|
An omnibus permutation test on ensembles of two-locus analyses can detect pure epistasis and genetic heterogeneity in genome-wide association studies. SPRINGERPLUS 2013; 2:230. [PMID: 24804170 PMCID: PMC4006521 DOI: 10.1186/2193-1801-2-230] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Received: 11/20/2012] [Accepted: 04/24/2013] [Indexed: 01/20/2023]
Abstract
This article presents the ability of an omnibus permutation test on ensembles of two-locus analyses (2LOmb) to detect pure epistasis in the presence of genetic heterogeneity. The performance of 2LOmb is evaluated in various simulation scenarios covering two independent causes of complex disease where each cause is governed by a purely epistatic interaction. Different scenarios are set up by varying the number of available single nucleotide polymorphisms (SNPs) in data, number of causative SNPs and ratio of case samples from two affected groups. The simulation results indicate that 2LOmb outperforms multifactor dimensionality reduction (MDR) and random forest (RF) techniques in terms of a low number of output SNPs and a high number of correctly-identified causative SNPs. Moreover, 2LOmb is capable of identifying the number of independent interactions in tractable computational time and can be used in genome-wide association studies. 2LOmb is subsequently applied to a type 1 diabetes mellitus (T1D) data set, which is collected from a UK population by the Wellcome Trust Case Control Consortium (WTCCC). After screening for SNPs that locate within or near genes and exhibit no marginal single-locus effects, the T1D data set is reduced to 95,991 SNPs from 12,146 genes. The 2LOmb search in the reduced T1D data set reveals that 12 SNPs, which can be divided into two independent sets, are associated with the disease. The first SNP set consists of three SNPs from MUC21 (mucin 21, cell surface associated), three SNPs from MUC22 (mucin 22), two SNPs from PSORS1C1 (psoriasis susceptibility 1 candidate 1) and one SNP from TCF19 (transcription factor 19). A four-locus interaction between these four genes is also detected. The second SNP set consists of three SNPs from ATAD1 (ATPase family, AAA domain containing 1). Overall, the findings indicate the detection of pure epistasis in the presence of genetic heterogeneity and provide an alternative explanation for the aetiology of T1D in the UK population.
Collapse
|
40
|
Prediction of P53 mutants (multiple sites) transcriptional activity based on structural (2D&3D) properties. PLoS One 2013; 8:e55401. [PMID: 23468845 PMCID: PMC3572112 DOI: 10.1371/journal.pone.0055401] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.5] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 11/19/2012] [Accepted: 12/21/2012] [Indexed: 01/05/2023] Open
Abstract
Prediction of secondary site mutations that reinstate mutated p53 to normalcy has been the focus of intense research in the recent past owing to the fact that p53 mutants have been implicated in more than half of all human cancers and restoration of p53 causes tumor regression. However laboratory investigations are more often laborious and resource intensive but computational techniques could well surmount these drawbacks. In view of this, we formulated a novel approach utilizing computational techniques to predict the transcriptional activity of multiple site (one-site to five-site) p53 mutants. The optimal MCC obtained by the proposed approach on prediction of one-site, two-site, three-site, four-site and five-site mutants were 0.775,0.341,0.784,0.916 and 0.655 respectively, the highest reported thus far in literature. We have also demonstrated that 2D and 3D features generate higher prediction accuracy of p53 activity and our findings revealed the optimal results for prediction of p53 status, reported till date. We believe detection of the secondary site mutations that suppress tumor growth may facilitate better understanding of the relationship between p53 structure and function and further knowledge on the molecular mechanisms and biological activity of p53, a targeted source for cancer therapy. We expect that our prediction methods and reported results may provide useful insights on p53 functional mechanisms and generate more avenues for utilizing computational techniques in biological data analysis.
Collapse
|
41
|
Genetic studies of complex human diseases: characterizing SNP-disease associations using Bayesian networks. BMC SYSTEMS BIOLOGY 2012; 6 Suppl 3:S14. [PMID: 23281790 PMCID: PMC3524021 DOI: 10.1186/1752-0509-6-s3-s14] [Citation(s) in RCA: 39] [Impact Index Per Article: 3.3] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Download PDF] [Figures] [Subscribe] [Scholar Register] [Indexed: 12/28/2022]
Abstract
BACKGROUND Detecting epistatic interactions plays a significant role in improving pathogenesis, prevention, diagnosis, and treatment of complex human diseases. Applying machine learning or statistical methods to epistatic interaction detection will encounter some common problems, e.g., very limited number of samples, an extremely high search space, a large number of false positives, and ways to measure the association between disease markers and the phenotype. RESULTS To address the problems of computational methods in epistatic interaction detection, we propose a score-based Bayesian network structure learning method, EpiBN, to detect epistatic interactions. We apply the proposed method to both simulated datasets and three real disease datasets. Experimental results on simulation data show that our method outperforms some other commonly-used methods in terms of power and sample-efficiency, and is especially suitable for detecting epistatic interactions with weak or no marginal effects. Furthermore, our method is scalable to real disease data. CONCLUSIONS We propose a Bayesian network-based method, EpiBN, to detect epistatic interactions. In EpiBN, we develop a new scoring function, which can reflect higher-order epistatic interactions by estimating the model complexity from data, and apply a fast Branch-and-Bound algorithm to learn the structure of a two-layer Bayesian network containing only one target node. To make our method scalable to real data, we propose the use of a Markov chain Monte Carlo (MCMC) method to perform the screening process. Applications of the proposed method to some real GWAS (genome-wide association studies) datasets may provide helpful insights into understanding the genetic basis of Age-related Macular Degeneration, late-onset Alzheimer's disease, and autism.
Collapse
|
42
|
Mining pure, strict epistatic interactions from high-dimensional datasets: ameliorating the curse of dimensionality. PLoS One 2012; 7:e46771. [PMID: 23071633 PMCID: PMC3470561 DOI: 10.1371/journal.pone.0046771] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 07/01/2012] [Accepted: 09/07/2012] [Indexed: 01/13/2023] Open
Abstract
BACKGROUND The interaction between loci to affect phenotype is called epistasis. It is strict epistasis if no proper subset of the interacting loci exhibits a marginal effect. For many diseases, it is likely that unknown epistatic interactions affect disease susceptibility. A difficulty when mining epistatic interactions from high-dimensional datasets concerns the curse of dimensionality. There are too many combinations of SNPs to perform an exhaustive search. A method that could locate strict epistasis without an exhaustive search can be considered the brass ring of methods for analyzing high-dimensional datasets. METHODOLOGY/FINDINGS A SNP pattern is a Bayesian network representing SNP-disease relationships. The Bayesian score for a SNP pattern is the probability of the data given the pattern, and has been used to learn SNP patterns. We identified a bound for the score of a SNP pattern. The bound provides an upper limit on the Bayesian score of any pattern that could be obtained by expanding a given pattern. We felt that the bound might enable the data to say something about the promise of expanding a 1-SNP pattern even when there are no marginal effects. We tested the bound using simulated datasets and semi-synthetic high-dimensional datasets obtained from GWAS datasets. We found that the bound was able to dramatically reduce the search time for strict epistasis. Using an Alzheimer's dataset, we showed that it is possible to discover an interaction involving the APOE gene based on its score because of its large marginal effect, but that the bound is most effective at discovering interactions without marginal effects. CONCLUSIONS/SIGNIFICANCE We conclude that the bound appears to ameliorate the curse of dimensionality in high-dimensional datasets. This is a very consequential result and could be pivotal in our efforts to reveal the dark matter of genetic disease risk from high-dimensional datasets.
Collapse
|
43
|
Dynamic network-based epistasis analysis: boolean examples. FRONTIERS IN PLANT SCIENCE 2011; 2:92. [PMID: 22645556 PMCID: PMC3355816 DOI: 10.3389/fpls.2011.00092] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Subscribe] [Scholar Register] [Received: 05/03/2011] [Accepted: 11/17/2011] [Indexed: 05/21/2023]
Abstract
In this article we focus on how the hierarchical and single-path assumptions of epistasis analysis can bias the inference of gene regulatory networks. Here we emphasize the critical importance of dynamic analyses, and specifically illustrate the use of Boolean network models. Epistasis in a broad sense refers to gene interactions, however, as originally proposed by Bateson, epistasis is defined as the blocking of a particular allelic effect due to the effect of another allele at a different locus (herein, classical epistasis). Classical epistasis analysis has proven powerful and useful, allowing researchers to infer and assign directionality to gene interactions. As larger data sets are becoming available, the analysis of classical epistasis is being complemented with computer science tools and system biology approaches. We show that when the hierarchical and single-path assumptions are not met in classical epistasis analysis, the access to relevant information and the correct inference of gene interaction topologies is hindered, and it becomes necessary to consider the temporal dynamics of gene interactions. The use of dynamical networks can overcome these limitations. We particularly focus on the use of Boolean networks that, like classical epistasis analysis, relies on logical formalisms, and hence can complement classical epistasis analysis and relax its assumptions. We develop a couple of theoretical examples and analyze them from a dynamic Boolean network model perspective. Boolean networks could help to guide additional experiments and discern among alternative regulatory schemes that would be impossible or difficult to infer without the elimination of these assumption from the classical epistasis analysis. We also use examples from the literature to show how a Boolean network-based approach has resolved ambiguities and guided epistasis analysis. Our article complements previous accounts, not only by focusing on the implications of the hierarchical and single-path assumption, but also by demonstrating the importance of considering temporal dynamics, and specifically introducing the usefulness of Boolean network models and also reviewing some key properties of network approaches.
Collapse
|
44
|
Performance analysis of novel methods for detecting epistasis. BMC Bioinformatics 2011; 12:475. [PMID: 22172045 PMCID: PMC3259123 DOI: 10.1186/1471-2105-12-475] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/28/2011] [Accepted: 12/15/2011] [Indexed: 02/03/2023] Open
Abstract
Background Epistasis is recognized fundamentally important for understanding the mechanism of disease-causing genetic variation. Though many novel methods for detecting epistasis have been proposed, few studies focus on their comparison. Undertaking a comprehensive comparison study is an urgent task and a pathway of the methods to real applications. Results This paper aims at a comparison study of epistasis detection methods through applying related software packages on datasets. For this purpose, we categorize methods according to their search strategies, and select five representative methods (TEAM, BOOST, SNPRuler, AntEpiSeeker and epiMODE) originating from different underlying techniques for comparison. The methods are tested on simulated datasets with different size, various epistasis models, and with/without noise. The types of noise include missing data, genotyping error and phenocopy. Performance is evaluated by detection power (three forms are introduced), robustness, sensitivity and computational complexity. Conclusions None of selected methods is perfect in all scenarios and each has its own merits and limitations. In terms of detection power, AntEpiSeeker performs best on detecting epistasis displaying marginal effects (eME) and BOOST performs best on identifying epistasis displaying no marginal effects (eNME). In terms of robustness, AntEpiSeeker is robust to all types of noise on eME models, BOOST is robust to genotyping error and phenocopy on eNME models, and SNPRuler is robust to phenocopy on eME models and missing data on eNME models. In terms of sensitivity, AntEpiSeeker is the winner on eME models and both SNPRuler and BOOST perform well on eNME models. In terms of computational complexity, BOOST is the fastest among the methods. In terms of overall performance, AntEpiSeeker and BOOST are recommended as the efficient and effective methods. This comparison study may provide guidelines for applying the methods and further clues for epistasis detection.
Collapse
|
45
|
Evaluating de novo locus-disease discoveries in GWAS using the signal-to-noise ratio. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:617-624. [PMID: 22195117 PMCID: PMC3243170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Subscribe] [Scholar Register] [Indexed: 05/31/2023]
Abstract
A genome-wide association study (GWAS) involves examining representative SNPs obtained using high throughput technologies. A GWAS data set can entail a million SNPs and may soon entail many millions. In a GWAS researchers often investigate the correlation of each SNP with a disease. With so many hypotheses, it is not straightforward how to interpret the results. Strategies include using the Bonferroni correction to determine the significance of a model and Bayesian methods. However, when we are discovering new locus-disease associations, i.e., so called de novo discoveries, we should not just endeavor to determine the significance of particular models, but also concern ourselves with determining whether it is likely that we have any true discoveries, and if so how many of the highest ranking models we should investigate further. We develop a method based on a signal-to-noise ratio that targets this issue. We apply the method to a GWAS Alzheimer's data set.
Collapse
|
46
|
A bayesian method for evaluating and discovering disease loci associations. PLoS One 2011; 6:e22075. [PMID: 21853025 PMCID: PMC3154195 DOI: 10.1371/journal.pone.0022075] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/17/2011] [Accepted: 06/14/2011] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND A genome-wide association study (GWAS) typically involves examining representative SNPs in individuals from some population. A GWAS data set can concern a million SNPs and may soon concern billions. Researchers investigate the association of each SNP individually with a disease, and it is becoming increasingly commonplace to also analyze multi-SNP associations. Techniques for handling so many hypotheses include the Bonferroni correction and recently developed bayesian methods. These methods can encounter problems. Most importantly, they are not applicable to a complex multi-locus hypothesis which has several competing hypotheses rather than only a null hypothesis. A method that computes the posterior probability of complex hypotheses is a pressing need. METHODOLOGY/FINDINGS We introduce the bayesian network posterior probability (BNPP) method which addresses the difficulties. The method represents the relationship between a disease and SNPs using a directed acyclic graph (DAG) model, and computes the likelihood of such models using a bayesian network scoring criterion. The posterior probability of a hypothesis is computed based on the likelihoods of all competing hypotheses. The BNPP can not only be used to evaluate a hypothesis that has previously been discovered or suspected, but also to discover new disease loci associations. The results of experiments using simulated and real data sets are presented. Our results concerning simulated data sets indicate that the BNPP exhibits both better evaluation and discovery performance than does a p-value based method. For the real data sets, previous findings in the literature are confirmed and additional findings are found. CONCLUSIONS/SIGNIFICANCE We conclude that the BNPP resolves a pressing problem by providing a way to compute the posterior probability of complex multi-locus hypotheses. A researcher can use the BNPP to determine the expected utility of investigating a hypothesis further. Furthermore, we conclude that the BNPP is a promising method for discovering disease loci associations.
Collapse
|