1
SNPs-Panel Polymorphism Variations in GHRL and GHSR Genes Are Not Associated with Prostate Cancer. Biomedicines 2023; 11:3276. [PMID: 38137497 PMCID: PMC10741232 DOI: 10.3390/biomedicines11123276] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/02/2023] [Revised: 12/05/2023] [Accepted: 12/09/2023] [Indexed: 12/24/2023] Open
Abstract
Prostate cancer (PCa) is a major public health problem worldwide. Recent studies have suggested that ghrelin and its receptor could be involved in the susceptibility to several cancers such as PCa, leading to their use as important predictors of the clinical progression and prognosis of cancer. However, different studies have reported conflicting results for single nucleotide polymorphisms (SNPs) in the ghrelin (GHRL) and ghrelin receptor (GHSR) genes. Thus, the present case-control study was undertaken to investigate the association of GHRL and GHSR polymorphisms with the susceptibility to sporadic PCa. A cohort of 120 PCa patients and 95 healthy subjects was enrolled in this study. Genotyping of six SNPs was performed using TaqMan: three tag SNPs in GHRL (rs696217, rs4684677, rs3491141) and three tag SNPs in GHSR (rs2922126, rs572169, rs2948694). The allele and genotype distributions, as well as haplotype frequencies and linkage disequilibrium (LD), were established. Multifactor dimensionality reduction (MDR) analysis was used to study gene-gene interactions between the six SNPs. Our results showed no significant association of the target polymorphisms with PCa (p > 0.05). Nevertheless, SNPs are often just markers that help identify or delimit specific genomic regions that may harbour functional variants rather than the variants causing the disease. Furthermore, we found that one GHSR rs2922126 genotype, namely TT, was significantly more frequent in PCa patients than in controls (p = 0.040). These data suggest that this genotype could be a PCa susceptibility genotype. MDR analyses revealed that the rs2922126 and rs572169 combination was the best model, with 81.08% accuracy (p = 0.0001) for predicting susceptibility to PCa. The results also showed a precision of 98.1% (p < 0.0001) and a PR-AUC of 1.00.
Our findings provide new insights into the influence of GHRL and GHSR polymorphisms and significant evidence for gene-gene interactions in PCa susceptibility, and they may guide clinical decision-making to prevent overtreatment and enhance patients' quality of life.
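The MDR step mentioned in this abstract can be illustrated with a minimal sketch. It assumes the usual formulation, in which each two-SNP genotype cell is labelled high risk when its case:control ratio exceeds the overall ratio. The function name and the toy genotype labels are hypothetical, and the cross-validation and permutation testing that a full MDR analysis uses are omitted.

```python
from collections import Counter


def mdr_two_snp(genotypes_a, genotypes_b, is_case):
    """Label each two-SNP genotype cell as high or low risk.

    Returns (high_risk_cells, balanced_accuracy), evaluated on the same
    data the labels were derived from (no cross-validation here).
    """
    n_case = sum(1 for y in is_case if y)
    n_ctrl = len(is_case) - n_case
    if n_case == 0 or n_ctrl == 0:
        raise ValueError("need at least one case and one control")
    threshold = n_case / n_ctrl  # overall case:control ratio

    cases, controls = Counter(), Counter()
    for a, b, y in zip(genotypes_a, genotypes_b, is_case):
        (cases if y else controls)[(a, b)] += 1

    # A cell is high risk when its own case:control ratio exceeds the threshold.
    high_risk = {
        cell
        for cell in set(cases) | set(controls)
        if cases[cell] > threshold * controls[cell]
    }

    true_pos = sum(cases[cell] for cell in high_risk)
    true_neg = sum(n for cell, n in controls.items() if cell not in high_risk)
    balanced_accuracy = 0.5 * (true_pos / n_case + true_neg / n_ctrl)
    return high_risk, balanced_accuracy


# Hypothetical toy genotypes, not data from the study.
snp_a = ["TT", "TT", "CT", "CC", "CC", "CT"]
snp_b = ["GG", "GG", "GA", "GA", "AA", "AA"]
status = [1, 1, 1, 0, 0, 0]
cells, accuracy = mdr_two_snp(snp_a, snp_b, status)
```

Because the accuracy here is computed on the training data, it is optimistic; reported MDR accuracies are normally cross-validated.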
2
Identification of potential key genes and functional role of CENPF in osteosarcoma using bioinformatics and experimental analysis. Exp Ther Med 2021; 23:80. [PMID: 34934449 PMCID: PMC8652394 DOI: 10.3892/etm.2021.11003] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.7] [Received: 01/14/2020] [Accepted: 09/21/2021] [Indexed: 11/25/2022] Open
Abstract
Osteosarcoma, which arises from bone tissue, is considered to be one of the most common types of cancer in children and teenagers. As the etiology of osteosarcoma has not been fully elucidated, the overall prognosis for patients is generally poor. In recent years, the development of bioinformatical technology has allowed researchers to identify numerous molecular biological characteristics associated with the prognosis of osteosarcoma using online databases. In the present study, Gene Expression Omnibus (GEO) database was used and three microarray datasets were obtained. The GEO2R web tool was utilized and differentially expressed genes (DEGs) in osteosarcoma tissue were identified. Venn analysis was performed to determine the intersection of the DEG profiles. DEGs were analyzed by Gene Ontology function and Kyoto Encyclopedia of Genes and Genomes pathway enrichment analysis. Protein-protein interactions (PPIs) between these DEGs were analyzed using the Search Tool for the Retrieval of Interacting Genes database, and the PPI network was then visualized using Cytoscape software. The top ten genes were identified based on measurement of degree, density of maximum neighborhood component, maximal clique centrality and mononuclear cell counts in the PPI network, and five overlapping genes [origin recognition complex subunit 6 (ORC6), IGF-binding protein 5 (IGFBP5), minichromosome maintenance 10 replication initiation factor (MCM10), MET proto-oncogene, receptor tyrosine kinase (MET) and centromere protein F (CENPF)] were identified. Additionally, three module networks were analyzed by Molecular Complex Detection (MCODE), and six key genes [ORC6, MCM10, DEP domain containing 1 (DEPDC1), CENPF, TIMELESS interacting protein (TIPIN) and shugoshin 1 (SGOL1)] were screened. Combined with the results from Cytoscape and MCODE, eight hub genes (ORC6, MCM10, DEPDC1, CENPF, TIPIN, SGOL1, MET and IGFBP5) were obtained. 
Furthermore, Kaplan-Meier plotter survival analysis was used to evaluate the prognostic value of these eight hub genes in patients with osteosarcoma. Oncomine and GEPIA databases were applied to further confirm the expression levels of hub genes in tissue. Finally, the functional roles of the core gene CENPF were investigated using Cell Counting Kit-8, wound healing and Transwell assays, which indicated that CENPF knockdown inhibited the proliferation, migration and invasion of osteosarcoma cells. These results provided potential prognostic markers, as well as a basis for further investigation of the mechanism underlying osteosarcoma.
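The Kaplan-Meier analysis referred to here rests on the product-limit estimator. The sketch below shows that estimator in plain Python under simple assumptions: right-censored data only, and censored subjects at a given time still counted as at risk at that time. The function name and the toy follow-up times are hypothetical and unrelated to the study data.

```python
def kaplan_meier(times, events):
    """Product-limit survival estimate.

    times[i]  : follow-up time of subject i
    events[i] : True if the event was observed, False if censored
    Returns a list of (time, survival_probability) at each event time.
    """
    at_risk = len(times)
    survival = 1.0
    curve = []
    for t in sorted(set(times)):
        deaths = sum(1 for ti, e in zip(times, events) if ti == t and e)
        leaving = sum(1 for ti in times if ti == t)  # events plus censored
        if deaths:
            survival *= 1 - deaths / at_risk
            curve.append((t, survival))
        at_risk -= leaving
    return curve


# Hypothetical follow-up times (arbitrary units) and event indicators.
curve = kaplan_meier([1, 2, 2, 3, 4], [True, True, False, True, False])
```

Comparing two such curves (for example high versus low expression of a hub gene) would additionally need a log-rank test, which is not shown.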
3
Diagnostic and Prognostic Significance of Keap1 mRNA Expression for Lung Cancer Based on Microarray and Clinical Information from Oncomine Database. Curr Med Sci 2021; 41:597-609. [PMID: 34169426 DOI: 10.1007/s11596-021-2378-2] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 11/25/2019] [Accepted: 01/21/2021] [Indexed: 11/29/2022]
Abstract
We performed a bioinformatics analysis with validation by multiple databases, aiming to evaluate the diagnostic and prognostic value of Kelch-like ECH-associated protein 1 (Keap1) mRNA for lung cancer, and to explore possible mechanisms. Diagnostic performance of Keap1 mRNA was determined by receiver operating characteristic (ROC) curve analysis. Prognostic implication of Keap1 mRNA was estimated by Kaplan-Meier survival analysis. Co-expressed genes with both Keap1 and Nfe2L2 were identified by LinkedOmics. Mechanisms of Keap1-Nfe2L2-co-expressed genes underlying the pathogenesis of lung cancer were explored by function enrichment and pathway analysis. The ROC curve analysis determined a good diagnostic performance of Keap1 mRNA for lung squamous cell carcinoma (LUSC), with an area under the ROC curve (AUC) of 0.833, sensitivity of 72.7%, and specificity of 90.6% (P<0.001). Multivariate Cox regression recognized high Keap1 mRNA to be an independent risk factor of mortality for overall lung cancer [hazard ratio (HR): 11.034, P=0.044], but an independent antagonistic factor for lung adenocarcinoma (LUAD) (HR: 0.404, P<0.001). Validation by UALCAN and GEPIA supported Oncomine findings regarding the diagnostic value of Keap1 mRNA for LUSC, but denied its prognostic value. After screening, we identified 17 co-expressed genes with both Keap1 and Nfe2L2 for LUAD, and 22 for LUSC, mainly enriched in signaling pathway of oxidative stress-induced gene expression via Nrf2. In conclusion, Keap1 mRNA has a good diagnostic performance, but controversial prognostic efficacy for LUSC. The pathogenesis of lung cancer is associated with Keap1-Nfe2L2-co-expressed genes by signaling pathway of oxidative stress-induced gene expression via Nrf2.
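The diagnostic metrics quoted in this abstract (AUC, sensitivity, specificity) can be computed as in the following minimal sketch. It assumes that higher scores indicate disease, which may not match the direction of Keap1 expression in the study, and it uses the rank-based (Mann-Whitney) form of the AUC. Function names and toy scores are hypothetical.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    if not pos or not neg:
        raise ValueError("need both positive and negative samples")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def sensitivity_specificity(scores, labels, cutoff):
    """Sensitivity and specificity when score >= cutoff is called positive."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    sensitivity = sum(1 for s in pos if s >= cutoff) / len(pos)
    specificity = sum(1 for s in neg if s < cutoff) / len(neg)
    return sensitivity, specificity


# Hypothetical expression scores; 1 = tumour, 0 = normal.
scores = [0.9, 0.8, 0.4, 0.3, 0.7, 0.2]
labels = [1, 1, 1, 0, 0, 0]
auc = roc_auc(scores, labels)
sens, spec = sensitivity_specificity(scores, labels, cutoff=0.5)
```

The pairwise loop is quadratic in sample size, which is fine for illustration; a rank-sum implementation would be preferable for large cohorts.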
4
Exploring a Local Genetic Interaction Network Using Evolutionary Replay Experiments. Mol Biol Evol 2021; 38:3144-3152. [PMID: 33749796 PMCID: PMC8321538 DOI: 10.1093/molbev/msab087] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Indexed: 12/18/2022] Open
Abstract
Understanding how genes interact is a central challenge in biology. Experimental evolution provides a useful, but underutilized, tool for identifying genetic interactions, particularly those that involve non-loss-of-function mutations or mutations in essential genes. We previously identified a strong positive genetic interaction between specific mutations in KEL1 (P344T) and HSL7 (A695fs) that arose in an experimentally evolved Saccharomyces cerevisiae population. Because this genetic interaction is not phenocopied by gene deletion, it was previously unknown. Using “evolutionary replay” experiments, we identified additional mutations that have positive genetic interactions with the kel1-P344T mutation. We replayed the evolution of this population 672 times from six timepoints. We identified 30 populations where the kel1-P344T mutation reached high frequency. We performed whole-genome sequencing on these populations to identify genes in which mutations arose specifically in the kel1-P344T background. We reconstructed mutations in the ancestral and kel1-P344T backgrounds to validate positive genetic interactions. We identify several genetic interactors with KEL1, we validate these interactions by reconstruction experiments, and we show these interactions are not recapitulated by loss-of-function mutations. Our results demonstrate the power of experimental evolution to identify genetic interactions that are positive, allele specific, and not readily detected by other methods, shedding light on an underexplored region of the yeast genetic interaction network.
5
Robust Sampling of Defective Pathways in Alzheimer's Disease. Implications in Drug Repositioning. Int J Mol Sci 2020; 21:ijms21103594. [PMID: 32438758 PMCID: PMC7279419 DOI: 10.3390/ijms21103594] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 04/27/2020] [Revised: 05/09/2020] [Accepted: 05/13/2020] [Indexed: 12/21/2022] Open
Abstract
We present an analysis of the defective genetic pathways of Late-Onset Alzheimer’s Disease (LOAD) compared to Mild Cognitive Impairment (MCI) and Healthy Controls (HC) using different sampling methodologies. These algorithms sample the uncertainty space that is intrinsic to any kind of highly underdetermined phenotype prediction problem, by looking for the minimum-scale signatures (header genes) corresponding to different random holdouts. The biological pathways can be identified by performing posterior analysis of these signatures, established via cross-validation holdouts, and plugging the set of most frequently sampled genes into different ontological platforms. That way, the effect of helper genes, whose presence might be due to the high degree of underdeterminacy of these experiments and data noise, is reduced. Our results suggest that common pathways for Alzheimer’s disease and MCI are mainly related to viral mRNA translation, influenza viral RNA transcription and replication, gene expression, mitochondrial translation, and metabolism, with these results being highly consistent regardless of the comparative methods. The cross-validated predictive accuracies achieved for the LOAD and MCI discriminations were 84% and 81.5%, respectively. The difference between LOAD and MCI could not be clearly established (74% accuracy). The most discriminatory genes of the LOAD-MCI discrimination are associated with proteasome mediated degradation and G-protein signaling. Based on these findings we have also performed drug repositioning using Dr. Insight package, proposing the following different typologies of drugs: isoquinoline alkaloids, antitumor antibiotics, phosphoinositide 3-kinase PI3K, autophagy inhibitors, antagonists of the muscarinic acetylcholine receptor and histone deacetylase inhibitors. We believe that the potential clinical relevance of these findings should be further investigated and confirmed with other independent studies.
6
On the Role of Artificial Intelligence in Genomics to Enhance Precision Medicine. PHARMACOGENOMICS & PERSONALIZED MEDICINE 2020; 13:105-119. [PMID: 32256101 PMCID: PMC7090191 DOI: 10.2147/pgpm.s205082] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 12/02/2019] [Accepted: 02/17/2020] [Indexed: 12/21/2022]
Abstract
The complexity of orphan diseases, which are those that do not have an effective treatment, together with the high dimensionality of the genetic data used for their analysis and the high degree of uncertainty in the understanding of the mechanisms and genetic pathways which are involved in their development, motivate the use of advanced techniques of artificial intelligence and in-depth knowledge of molecular biology, which is crucial in order to find plausible solutions in drug design, including drug repositioning. Particularly, we show that the use of robust deep sampling methodologies of the altered genetics serves to obtain meaningful results and dramatically decreases the cost of research and development in drug design, influencing very positively the use of precision medicine and the outcomes in patients. The target-centric approach and the use of strong prior hypotheses that are not matched against reality (disease genetic data) are undoubtedly the cause of the high number of drug design failures and attrition rates. Sampling and prediction under uncertain conditions cannot be avoided in the development of precision medicine.
7
Abstract
Background Phenotype prediction problems are usually considered ill-posed, as the number of samples is very limited with respect to the scrutinized genetic probes. This fact complicates the sampling of the defective genetic pathways due to the high number of possible discriminatory genetic networks involved. In this research, we outline three novel sampling algorithms used to identify, classify and characterize the defective pathways in phenotype prediction problems, namely the Fisher’s ratio sampler, the Holdout sampler and the Random sampler, and apply each one to the analysis of genetic pathways involved in tumor behavior and outcomes of triple negative breast cancers (TNBC). Altered biological pathways are identified using the most frequently sampled genes and are compared to those obtained via Bayesian Networks (BNs). Results Random, Fisher’s ratio and Holdout samplers were more accurate and robust than BNs, while providing comparable insights about disease genomics. Conclusions The three samplers tested are good alternatives to Bayesian Networks since they are less computationally demanding algorithms. Importantly, this analysis confirms the concept of “biological invariance” since the altered pathways should be independent of the sampling methodology and the classifier used for their inference. Nevertheless, some modifications to the Bayesian networks are still needed for them to sample the uncertainty space in phenotype prediction problems correctly, since the probabilistic parameterization of the uncertainty space is not unique and the use of the optimum network might falsify the pathways analysis.
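As a rough illustration of the quantity behind the Fisher’s ratio sampler, the sketch below ranks genes by a two-class Fisher's ratio. It assumes the common definition, squared difference of class means divided by the sum of class variances; the abstract does not state the exact formula or the sampling scheme, so this shows only the ranking step. Gene names and expression values are hypothetical.

```python
from statistics import mean, variance


def fishers_ratio(class_a, class_b):
    """Two-class Fisher's ratio: (mean_a - mean_b)^2 / (var_a + var_b)."""
    gap = (mean(class_a) - mean(class_b)) ** 2
    spread = variance(class_a) + variance(class_b)
    if spread == 0:
        # Constant values in both classes: perfectly separated or identical.
        return float("inf") if gap > 0 else 0.0
    return gap / spread


def rank_genes(expression):
    """expression maps gene -> (values_in_class_a, values_in_class_b).
    Returns gene names sorted from most to least discriminatory."""
    return sorted(
        expression,
        key=lambda gene: fishers_ratio(*expression[gene]),
        reverse=True,
    )


# Hypothetical expression values for two phenotype classes.
toy_expression = {
    "gene_x": ([1.0, 2.0, 3.0], [5.0, 6.0, 7.0]),  # well separated
    "gene_y": ([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]),  # overlapping
}
ranking = rank_genes(toy_expression)
```

A sampler built on this ranking would repeatedly draw small gene signatures, favouring high-ratio genes, and record how often each gene appears in accurate signatures.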
8
Identification of genes associated with cancer progression and prognosis in lung adenocarcinoma: Analyses based on microarray from Oncomine and The Cancer Genome Atlas databases. Mol Genet Genomic Med 2018; 7:e00528. [PMID: 30556321 PMCID: PMC6393652 DOI: 10.1002/mgg3.528] [Citation(s) in RCA: 32] [Impact Index Per Article: 5.3] [Received: 09/17/2018] [Revised: 10/28/2018] [Accepted: 11/07/2018] [Indexed: 12/27/2022] Open
Abstract
Background Lung adenocarcinoma (LUAD) accounts for approximately 40% of all lung cancer patients. There is an urgent need to understand the mechanisms of cancer progression in LUAD and to identify useful biomarkers to predict prognosis. Methods In this study, the Oncomine database was used to identify genes that potentially contribute to cancer progression. Bioinformatics analysis including pathway enrichment and text mining was used to explain the potential roles of the identified genes in LUAD. The Cancer Genome Atlas database was used to analyze the association of gene expression with survival outcomes. Results Our results indicated that 80 genes were significantly dysregulated in LUAD according to four microarrays covering 356 cases of LUAD and 164 cases of normal lung tissue. Twenty genes were consistently and stably dysregulated by more than twofold. Ten of the 20 genes had a relationship with overall survival or disease-free survival in a cohort of 516 LUAD patients, and 19 genes were associated with tumor stage, gender, age, lymph node, or smoking. Low expression of AGER and high expression of CCNB1 were specifically associated with poor survival. Conclusion Our findings suggest that AGER and CCNB1 might be potential biomarkers for diagnosis and prognostic targets for LUAD.
9
Predicting the Health Status of an Unmanned Aerial Vehicles Data-Link System Based on a Bayesian Network. SENSORS 2018; 18:s18113916. [PMID: 30428631 PMCID: PMC6263980 DOI: 10.3390/s18113916] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.5] [Received: 09/20/2018] [Revised: 11/07/2018] [Accepted: 11/07/2018] [Indexed: 01/12/2023]
Abstract
Unmanned aerial vehicles (UAVs) require a data-link system to link ground data terminals to the real-time controls of each UAV. Consequently, the ability to predict the health status of a UAV data-link system is vital for safe and efficient operations. The performance of a UAV data-link system is affected by the health status of both the hardware and the UAV data-links. This paper proposes a method for predicting the health state of a UAV data-link system based on a Bayesian network fusion of information about potential hardware device failures and link failures. Our model employs the Bayesian network to describe the information and uncertainty associated with a complex multi-level system. To predict the health status of the UAV data-link, we use the health status information about the root node equipment with various life characteristics along with the health status of the links as affected by the bit error rate. To test the validity of the model, we tested its prediction of the health of a multi-level solar-powered UAV data-link system; the results show that the method can quantitatively predict the health status of the solar-powered UAV data-link system. The results can provide guidance for improving the reliability of UAV data-link systems and lay a foundation for accurately predicting their health status.
10
Integrative Approaches to Understanding the Pathogenic Role of Genetic Variation in Rheumatic Diseases. Rheum Dis Clin North Am 2018; 43:449-466. [PMID: 28711145 DOI: 10.1016/j.rdc.2017.04.012] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.8] [Indexed: 12/30/2022]
Abstract
The use of high-throughput omics may help to understand the contribution of genetic variants to the pathogenesis of rheumatic diseases. We discuss the concept of missing heritability: that genetic variants do not explain the heritability of rheumatoid arthritis and related rheumatologic conditions. In addition to an overview of how integrative data analysis can lead to novel insights into mechanisms of rheumatic diseases, we describe statistical approaches to prioritizing genetic variants for future functional analyses. We illustrate how analyses of large datasets provide hope for improved approaches to the diagnosis, treatment, and prevention of rheumatic diseases.
11
Sampling Defective Pathways in Phenotype Prediction Problems via the Fisher’s Ratio Sampler. BIOINFORMATICS AND BIOMEDICAL ENGINEERING 2018. [DOI: 10.1007/978-3-319-78759-6_2] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.3] [Indexed: 01/27/2023]
12
Verification of Three-Phase Dependency Analysis Bayesian Network Learning Method for Maize Carotenoid Gene Mining. BIOMED RESEARCH INTERNATIONAL 2017; 2017:1813494. [PMID: 28828382 PMCID: PMC5554554 DOI: 10.1155/2017/1813494] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/03/2017] [Accepted: 06/27/2017] [Indexed: 11/17/2022]
Abstract
Background and Objective Mining the genes related to maize carotenoid components is important for improving the carotenoid content and quality of maize. Methods Using the entropy estimation method with a Gaussian kernel probability density estimator, we apply the three-phase dependency analysis (TPDA) Bayesian network structure learning method to construct the network of maize genes and carotenoid component traits. Results Using two discretization methods with different discretization values, we compare the learning effect and efficiency of 10 Bayesian network structure learning methods. The method is verified and analyzed on a maize dataset from a global germplasm collection with 527 elite inbred lines. Conclusions The results confirmed the effectiveness of the TPDA method, which significantly outperforms the other 9 Bayesian network learning methods. It is an efficient method of mining genes for maize carotenoid component traits. The parameters obtained by the experiments will help carry out practical gene mining effectively in the future.
13
epiACO - a method for identifying epistasis based on ant Colony optimization algorithm. BioData Min 2017; 10:23. [PMID: 28694848 PMCID: PMC5500974 DOI: 10.1186/s13040-017-0143-7] [Citation(s) in RCA: 32] [Impact Index Per Article: 4.6] [Received: 10/27/2016] [Accepted: 06/29/2017] [Indexed: 11/23/2022] Open
Abstract
Background Identifying epistasis or epistatic interactions, which refer to nonlinear interaction effects of single nucleotide polymorphisms (SNPs), is essential to understand disease susceptibility and to detect genetic architectures underlying complex diseases. Although many methods have been developed for identifying epistatic interactions, algorithmic development is still ongoing because of the methodological and computational challenges involved. Results In this study, a method named epiACO, which is based on the ant colony optimization algorithm, is proposed to identify epistatic interactions. Highlights of epiACO are the introduced fitness function Svalue, path selection strategies, and a memory-based strategy. The Svalue leverages the advantages of both mutual information and Bayesian network to effectively and efficiently measure associations between SNP combinations and the phenotype. Two path selection strategies, i.e., a probabilistic path selection strategy and a stochastic path selection strategy, are provided to adaptively guide ant behaviors of exploration and exploitation. The memory-based strategy is designed to retain candidate solutions found in previous iterations and compare them to solutions of the current iteration to generate new candidate solutions, yielding a more accurate way of identifying epistasis. Conclusions Experiments with epiACO and comparisons with other recent methods (epiMODE, TEAM, BOOST, SNPRuler, AntEpiSeeker, AntMiner, MACOED, and IACO) are performed on both simulation data sets and a real data set of age-related macular degeneration. Results show that epiACO is promising in identifying epistasis and might be an alternative to existing methods.
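One ingredient of the fitness function described above is mutual information between a SNP combination and the phenotype. The sketch below computes plain empirical mutual information only; it is not the Svalue, whose definition the abstract does not give. The function name and the toy genotypes are hypothetical, chosen so that neither SNP is informative alone while the pair is.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Empirical mutual information I(X; Y) in bits for two label sequences."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (count / n) * log2(count * n / (px[x] * py[y]))
        for (x, y), count in joint.items()
    )


# Hypothetical XOR-like toy data: pure interaction, no main effects.
snp1 = [0, 0, 1, 1]
snp2 = [0, 1, 0, 1]
phenotype = [0, 1, 1, 0]

mi_single = mutual_information(snp1, phenotype)
mi_pair = mutual_information(list(zip(snp1, snp2)), phenotype)
```

Treating the SNP pair as a single joint variable is what lets the measure pick up the interaction; on real data the estimate is biased upward for small samples and many genotype cells, which is one reason combined scores are used.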
14
Improving risk management for violence in mental health services: a multimethods approach. PROGRAMME GRANTS FOR APPLIED RESEARCH 2016. [DOI: 10.3310/pgfar04160] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.4] [Indexed: 12/24/2022]
Abstract
Background Mental health professionals increasingly carry out risk assessments to prevent future violence by their patients. However, there are problems with accuracy and these assessments do not always translate into successful risk management. Objectives Our aim was to improve the accuracy of assessment and identify risk factors that are causal to be targeted by clinicians to ensure good risk management. Our objectives were to investigate key risks at the population level, construct new static and dynamic instruments, test validity and construct new models of risk management using Bayesian networks. Methods and results We utilised existing data sets from two national and commissioned a survey to identify risk factors at the population level. We confirmed that certain mental health factors previously thought to convey risk were important in future assessments and excluded others from subsequent parts of the study. Using a first-episode psychosis cohort, we constructed a risk assessment instrument for men and women and showed important sex differences in pathways to violence. We included a 1-year follow-up of patients discharged from medium secure services and validated a previously developed risk assessment guide, the Medium Security Recidivism Assessment Guide (MSRAG). We found that it is essential to combine ratings from static instruments such as the MSRAG with dynamic risk factors. Static levels of risk have important modifying effects on dynamic risk factors for their effects on violence and we further demonstrated this using a sample of released prisoners to construct risk assessment instruments for violence, robbery, drugs and acquisitive convictions. We constructed a preliminary instrument including dynamic risk measures and validated this in a second large data set of released prisoners.
Finally, we incorporated findings from the follow-up of psychiatric patients discharged from medium secure services and two samples of released prisoners to construct Bayesian models to guide clinicians in risk management. Conclusions Risk factors for violence identified at the population level, including paranoid delusions and anxiety disorder, should be integrated in risk assessments together with established high-risk psychiatric morbidity such as substance misuse and antisocial personality disorder. The incorporation of dynamic factors resulted in improved accuracy, especially when combined in assessments using actuarial measures to obtain levels of risk using static factors. It is important to continue developing dynamic risk and protective measures with the aim of identifying factors that are causally related to violence. Only causal factors should be targeted in violence prevention interventions. Bayesian networks show considerable promise in developing software for clinicians to identify targets for intervention in the field. The Bayesian models developed in this programme are at the prototypical stage and require further programmer development into applications for use on tablets. These should be further tested in the field and then compared with structured professional judgement in a randomised controlled trial in terms of their effectiveness in preventing future violence. Funding The National Institute for Health Research Programme Grants for Applied Research programme.
15
New Algorithm and Software (BNOmics) for Inferring and Visualizing Bayesian Networks from Heterogeneous Big Biological and Genetic Data. J Comput Biol 2016; 24:340-356. [PMID: 27681505 PMCID: PMC5372779 DOI: 10.1089/cmb.2016.0100] [Citation(s) in RCA: 18] [Impact Index Per Article: 2.3] [Indexed: 12/13/2022] Open
Abstract
Bayesian network (BN) reconstruction is a prototypical systems biology data analysis approach that has been successfully used to reverse engineer and model networks reflecting different layers of biological organization (ranging from genetic to epigenetic to cellular pathway to metabolomic). It is especially relevant in the context of modern (ongoing and prospective) studies that generate heterogeneous high-throughput omics datasets. However, there are both theoretical and practical obstacles to the seamless application of BN modeling to such big data, including computational inefficiency of optimal BN structure search algorithms, ambiguity in data discretization, mixing data types, imputation and validation, and, in general, limited scalability in both reconstruction and visualization of BNs. To overcome these and other obstacles, we present BNOmics, an improved algorithm and software toolkit for inferring and analyzing BNs from omics datasets. BNOmics aims at comprehensive systems biology—type data exploration, including both generating new biological hypothesis and testing and validating the existing ones. Novel aspects of the algorithm center around increasing scalability and applicability to varying data types (with different explicit and implicit distributional assumptions) within the same analysis framework. An output and visualization interface to widely available graph-rendering software is also included. Three diverse applications are detailed. BNOmics was originally developed in the context of genetic epidemiology data and is being continuously optimized to keep pace with the ever-increasing inflow of available large-scale omics datasets. 
As such, the scalability and usability of the software on less-than-exotic computer hardware are a priority, as is the applicability of the algorithm and software to heterogeneous datasets containing many data types: single-nucleotide polymorphisms and other genetic/epigenetic/transcriptome variables, metabolite levels, epidemiological variables, endpoints, phenotypes, etc.
16
Abstract
The analysis of GWAS data has long been restricted to simple models that cannot fully capture the genetic architecture of complex human diseases. As a shift from standard approaches, we propose here a general statistical framework for multi-SNP analysis of GWAS data based on a Bayesian graphical model. Our goal is to develop a general approach applicable to a wide range of genetic association problems, including GWAS and fine-mapping studies, and, more specifically, be able to: (1) Assess the joint effect of multiple SNPs that can be linked or unlinked and interact or not; (2) Explore the multi-SNP model space efficiently using the Mode Oriented Stochastic Search (MOSS) algorithm and determine the best models. We illustrate our new methodology with an application to the CGEM breast cancer GWAS data. Our algorithm selected several SNPs embedded in multi-locus models with high posterior probabilities. Most of the SNPs selected have a biological relevance. Interestingly, several of them have never been detected in standard single-SNP analyses. Finally, our approach has been implemented in the open source R package genMOSS.
|
17
|
Identification of genetic interaction networks via an evolutionary algorithm evolved Bayesian network. BioData Min 2016; 9:18. [PMID: 27168765 PMCID: PMC4862166 DOI: 10.1186/s13040-016-0094-4] [Citation(s) in RCA: 8] [Impact Index Per Article: 1.0] [Received: 10/05/2015] [Accepted: 04/18/2016] [Indexed: 12/01/2022] Open
Abstract
Background The future of medicine is moving towards precision medicine, with the goal of preventing and treating diseases by taking inter-individual variability into account. A large part of the variability lies in our genetic makeup. With the fast-paced improvement of high-throughput methods for genome sequencing, a tremendous amount of genetic data has already been generated. The next hurdle for precision medicine is to have sufficient computational tools for analyzing large sets of data. Genome-Wide Association Studies (GWAS) have been the primary method to assess the relationship between single nucleotide polymorphisms (SNPs) and disease traits. While GWAS is sufficient for finding individual SNPs with strong main effects, it does not capture potential interactions among multiple SNPs. In many traits, a large proportion of variation remains unexplained by main effects alone, leaving the door open for exploring the role of genetic interactions. However, identifying genetic interactions in large-scale genomics data poses a challenge even for modern computing. Results For this study, we present a new algorithm, Grammatical Evolution Bayesian Network (GEBN), that utilizes Bayesian networks to identify interactions in the data and, at the same time, uses an evolutionary algorithm to reduce the computational cost associated with network optimization. GEBN excelled in simulation studies where the data contained main effects and interaction effects. We also applied GEBN to a Type 2 diabetes (T2D) dataset obtained from the Marshfield Personalized Medicine Research Project (PMRP). We were able to identify genetic interactions for T2D cases and controls and use information from those interactions to classify T2D samples. We obtained an average testing area under the curve (AUC) of 86.8%. We also identified several interacting genes such as INADL and LPP that are known to be associated with T2D. 
Conclusions Developing the computational tools to explore genetic associations beyond main effects remains a critically important challenge in human genetics. Methods, such as GEBN, demonstrate the utility of considering genetic interactions, as they likely explain some of the missing heritability.
|
18
|
Computational methods for ubiquitination site prediction using physicochemical properties of protein sequences. BMC Bioinformatics 2016; 17:116. [PMID: 26940649 PMCID: PMC4778322 DOI: 10.1186/s12859-016-0959-z] [Citation(s) in RCA: 17] [Impact Index Per Article: 2.1] [Received: 05/20/2015] [Accepted: 02/19/2016] [Indexed: 11/10/2022] Open
Abstract
Background Ubiquitination is an important post-translational protein modification that has been widely investigated by biologists. Different experimental and computational methods have been developed to identify the ubiquitination sites in protein sequences. This paper aims at exploring computational machine learning methods for the prediction of ubiquitination sites using the physicochemical properties (PCPs) of amino acids in the protein sequences. Results We first establish six different ubiquitination data sets, whose records contain both ubiquitination sites and non-ubiquitination sites in varying numbers of protein sequence segments. In particular, to establish such data sets, protein sequence segments are extracted from the original protein sequences used in four published papers on ubiquitination, while 531 PCP features of each extracted protein sequence segment are calculated based on PCP values from AAindex (Amino Acid index database) by averaging PCP values of all amino acids on each segment. Various computational machine-learning methods, including four Bayesian network methods (i.e., Naïve Bayes (NB), Feature Selection NB (FSNB), Model Averaged NB (MANB), and Efficient Bayesian Multivariate Classifier (EBMC)) and three regression methods (i.e., Support Vector Machine (SVM), Logistic Regression (LR), and Least Absolute Shrinkage and Selection Operator (LASSO)), are then applied to the six established segment-PCP data sets. Five-fold cross-validation and the Area Under Receiver Operating Characteristic Curve (AUROC) are employed to evaluate the ubiquitination prediction performance of each method. Results demonstrate that the PCP data of protein sequences contain information that could be mined by machine learning methods for ubiquitination site prediction. 
The comparative results show that EBMC, SVM and LR perform better than the other methods, and EBMC is the only method that achieves AUCs greater than or equal to 0.6 on all six established data sets. Results also show that EBMC tends to perform better on larger data sets. Conclusions Machine learning methods have been employed for ubiquitination site prediction based on physicochemical properties of amino acids in protein sequences. Results demonstrate the effectiveness of using machine learning methodology to mine information from PCP data concerning protein sequences, as well as the superiority of EBMC, SVM and LR (especially EBMC) for ubiquitination prediction compared to other methods.
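The AUROC used as the evaluation metric above can be illustrated with a minimal sketch. It computes the area as the fraction of (positive, negative) pairs that the classifier ranks correctly, counting ties as half. The function name `auroc` and the toy scores are illustrative assumptions and are not taken from the paper.

```python
def auroc(scores, labels):
    """Area under the ROC curve via pairwise ranking.

    Equals the probability that a randomly chosen positive example
    receives a higher score than a randomly chosen negative one
    (ties count as 0.5). Assumes both classes are present.
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical predicted probabilities for five sequence segments.
toy_scores = [0.9, 0.8, 0.4, 0.35, 0.1]
toy_labels = [1, 0, 1, 1, 0]  # 1 = ubiquitination site, 0 = non-site
toy_auc = auroc(toy_scores, toy_labels)
```

In a five-fold cross-validation setting, one would presumably compute this value on each held-out fold and average the five results.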
|
19
|
Interaction of Wnt pathway related variants with type 2 diabetes in a Chinese Han population. PeerJ 2015; 3:e1304. [PMID: 26509107 PMCID: PMC4621788 DOI: 10.7717/peerj.1304] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.6] [Received: 06/25/2015] [Accepted: 09/17/2015] [Indexed: 11/20/2022] Open
Abstract
Aims. Epistasis within sets of functionally related genes may contribute to susceptibility to type 2 diabetes (T2D). The Wnt pathway has been reported to play an important role in the pathogenesis of T2D. Here we applied tag SNPs to explore the association between T2D and epistasis among Wnt pathway genes in the Han Chinese population. Methods. Variants of fourteen genes selected from the Wnt pathway were analyzed for epistasis. Gene–gene interactions in case-control samples were identified by the generalized multifactor dimensionality reduction (GMDR) method. We performed a case-control association analysis on a total of 1,026 individuals with T2D and 1,157 controls via tag SNPs in the Wnt pathway. Results. In single-locus analysis, SNPs in four genes were significantly associated with T2D after adjustment for multiple testing (rs7903146C in TCF7L2, p = 3.21 × 10⁻³, OR = 1.39, 95% CI [1.31–1.47]; rs12904944G in SMAD3, p = 2.51 × 10⁻³, OR = 1.39, 95% CI [1.31–1.47]; rs2273368C in WNT2B, p = 4.46 × 10⁻³, OR = 1.23, 95% CI [1.11–1.32]; rs6902123C in PPARD, p = 1.14 × 10⁻², OR = 1.40, 95% CI [1.32–1.48]). The haplotype TGC constructed from TCF7L2 (rs7903146), DKK1 (rs2241529) and BTRC (rs4436485) showed a significant association with T2D (OR = 0.750, 95% CI [0.579–0.972], P = 0.03). For epistasis analysis, the optimized combination was the two-locus model of WNT2B rs2273368 and TCF7L2 rs7903146, which had the maximum cross-validation consistency. It scored 9 out of 10 on the sign test (p = 0.0107). The best combination increased the risk of T2D by 1.47 times (95% CI [1.13–1.91], p = 0.0039). Conclusions. Epistasis between TCF7L2 and WNT2B is associated with susceptibility to T2D in a Han Chinese population. Our results are consistent with the complex nature of T2D and reveal an interaction that would have been missed using conventional tools.
|
20
|
A gene-based information gain method for detecting gene-gene interactions in case-control studies. Eur J Hum Genet 2015; 23:1566-72. [PMID: 25758991 DOI: 10.1038/ejhg.2015.16] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.7] [Received: 09/28/2014] [Revised: 11/30/2014] [Accepted: 01/14/2015] [Indexed: 12/31/2022] Open
Abstract
Currently, most methods for detecting gene-gene interactions (GGIs) in genome-wide association studies are either SNP-based or gene-based. Generally, gene-based methods can be more powerful than SNP-based methods. However, some gene-based entropy methods can only capture linear relationships between genes. We therefore proposed a nonparametric gene-based information gain method (GBIGM) that can capture both linear and nonlinear relationships between genes. Through simulation with different odds ratios, sample sizes and prevalence rates, GBIGM was shown to be valid and more powerful than the classic KCCU method and the SNP-based entropy method. In the analysis of data from 17 genes on rheumatoid arthritis, GBIGM was more effective than the other two methods because it yielded fewer significant results, which is important for biological verification. Therefore, GBIGM is a suitable and powerful tool for detecting GGIs in case-control studies.
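To make the idea of entropy-based information gain concrete, the following minimal sketch computes IG(Y; G) = H(Y) − H(Y | G) for a disease status Y and a genotype variable G, and shows that a purely nonlinear (XOR-like) interaction yields zero gain for each locus alone but full gain for the joint genotype. This is a generic illustration, not the GBIGM statistic itself; the function names and the toy data are assumptions.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of hashable labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(features, status):
    """IG(status; features) = H(status) - H(status | features)."""
    n = len(status)
    groups = {}
    for f, y in zip(features, status):
        groups.setdefault(f, []).append(y)
    conditional = sum(len(ys) / n * entropy(ys) for ys in groups.values())
    return entropy(status) - conditional


# Toy data: disease status is the XOR of two binary loci, so neither
# locus is informative alone but the pair determines status exactly.
snp_a = [0, 0, 1, 1, 0, 0, 1, 1]
snp_b = [0, 1, 0, 1, 0, 1, 0, 1]
status = [a ^ b for a, b in zip(snp_a, snp_b)]

ig_a = information_gain(snp_a, status)
ig_b = information_gain(snp_b, status)
ig_joint = information_gain(list(zip(snp_a, snp_b)), status)
```

Here `ig_joint - ig_a - ig_b` is one common way to quantify the interaction component; a gene-based method would apply the same idea to multi-SNP genotypes per gene rather than to single loci.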
|
21
|
Identifying genetic interactions associated with late-onset Alzheimer's disease. BioData Min 2014; 7:35. [PMID: 25649863 PMCID: PMC4300162 DOI: 10.1186/s13040-014-0035-z] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.5] [Received: 05/30/2014] [Accepted: 12/06/2014] [Indexed: 01/23/2023] Open
Abstract
Background Identifying genetic interactions in data obtained from genome-wide association studies (GWASs) can help in understanding the genetic basis of complex diseases. The large number of single nucleotide polymorphisms (SNPs) in GWASs, however, makes the identification of genetic interactions computationally challenging. We developed the Bayesian Combinatorial Method (BCM), which can identify pairs of SNPs that in combination have high statistical association with disease. Results We applied BCM to two late-onset Alzheimer’s disease (LOAD) GWAS datasets to identify SNPs that interact with known Alzheimer-associated SNPs. We also compared BCM with logistic regression as implemented in PLINK. Gene Ontology analysis of genes from the top 200 dataset SNPs for both GWAS datasets showed overrepresentation of LOAD-related terms. Four genes were common to both datasets: APOE and APOC1, which have well-established associations with LOAD, and CAMK1D and FBXL13, not previously linked to LOAD but having evidence of involvement in LOAD. Supporting evidence was also found for additional genes from the top 30 dataset SNPs. Conclusion BCM performed well in identifying several SNPs having evidence of involvement in the pathogenesis of LOAD that would not have been identified by univariate analysis due to small main effects. These results provide support for applying BCM to identify potential genetic variants such as SNPs from high-dimensional GWAS datasets.
|
22
|
Revealing Biological Pathways Implicated in Lung Cancer from TCGA Gene Expression Data Using Gene Set Enrichment Analysis. Cancer Inform 2014; 13:113-21. [PMID: 25520551 PMCID: PMC4251186 DOI: 10.4137/cin.s13882] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.7] [Received: 04/16/2014] [Revised: 09/05/2014] [Accepted: 09/09/2014] [Indexed: 12/11/2022] Open
Abstract
Analyzing biological system abnormalities in cancer patients based on measures of biological entities, such as gene expression levels, is an important and challenging problem. This paper applies existing methods, Gene Set Enrichment Analysis and Signaling Pathway Impact Analysis, to pathway abnormality analysis in lung cancer using microarray gene expression data. Gene expression data from studies of Lung Squamous Cell Carcinoma (LUSC) in The Cancer Genome Atlas project, and pathway gene set data from the Kyoto Encyclopedia of Genes and Genomes were used to analyze the relationship between pathways and phenotypes. Results, in the form of pathway rankings, indicate that some pathways may behave abnormally in LUSC. For example, both the cell cycle and viral carcinogenesis pathways ranked very high in LUSC. Furthermore, some pathways that are known to be associated with cancer, such as the p53 and the PI3K-Akt signal transduction pathways, were found to rank high in LUSC. Other pathways, such as bladder cancer and thyroid cancer pathways, were also ranked high in LUSC.
|
23
|
Inferring Aberrant Signal Transduction Pathways in Ovarian Cancer from TCGA Data. Cancer Inform 2014; 13:29-36. [PMID: 25392681 PMCID: PMC4216062 DOI: 10.4137/cin.s13881] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.4] [Received: 02/25/2014] [Revised: 03/10/2014] [Accepted: 03/10/2014] [Indexed: 12/12/2022] Open
Abstract
This paper concerns a new method for identifying aberrant signal transduction pathways (STPs) in cancer using case/control gene expression-level datasets, and the application of that method and an existing method to an ovarian carcinoma dataset. Both methods identify STPs that are plausibly linked to all cancers based on current knowledge. Thus, the paper is most appropriate for the cancer informatics community. Our hypothesis is that STPs that are altered in tumorous tissue can be identified by applying a new Bayesian network (BN)-based method (causal analysis of STP aberration (CASA)) and an existing method (signaling pathway impact analysis (SPIA)) to The Cancer Genome Atlas (TCGA) gene expression-level datasets. To test this hypothesis, we analyzed 20 cancer-related STPs and 6 randomly chosen STPs using the 591 cases in the TCGA ovarian carcinoma dataset, and the 102 controls in all 5 TCGA cancer datasets. We identified all the genes related to each of the 26 pathways, and developed separate gene expression datasets for each pathway. The results of the two methods were highly correlated. Furthermore, many of the STPs that ranked highest according to both methods are plausibly linked to all cancers based on current knowledge. Finally, CASA ranked the cancer-related STPs over the randomly selected STPs at a significance level below 0.05 (P = 0.047), but SPIA did not (P = 0.083).
|
24
|
Modeling the altered expression levels of genes on signaling pathways in tumors as causal Bayesian networks. Cancer Inform 2014; 13:77-84. [PMID: 24932098 PMCID: PMC4051800 DOI: 10.4137/cin.s13578] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.0] [Received: 11/05/2013] [Revised: 11/25/2013] [Accepted: 11/25/2013] [Indexed: 01/05/2023] Open
Abstract
This paper concerns a study indicating that the expression levels of genes in signaling pathways can be modeled using a causal Bayesian network (BN) that is altered in tumorous tissue. These results open up promising areas of future research that can help identify driver genes and therapeutic targets. So, it is most appropriate for the cancer informatics community. Our central hypothesis is that the expression levels of genes that code for proteins on a signal transduction network (STP) are causally related and that this causal structure is altered when the STP is involved in cancer. To test this hypothesis, we analyzed 5 STPs associated with breast cancer, 7 STPs associated with other cancers, and 10 randomly chosen pathways, using a breast cancer gene expression level dataset containing 529 cases and 61 controls. We identified all the genes related to each of the 22 pathways and developed separate gene expression datasets for each pathway. We obtained significant results indicating that the causal structure of the expression levels of genes coding for proteins on STPs, which are believed to be implicated in both breast cancer and in all cancers, is more altered in the cases relative to the controls than the causal structure of the randomly chosen pathways.
|
25
|
Bayesian systems-based genetic association analysis with effect strength estimation and omic wide interpretation: a case study in rheumatoid arthritis. Methods Mol Biol 2014; 1142:143-76. [PMID: 24706282 DOI: 10.1007/978-1-4939-0404-4_14] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 10/25/2022]
Abstract
Rich dependency structures are often formed in genetic association studies between the phenotypic, clinical, and environmental descriptors. These descriptors may not be standardized, and may encompass various disease definitions and clinical endpoints which are only weakly influenced by various (e.g., genetic) factors. Such loosely defined complex intermediate clinical phenotypes are typically used in follow-up candidate gene association studies, e.g., after genome-wide analysis, to deepen the understanding of the associations and to estimate effect strength. This chapter discusses a solid methodology, which is useful in such a scenario, by using probabilistic graphical models, namely, Bayesian networks in the Bayesian statistical framework. This method offers systematically scalable, comprehensive hierarchical hypotheses about multivariate relevance. We discuss its workflow: from data engineering to semantic publication of the results. We overview the construction, visualization, and interpretation of complex hypotheses related to the structural analysis of relevance. Furthermore, we illustrate the use of a dependency model-based relevance measure, which takes into account the structural properties of the model, for quantifying the effect strength. Finally, we discuss the "interpretational" or translational challenge of a genetic association study, with a focus on the fusion of heterogeneous omic knowledge to reintegrate the results into a genome-wide context.
|
26
|
ATHENA: the analysis tool for heritable and environmental network associations. Bioinformatics 2014; 30:698-705. [PMID: 24149050 PMCID: PMC3933870 DOI: 10.1093/bioinformatics/btt572] [Citation(s) in RCA: 41] [Impact Index Per Article: 4.1] [Received: 01/23/2013] [Revised: 09/03/2013] [Accepted: 09/26/2013] [Indexed: 11/13/2022] Open
Abstract
MOTIVATION Advancements in high-throughput technology have allowed researchers to examine the genetic etiology of complex human traits in a robust fashion. Although genome-wide association studies have identified many novel variants associated with hundreds of traits, a large proportion of the estimated trait heritability remains unexplained. One hypothesis is that the commonly used statistical techniques and study designs are not robust to the complex etiology that may underlie these human traits. This etiology could include non-linear gene × gene or gene × environment interactions. Additionally, other levels of biological regulation may play a large role in trait variability. RESULTS To address the need for computational tools that can explore enormous datasets to detect complex susceptibility models, we have developed a software package called the Analysis Tool for Heritable and Environmental Network Associations (ATHENA). ATHENA combines various variable filtering methods with machine learning techniques to analyze high-throughput categorical (e.g. single nucleotide polymorphisms) and quantitative (e.g. gene expression levels) predictor variables to generate multivariable models that predict either categorical (e.g. disease status) or quantitative (e.g. cholesterol levels) outcomes. The goal of this article is to demonstrate the utility of ATHENA using simulated and biological datasets that consist of both single nucleotide polymorphisms and gene expression variables to identify complex prediction models. Importantly, this method is flexible and can be expanded to include other types of high-throughput data (e.g. RNA-seq data and biomarker measurements). AVAILABILITY ATHENA is freely available for download. The software, user manual and tutorial can be downloaded from http://ritchielab.psu.edu/ritchielab/software.
|
27
|
A novel artificial neural network method for biomedical prediction based on matrix pseudo-inversion. J Biomed Inform 2013; 48:114-21. [PMID: 24361387 DOI: 10.1016/j.jbi.2013.12.009] [Citation(s) in RCA: 17] [Impact Index Per Article: 1.5] [Received: 08/09/2013] [Revised: 12/06/2013] [Accepted: 12/11/2013] [Indexed: 12/13/2022]
Abstract
Biomedical prediction based on clinical and genome-wide data has become increasingly important in disease diagnosis and classification. To solve the prediction problem in an effective manner for the improvement of clinical care, we develop a novel Artificial Neural Network (ANN) method based on Matrix Pseudo-Inversion (MPI) for use in biomedical applications. The MPI-ANN is constructed as a three-layer (i.e., input, hidden, and output layers) feed-forward neural network, and the weights connecting the hidden and output layers are directly determined based on MPI without a lengthy learning iteration. The LASSO (Least Absolute Shrinkage and Selection Operator) method is also presented for comparative purposes. Single Nucleotide Polymorphism (SNP) simulated data and real breast cancer data are employed to validate the performance of the MPI-ANN method via 5-fold cross validation. Experimental results demonstrate the efficacy of the developed MPI-ANN for disease classification and prediction, in view of the significantly superior accuracy (i.e., the rate of correct predictions), as compared with LASSO. The results based on the real breast cancer data also show that the MPI-ANN has better performance than other machine learning methods (including support vector machine (SVM), logistic regression (LR), and an iterative ANN). In addition, experiments demonstrate that our MPI-ANN could be used for bio-marker selection as well.
|
28
|
Assessment of genetic and nongenetic interactions for the prediction of depressive symptomatology: an analysis of the Wisconsin Longitudinal Study using machine learning algorithms. Am J Public Health 2013; 103 Suppl 1:S136-44. [PMID: 23927508 DOI: 10.2105/ajph.2012.301141] [Citation(s) in RCA: 24] [Impact Index Per Article: 2.2] [Indexed: 11/04/2022]
Abstract
OBJECTIVES We examined depression within a multidimensional framework consisting of genetic, environmental, and sociobehavioral factors and, using machine learning algorithms, explored interactions among these factors that might better explain the etiology of depressive symptoms. METHODS We measured current depressive symptoms using the Center for Epidemiologic Studies Depression Scale (n = 6378 participants in the Wisconsin Longitudinal Study). Genetic factors were 78 single nucleotide polymorphisms (SNPs); environmental factors were 13 stressful life events (SLEs), plus a composite proportion of SLEs index; and sociobehavioral factors were 18 personality, intelligence, and other health or behavioral measures. We performed traditional SNP associations via logistic regression likelihood ratio testing and explored interactions with support vector machines and Bayesian networks. RESULTS After correction for multiple testing, we found no significant single genotypic associations with depressive symptoms. Machine learning algorithms showed no evidence of interactions. Naïve Bayes produced the best models in both subsets and included only environmental and sociobehavioral factors. CONCLUSIONS We found no single or interactive associations between genetic factors and depressive symptoms. Various environmental and sociobehavioral factors were more predictive of depressive symptoms, yet their impacts were independent of one another. A genome-wide analysis of genetic alterations using machine learning methodologies will provide a framework for identifying genetic-environmental-sociobehavioral interactions in depressive symptoms.
|
29
|
Abstract
After an association between genetic variants and a phenotype has been established, further study goals comprise the classification of patients according to disease risk or the estimation of disease probability. To accomplish this, different statistical methods are required, and specifically machine-learning approaches may offer advantages over classical techniques. In this paper, we describe methods for the construction and evaluation of classification and probability estimation rules. We review the use of machine-learning approaches in this context and explain some of the machine-learning algorithms in detail. Finally, we illustrate the methodology through application to a genome-wide association analysis on rheumatoid arthritis.
|
30
|
Contributions of renin-angiotensin system-related gene interactions to obesity in a Chinese population. PLoS One 2012; 7:e42881. [PMID: 22880127 PMCID: PMC3412812 DOI: 10.1371/journal.pone.0042881] [Citation(s) in RCA: 12] [Impact Index Per Article: 1.0] [Received: 03/31/2012] [Accepted: 07/13/2012] [Indexed: 01/11/2023] Open
Abstract
BACKGROUND Gene-gene interactions may be partly responsible for complex traits such as obesity. Increasing evidence suggests that the renin-angiotensin system (RAS) contributes to the etiology of obesity. How epistasis among RAS genes contributes to obesity remains under investigation. We aim to evaluate the contribution of RAS-related gene interactions to predisposition to obesity in a Chinese population. METHODOLOGY AND PRINCIPAL FINDINGS We selected six single nucleotide polymorphisms (SNPs) located in the angiotensinogen (AGT), angiotensin converting enzyme (ACE), angiotensin type 1 receptor (AGTR1), MAS1, nitric oxide synthase 3 (NOS3) and bradykinin B2 receptor (BDKRB2) genes, and genotyped them in 324 unrelated individuals with obesity (BMI ≥ 28 kg/m²) and 373 non-obese controls (BMI 18.5 to <24 kg/m²) from a large-scale population-based cohort. We analyzed gene-gene interactions among the 6 polymorphic loci using the Generalized Multifactor Dimensionality Reduction (GMDR) method, which has been shown to be effective for detecting gene-gene interactions in case-control studies with relatively small samples. We then used logistic regression models to confirm the best combination of loci identified by GMDR. The analysis showed a significant gene-gene interaction between the rs220721 polymorphism in the MAS1 gene and the rs1799722 polymorphism in the BDKRB2 gene. The best two-locus combination scored 9 for cross-validation consistency and 9 for the sign test (p = 0.0107). This interaction showed the maximum consistency and minimum prediction error among all gene-gene interaction models evaluated. Moreover, the combination of MAS1 rs220721 and BDKRB2 rs1799722 was associated with a significantly increased risk of obesity (OR 1.82, 95% CI: 1.15-2.88, p = 0.0103). CONCLUSIONS AND SIGNIFICANCE These results suggest that SNPs from the RAS-related genes may contribute to the risk of obesity in an interactive manner in a Chinese population. 
The gene-gene interaction may serve as a novel area for obesity research.
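The core step of multifactor dimensionality reduction can be sketched as follows: each multi-locus genotype cell is labeled high-risk when its case:control ratio exceeds the overall ratio, and locus combinations are ranked by the balanced accuracy of the resulting classifier. This is a minimal illustration of plain MDR on training data only; it omits the cross-validation, sign test, and score-based covariate adjustment that GMDR adds. All names and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def mdr_model(genotypes, status, loci):
    """Map each genotype cell at `loci` to 1 (high-risk) or 0 (low-risk).

    Assumes status is coded 0 = control, 1 = case and that both
    classes are present.
    """
    n_cases = sum(status)
    threshold = n_cases / (len(status) - n_cases)  # overall case:control ratio
    counts = defaultdict(lambda: [0, 0])  # cell -> [controls, cases]
    for row, y in zip(genotypes, status):
        counts[tuple(row[i] for i in loci)][y] += 1
    return {
        cell: int(cases > threshold * controls)
        for cell, (controls, cases) in counts.items()
    }


def balanced_accuracy(model, genotypes, status, loci):
    """Mean of sensitivity and specificity of the high/low-risk labels."""
    tp = tn = fp = fn = 0
    for row, y in zip(genotypes, status):
        pred = model.get(tuple(row[i] for i in loci), 0)
        if y == 1 and pred == 1:
            tp += 1
        elif y == 1:
            fn += 1
        elif pred == 0:
            tn += 1
        else:
            fp += 1
    return 0.5 * (tp / (tp + fn) + tn / (tn + fp))


# Toy data: three loci coded 0/1/2; status depends only on loci 0 and 1.
rows = [(a, b, c) for a in range(3) for b in range(3) for c in range(3)]
labels = [(a + b) % 2 for a, b, _ in rows]

scores = {}
for pair in combinations(range(3), 2):
    model = mdr_model(rows, labels, pair)
    scores[pair] = balanced_accuracy(model, rows, labels, pair)
best_loci = max(scores, key=scores.get)
best_score = scores[best_loci]
```

In practice the ranking would be based on held-out folds, and the cross-validation consistency reported above counts how often the same combination is selected across those folds.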
|
31
|
An infinitesimal model for quantitative trait genomic value prediction. PLoS One 2012; 7:e41336. [PMID: 22815992 PMCID: PMC3399838 DOI: 10.1371/journal.pone.0041336] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.1] [Received: 03/15/2012] [Accepted: 06/20/2012] [Indexed: 11/19/2022] Open
Abstract
We developed a marker-based infinitesimal model for quantitative trait analysis. In contrast to the classical infinitesimal model, we now have new information about the segregation of every individual locus of the entire genome. Under this new model, we propose that the genetic effect of an individual locus is a function of the genome location (a continuous quantity). The overall genetic value of an individual is the weighted integral of the genetic effect function along the genome. Numerical integration is performed to find the integral, which requires partitioning the entire genome into a finite number of bins. Each bin may contain many markers. The integral is approximated by the weighted sum of all the bin effects. We thus turn the problem of marker analysis into bin analysis, so that the model dimension decreases from a virtual infinity to a finite number of bins. This new approach can efficiently handle a virtually unlimited number of markers without marker selection. The marker-based infinitesimal model requires high linkage disequilibrium among all markers within a bin. For populations with low or no linkage disequilibrium, we develop an adaptive infinitesimal model. Both the original and the adaptive models are tested using simulated data as well as beef cattle data. The simulated data analysis shows that there is always an optimal number of bins at which the predictability of the bin model is much greater than that of the original marker analysis. Results of the beef cattle data analysis indicate that the bin model can increase the predictability from 10% (multiple marker analysis) to 33% (multiple bin analysis). The marker-based infinitesimal model paves the way towards the solution of genetic mapping and genomic selection using whole genome sequence data.
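The bin approximation described above can be sketched in a few lines: the markers of one individual are collapsed into a fixed number of bin covariates, and the genetic value is approximated by a weighted sum of bin effects. This is only an illustration under simplifying assumptions (equal-sized bins by marker count, a plain mean of marker codes per bin, and given rather than estimated bin effects); the function names and numbers are hypothetical and the paper's own weighting and estimation procedure are not reproduced.

```python
def bin_genotypes(marker_row, n_bins):
    """Collapse one individual's marker codes into n_bins covariates.

    Each covariate is the mean marker code within the bin.
    Assumes 1 <= n_bins <= len(marker_row).
    """
    m = len(marker_row)
    covariates = []
    for k in range(n_bins):
        start = k * m // n_bins
        stop = (k + 1) * m // n_bins
        segment = marker_row[start:stop]
        covariates.append(sum(segment) / len(segment))
    return covariates


def genetic_value(bin_covariates, bin_effects):
    """Approximate the genome-wide integral by a sum over bins."""
    return sum(z * g for z, g in zip(bin_covariates, bin_effects))


# Hypothetical individual with 8 markers coded -1/0/1 and 4 bins.
markers = [1, 1, -1, -1, 0, 1, 1, 1]
z = bin_genotypes(markers, 4)
effects = [0.5, 0.25, 1.0, 0.0]  # assumed bin effects, not estimates
value = genetic_value(z, effects)
```

The model dimension is now the number of bins rather than the number of markers, which is what allows the marker count to grow without a matching growth in parameters.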
|
32
|
Integrating heterogeneous high-throughput data for meta-dimensional pharmacogenomics and disease-related studies. Pharmacogenomics 2012; 13:213-22. [PMID: 22256870 DOI: 10.2217/pgs.11.145] [Citation(s) in RCA: 35] [Impact Index Per Article: 2.9] [Indexed: 01/21/2023] Open
Abstract
The current paradigm of human genetics research is to analyze variation of a single data type (e.g., DNA sequence or RNA levels) to detect genes and pathways that underlie complex traits such as disease state or drug response. While these studies have detected thousands of variations that associate with hundreds of complex phenotypes, much of the estimated heritability, or trait variability due to genetic factors, remains unexplained. We may be able to account for a portion of the missing heritability if we incorporate a systems biology approach into these analyses. Rapid technological advances will make it possible for scientists to explore this hypothesis via the generation of high-throughput omics data: transcriptomic, proteomic and methylomic, to name a few. Analyzing these 'meta-dimensional' data will require clever statistical techniques that allow for the integration of qualitative and quantitative predictor variables. For this article, we examine two major categories of approaches for integrated data analysis, give examples of their use in experimental and in silico datasets, and assess the limitations of each method.
|
33
|
Abstract
Pharmacogenetics aims to elucidate the genetic factors underlying the individual's response to pharmacotherapy. Coupled with the recent (and ongoing) progress in high-throughput genotyping, sequencing and other genomic technologies, pharmacogenetics is rapidly transforming into pharmacogenomics, while pursuing the primary goals of identifying and studying the genetic contribution to drug therapy response and adverse effects, and existing drug characterization and new drug discovery. Accomplishment of both of these goals hinges on gaining a better understanding of the underlying biological systems; however, reverse-engineering biological system models from the massive datasets generated by the large-scale genetic epidemiology studies presents a formidable data analysis challenge. In this article, we review the recent progress made in developing such data analysis methodology within the paradigm of systems biology research that broadly aims to gain a 'holistic', or 'mechanistic' understanding of biological systems by attempting to capture the entirety of interactions between the components (genetic and otherwise) of the system.
|
34
|
Performance analysis of novel methods for detecting epistasis. BMC Bioinformatics 2011; 12:475. [PMID: 22172045 PMCID: PMC3259123 DOI: 10.1186/1471-2105-12-475] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.4] [Received: 04/28/2011] [Accepted: 12/15/2011] [Indexed: 02/03/2023] Open
Abstract
Background Epistasis is recognized as fundamentally important for understanding the mechanism of disease-causing genetic variation. Though many novel methods for detecting epistasis have been proposed, few studies focus on their comparison. Undertaking a comprehensive comparison study is an urgent task and a pathway for bringing the methods to real applications. Results This paper aims at a comparison study of epistasis detection methods through applying related software packages on datasets. For this purpose, we categorize methods according to their search strategies, and select five representative methods (TEAM, BOOST, SNPRuler, AntEpiSeeker and epiMODE) originating from different underlying techniques for comparison. The methods are tested on simulated datasets of different sizes, with various epistasis models, and with/without noise. The types of noise include missing data, genotyping error and phenocopy. Performance is evaluated by detection power (three forms are introduced), robustness, sensitivity and computational complexity. Conclusions None of the selected methods is perfect in all scenarios and each has its own merits and limitations. In terms of detection power, AntEpiSeeker performs best in detecting epistasis displaying marginal effects (eME) and BOOST performs best in identifying epistasis displaying no marginal effects (eNME). In terms of robustness, AntEpiSeeker is robust to all types of noise on eME models, BOOST is robust to genotyping error and phenocopy on eNME models, and SNPRuler is robust to phenocopy on eME models and missing data on eNME models. In terms of sensitivity, AntEpiSeeker is the winner on eME models and both SNPRuler and BOOST perform well on eNME models. In terms of computational complexity, BOOST is the fastest among the methods. In terms of overall performance, AntEpiSeeker and BOOST are recommended as the efficient and effective methods. This comparison study may provide guidelines for applying the methods and further clues for epistasis detection.
|
35
|
Evaluating de novo locus-disease discoveries in GWAS using the signal-to-noise ratio. AMIA ... ANNUAL SYMPOSIUM PROCEEDINGS. AMIA SYMPOSIUM 2011; 2011:617-624. [PMID: 22195117 PMCID: PMC3243170] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 05/31/2023]
Abstract
A genome-wide association study (GWAS) involves examining representative SNPs obtained using high-throughput technologies. A GWAS data set can entail a million SNPs and may soon entail many millions. In a GWAS, researchers often investigate the correlation of each SNP with a disease. With so many hypotheses, it is not straightforward to interpret the results. Strategies include using the Bonferroni correction to determine the significance of a model and Bayesian methods. However, when we are discovering new locus-disease associations, i.e., so-called de novo discoveries, we should not just endeavor to determine the significance of particular models, but also concern ourselves with determining whether it is likely that we have any true discoveries, and if so how many of the highest ranking models we should investigate further. We develop a method based on a signal-to-noise ratio that targets this issue. We apply the method to a GWAS Alzheimer's data set.
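The Bonferroni correction mentioned in this abstract can be sketched in a few lines. This illustrates only the standard correction, not the signal-to-noise method the authors develop, whose details the abstract does not give. The function names and the example p-values are hypothetical.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests

def bonferroni_significant(p_values, alpha=0.05):
    """Indices of tests whose p-value falls below the Bonferroni threshold."""
    threshold = bonferroni_threshold(alpha, len(p_values))
    return [i for i, p in enumerate(p_values) if p < threshold]

# With one million SNPs tested at alpha = 0.05, each test must reach p < 5e-8.
gwas_threshold = bonferroni_threshold(0.05, 1_000_000)

# Made-up p-values for four tests; the per-test threshold here is 0.05 / 4 = 0.0125.
hits = bonferroni_significant([1e-9, 0.01, 4e-8, 0.2])
```

The threshold shrinks in proportion to the number of SNPs tested, which is one reason interpreting results across a million or more hypotheses is not straightforward.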
|
36
|
Cardiovascular diseases and genome-wide association studies. Clin Chim Acta 2011; 412:1697-701. [DOI: 10.1016/j.cca.2011.05.035] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.5] [Received: 05/25/2011] [Revised: 05/31/2011] [Accepted: 05/31/2011] [Indexed: 12/27/2022]
|
37
|
A bayesian method for evaluating and discovering disease loci associations. PLoS One 2011; 6:e22075. [PMID: 21853025 PMCID: PMC3154195 DOI: 10.1371/journal.pone.0022075] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Received: 04/17/2011] [Accepted: 06/14/2011] [Indexed: 11/22/2022] Open
Abstract
BACKGROUND A genome-wide association study (GWAS) typically involves examining representative SNPs in individuals from some population. A GWAS data set can concern a million SNPs and may soon concern billions. Researchers investigate the association of each SNP individually with a disease, and it is becoming increasingly commonplace to also analyze multi-SNP associations. Techniques for handling so many hypotheses include the Bonferroni correction and recently developed Bayesian methods. These methods can encounter problems. Most importantly, they are not applicable to a complex multi-locus hypothesis which has several competing hypotheses rather than only a null hypothesis. A method that computes the posterior probability of complex hypotheses is a pressing need. METHODOLOGY/FINDINGS We introduce the Bayesian network posterior probability (BNPP) method, which addresses these difficulties. The method represents the relationship between a disease and SNPs using a directed acyclic graph (DAG) model, and computes the likelihood of such models using a Bayesian network scoring criterion. The posterior probability of a hypothesis is computed based on the likelihoods of all competing hypotheses. The BNPP can be used not only to evaluate a hypothesis that has previously been discovered or suspected, but also to discover new disease loci associations. The results of experiments using simulated and real data sets are presented. Our results concerning simulated data sets indicate that the BNPP exhibits both better evaluation and discovery performance than does a p-value based method. For the real data sets, previous findings in the literature are confirmed and additional findings are obtained. CONCLUSIONS/SIGNIFICANCE We conclude that the BNPP resolves a pressing problem by providing a way to compute the posterior probability of complex multi-locus hypotheses. A researcher can use the BNPP to determine the expected utility of investigating a hypothesis further. Furthermore, we conclude that the BNPP is a promising method for discovering disease loci associations.
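The step "the posterior probability of a hypothesis is computed based on the likelihoods of all competing hypotheses" is an application of Bayes' rule over a finite set of hypotheses. The sketch below shows that normalization only; it is not the BNPP implementation. It assumes the log-likelihood of each competing model has already been computed by some scoring criterion, and the uniform default prior and the function name `posterior_probabilities` are assumptions made for illustration.

```python
import math

def posterior_probabilities(log_likelihoods, log_priors=None):
    """Posterior P(H_i | D) over competing hypotheses via Bayes' rule.

    log_likelihoods: log P(D | H_i) for each hypothesis
    log_priors: log P(H_i); defaults to a uniform prior (an assumption here)
    """
    n = len(log_likelihoods)
    if log_priors is None:
        log_priors = [-math.log(n)] * n
    log_joint = [ll + lp for ll, lp in zip(log_likelihoods, log_priors)]
    # Log-sum-exp keeps the normalization stable for very negative log-likelihoods.
    peak = max(log_joint)
    log_norm = peak + math.log(sum(math.exp(v - peak) for v in log_joint))
    return [math.exp(v - log_norm) for v in log_joint]

# Three competing hypotheses with made-up likelihoods 0.6, 0.3 and 0.1.
post = posterior_probabilities([math.log(0.6), math.log(0.3), math.log(0.1)])

# Log-likelihoods of realistic magnitude would underflow without log-sum-exp.
post_large = posterior_probabilities([-1000.0, -1001.0])
```

Working in log space matters because model likelihoods computed from hundreds of cases and controls are far too small to represent directly as floating-point probabilities.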
|
38
|
Power and pitfalls of the genome-wide association study approach to identify genes for Alzheimer's disease. Curr Psychiatry Rep 2011; 13:138-46. [PMID: 21312009 PMCID: PMC3154249 DOI: 10.1007/s11920-011-0184-4] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Abstract
Until recently, the search for genes contributing to Alzheimer's disease (AD) had been slow and disappointing, with the notable exception of the APOE ε4 allele, which increases risk and reduces the age at onset of AD in a dose-dependent fashion. Findings from genome-wide association studies (GWAS) made up of fewer than several thousand cases and controls each have not been replicated. Efforts of several consortia--each assembling much larger datasets with sufficient power to detect loci conferring small changes in AD risk--have resulted in robust associations with many novel genes involved in multiple biological pathways. Complex data mining strategies are being used to identify additional members of these pathways and gene-gene interactions contributing to AD risk. Guided by GWAS results, next-generation sequencing and functional studies are under way with the hope of helping us better understand AD pathology and providing new drug targets.
|
39
|
Learning genetic epistasis using Bayesian network scoring criteria. BMC Bioinformatics 2011; 12:89. [PMID: 21453508 PMCID: PMC3080825 DOI: 10.1186/1471-2105-12-89] [Citation(s) in RCA: 71] [Impact Index Per Article: 5.5] [Received: 07/21/2010] [Accepted: 03/31/2011] [Indexed: 02/01/2023] Open
Abstract
Background Gene-gene epistatic interactions likely play an important role in the genetic basis of many common diseases. Recently, machine-learning and data mining methods have been developed for learning epistatic relationships from data. A well-known combinatorial method that has been successfully applied for detecting epistasis is Multifactor Dimensionality Reduction (MDR). Jiang et al. created a combinatorial epistasis learning method called BNMBL to learn Bayesian network (BN) epistatic models. They compared BNMBL to MDR using simulated data sets. Each of these data sets was generated from a model that associates two SNPs with a disease and includes 18 unrelated SNPs. For each data set, BNMBL and MDR were used to score all 2-SNP models, and BNMBL learned significantly more correct models. In real data sets, we ordinarily do not know the number of SNPs that influence phenotype. BNMBL may not perform as well if we also scored models containing more than two SNPs. Furthermore, a number of other BN scoring criteria have been developed. They may detect epistatic interactions even better than BNMBL. Although BNs are a promising tool for learning epistatic relationships from data, we cannot confidently use them in this domain until we determine which scoring criteria work best or even well when we try learning the correct model without knowledge of the number of SNPs in that model. Results We evaluated the performance of 22 BN scoring criteria using 28,000 simulated data sets and a real Alzheimer's GWAS data set. Our results were surprising in that the Bayesian scoring criterion with large values of a hyperparameter called α performed best. This score performed better than other BN scoring criteria and MDR at recall using simulated data sets, at detecting the hardest-to-detect models using simulated data sets, and at substantiating previous results using the real Alzheimer's data set. Conclusions We conclude that representing epistatic interactions using BN models and scoring them using a BN scoring criterion holds promise for identifying epistatic genetic variants in data. In particular, the Bayesian scoring criterion with large values of a hyperparameter α appears more promising than a number of alternatives.
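To make the idea of scoring a SNP model with a Bayesian criterion concrete, the following is a minimal sketch of the widely used BDeu form of the Bayesian score, applied to a local model in which a set of SNPs are parents of a disease node. The abstract does not state which exact form of the score or which α values the authors used, so the formula choice, the three-state genotype coding, and the function name `bdeu_log_score` are assumptions made for illustration.

```python
import math
from collections import Counter

def bdeu_log_score(parent_rows, disease, alpha=1.0, snp_states=3, disease_states=2):
    """Log BDeu score of the local model: SNP parents -> disease node.

    parent_rows: one tuple of SNP genotypes per individual (empty tuples = no parents)
    disease: one disease state (0 .. disease_states - 1) per individual
    alpha: equivalent sample size hyperparameter
    """
    n_parents = len(parent_rows[0]) if parent_rows else 0
    q = snp_states ** n_parents          # number of parent configurations
    r = disease_states                   # number of disease states
    rows = [tuple(row) for row in parent_rows]
    n_j = Counter(rows)                  # counts per parent configuration
    n_jk = Counter(zip(rows, disease))   # counts per (configuration, disease state)
    a_j, a_jk = alpha / q, alpha / (q * r)
    score = 0.0
    # Unobserved configurations contribute zero, so only observed ones are visited.
    for config, count in n_j.items():
        score += math.lgamma(a_j) - math.lgamma(a_j + count)
        for k in range(r):
            score += math.lgamma(a_jk + n_jk[(config, k)]) - math.lgamma(a_jk)
    return score

# Toy data with a purely epistatic pattern: disease = XOR of two SNPs,
# so neither SNP shows a marginal effect on its own.
snp_pairs = [(0, 0), (0, 1), (1, 0), (1, 1)] * 10
status = [a ^ b for a, b in snp_pairs]

score_pair = bdeu_log_score(snp_pairs, status)
score_single = bdeu_log_score([(a,) for a, _ in snp_pairs], status)
score_null = bdeu_log_score([() for _ in snp_pairs], status)
```

On this toy data the two-SNP model scores well above both the single-SNP and the no-SNP models, which is the kind of interaction without marginal effects that single-locus tests miss. Increasing `alpha` changes how strongly sparse counts are smoothed, and that is the hyperparameter the abstract reports as performing best at large values.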
|