1
|
Zhou J, Hoen AG, Mcritchie S, Pathmasiri W, Viles WD, Nguyen QP, Madan JC, Dade E, Karagas MR, Gui J. Information enhanced model selection for Gaussian graphical model with application to metabolomic data. Biostatistics 2022; 23:926-948. [PMID: 33720330 PMCID: PMC9608647 DOI: 10.1093/biostatistics/kxab006] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.5] [Received: 03/13/2020] [Revised: 01/21/2021] [Accepted: 01/22/2021] [Indexed: 11/12/2022] Open
Abstract
In light of the low signal-to-noise nature of many large biological data sets, we propose a novel method to learn the structure of association networks using Gaussian graphical models combined with prior knowledge. Our strategy has two parts. In the first part, we propose a model selection criterion called the structural Bayesian information criterion, in which the prior structure is modeled and incorporated into the Bayesian information criterion. It is shown that the popular extended Bayesian information criterion is a special case of the structural Bayesian information criterion. In the second part, we propose a two-step algorithm to construct the candidate model pool. The algorithm is data-driven and the prior structure is embedded into the candidate model automatically. Theoretical investigation shows that, under some mild conditions, the structural Bayesian information criterion is a consistent model selection criterion for high-dimensional Gaussian graphical models. Simulation studies validate the superiority of the proposed algorithm over existing ones and show its robustness to model misspecification. Application to relative concentration data from infant feces collected from subjects enrolled in a large molecular epidemiological cohort study validates that metabolic pathway involvement is a statistically significant factor for the conditional dependence between metabolites. Furthermore, new relationships among metabolites are discovered that cannot be identified by the conventional methods of pathway analysis. Some of them have been widely recognized in the biological literature.
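For orientation, the extended Bayesian information criterion that the abstract names as a special case is commonly written as below for a Gaussian graphical model with edge set $E$, fitted to $n$ samples of $p$ variables. The notation ($\ell_n$ for the maximized log-likelihood, $\hat{\Theta}(E)$ for the fitted precision matrix, $\gamma$ for the tuning parameter) is an assumption introduced here, not taken from the abstract. The structural criterion itself is not spelled out in the abstract, so only the special case is shown.

```latex
\mathrm{EBIC}_{\gamma}(E)
  = -2\,\ell_n\bigl(\hat{\Theta}(E)\bigr)
  + |E|\,\log n
  + 4\,\gamma\,|E|\,\log p,
  \qquad \gamma \in [0, 1]
```

Setting $\gamma = 0$ recovers the ordinary Bayesian information criterion; larger $\gamma$ penalizes additional edges more heavily as $p$ grows.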
Affiliation(s)
- Jie Zhou
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, 3 Rope Ferry Road, Hanover, NH 03755, USA
| | - Anne G Hoen
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA and Department of Epidemiology, Geisel School of Medicine, Dartmouth College, 3 Rope Ferry Road, Hanover, NH 03755, USA
| | - Susan Mcritchie
- Nutrition Research Institute, Department of Nutrition, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, 500 Laureate Way, Kannapolis, NC 28081, USA
| | - Wimal Pathmasiri
- Nutrition Research Institute, Department of Nutrition, School of Public Health, University of North Carolina at Chapel Hill, Chapel Hill, 500 Laureate Way, Kannapolis, NC 28081, USA
| | - Weston D Viles
- Department of Mathematics and Statistics, University of Southern Maine, 96 Falmouth St, Portland, ME 04103, USA
| | - Quang P Nguyen
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA and Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| | - Juliette C Madan
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| | - Erika Dade
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| | - Margaret R Karagas
- Department of Epidemiology, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| | - Jiang Gui
- Department of Biomedical Data Science, Geisel School of Medicine, Dartmouth College, Hanover, NH, USA
| |
|
2
|
Na S, Kolar M, Koyejo O. Estimating differential latent variable graphical models with applications to brain connectivity. Biometrika 2020. [DOI: 10.1093/biomet/asaa066] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.3] [Indexed: 11/13/2022] Open
Abstract
Summary
Differential graphical models are designed to represent the difference between the conditional dependence structures of two groups, and thus are of particular interest for scientific investigations. Motivated by modern applications, this manuscript considers an extended setting where each group is generated by a latent variable Gaussian graphical model. Due to the existence of latent factors, the differential network is decomposed into sparse and low-rank components, both of which are symmetric indefinite matrices. We estimate these two components simultaneously using a two-stage procedure: (i) an initialization stage, which computes a simple, consistent estimator, and (ii) a convergence stage, implemented using a projected alternating gradient descent algorithm applied to a nonconvex objective, initialized using the output of the first stage. We prove that given the initialization, the estimator converges linearly with a nontrivial, minimax optimal statistical error. Experiments on synthetic and real data illustrate that the proposed nonconvex procedure outperforms existing methods.
Affiliation(s)
- S Na
- Department of Statistics, University of Chicago, 5747 South Ellis Avenue, Chicago, Illinois 60637, U.S.A
| | - M Kolar
- Booth School of Business, University of Chicago, 5807 South Woodlawn Avenue, Chicago, Illinois 60637, U.S.A
| | - O Koyejo
- Department of Computer Science, University of Illinois at Urbana-Champaign, 201 North Goodwin Avenue, Urbana, Illinois 61801, U.S.A
| |
|
3
|
Córdoba I, Bielza C, Larrañaga P. A review of Gaussian Markov models for conditional independence. J Stat Plan Inference 2020. [DOI: 10.1016/j.jspi.2019.09.008] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Indexed: 11/24/2022]
|
4
|
Liu S, Zhang R, Shang X, Li W. Analysis for warning factors of type 2 diabetes mellitus complications with Markov blanket based on a Bayesian network model. Computer Methods and Programs in Biomedicine 2020; 188:105302. [PMID: 31923820 DOI: 10.1016/j.cmpb.2019.105302] [Citation(s) in RCA: 6] [Impact Index Per Article: 1.5] [Received: 05/01/2019] [Revised: 12/05/2019] [Accepted: 12/24/2019] [Indexed: 06/10/2023]
Abstract
BACKGROUND AND OBJECTIVE Type 2 diabetes mellitus (T2DM) complications seriously affect quality of life and cannot be cured completely. Actions should be taken for prevention and self-management. Analysis of warning factors is beneficial for patients, and some previous studies have focused on it. They generally used professional medical test factors or complete factors for prediction and prevention, which was inconvenient and impractical for patient self-management. With this in mind, this study built a Bayesian network (BN) model, from the perspective of diabetic patients' self-management and prevention, to predict six complications of T2DM using selected warning factors that patients could access through medical examination. Furthermore, the model was analyzed to explore the relationships between physiological variables and T2DM complications, as well as among the complications themselves. The model aims to help patients with T2DM self-manage and prevent complications. METHODS The dataset was collected from a well-known data center called the National Health Clinical Center between 1st January 2009 and 31st December 2009. After preprocessing and imputing the data, a BN model merging expert knowledge was built with the Bootstrap and Tabu search algorithms. The Markov Blanket (MB) was used to select the warning factors and predict T2DM complications. Moreover, a Bayesian network without prior information (BN-wopi) model, learned using 10-fold cross-validation in both structure and parameters, was added for a fair comparison with other classifiers learned using 10-fold cross-validation. The warning factors were selected according to the structure learned in each fold and were used for prediction. Finally, the performance of the two BN models using warning features was compared with that of the Naïve Bayes model, Random Forest model, and C5.0 Decision Tree model, which used all features for prediction. In addition, the validation parameters of the proposed model were compared with those in existing studies that used other variables in clinical or biomedical data to predict T2DM complications. RESULTS Experimental results indicated that the BN models using warning factors performed statistically better than their counterparts using all other variables in predicting T2DM complications. In addition, the proposed BN model was effective and significant in predicting diabetic nephropathy (DN) (AUC: 0.831), diabetic foot (DF) (AUC: 0.905), diabetic macrovascular complications (DMV) (AUC: 0.753) and diabetic ketoacidosis (DK) (AUC: 0.877) with the selected warning factors compared with other experiments. CONCLUSIONS The warning factors of DN, DF, DMV, and DK selected by MB in this research might be able to help predict certain T2DM complications effectively, and the proposed BN model might be used as a general tool for prevention, monitoring, and self-management.
Affiliation(s)
- Siying Liu
- School of Economics and Management, Beijing Jiaotong University, Beijing 100044, PR China
| | - Runtong Zhang
- School of Economics and Management, Beijing Jiaotong University, Beijing 100044, PR China.
| | - Xiaopu Shang
- School of Economics and Management, Beijing Jiaotong University, Beijing 100044, PR China
| | - Weizi Li
- Informatics Research Center, University of Reading, Berkshire RG6 6AH, United Kingdom
| |
|
5
|
Behrouzi P, Grootswagers P, Keizer PLC, Smeets ETHC, Feskens EJM, de Groot LCPGM, van Eeuwijk FA. Dietary Intakes of Vegetable Protein, Folate, and Vitamins B-6 and B-12 Are Partially Correlated with Physical Functioning of Dutch Older Adults Using Copula Graphical Models. J Nutr 2020; 150:634-643. [PMID: 31858107 PMCID: PMC7056616 DOI: 10.1093/jn/nxz269] [Citation(s) in RCA: 16] [Impact Index Per Article: 4.0] [Received: 05/13/2019] [Revised: 06/14/2019] [Accepted: 10/09/2019] [Indexed: 01/07/2023] Open
Abstract
BACKGROUND In nutritional epidemiology, dealing with confounding and complex internutrient relations are major challenges. An often-used approach is dietary pattern analyses, such as principal component analysis, to deal with internutrient correlations, and to more closely resemble the true way nutrients are consumed. However, despite these improvements, these approaches still require subjective decisions in the preselection of food groups. Moreover, they do not make efficient use of multivariate dietary data, because they detect only marginal associations. We propose the use of copula graphical models (CGMs) to model and make statistical inferences regarding complex associations among variables in multivariate data, where associations between all variables can be learned simultaneously. OBJECTIVE We aimed to reconstruct nutritional intake and physical functioning networks in Dutch older adults by applying a CGM. METHODS We addressed this issue by uncovering the pairwise associations between variables while correcting for the effect of remaining variables. More specifically, we used a CGM to infer the precision matrix, which contains all the conditional independence relations between nodes in the graph. The nonzero elements of the precision matrix indicate the presence of a direct association. We applied this method to reconstruct nutrient-physical functioning networks from the combined data of 4 studies (Nu-Age, ProMuscle, ProMO, and V-Fit, total n = 662, mean ± SD age = 75 ± 7 y). The method was implemented in the R package nutriNetwork which is freely available at https://cran.r-project.org/web/packages/nutriNetwork. RESULTS Greater intakes of vegetable protein and vitamin B-6 were partially correlated with higher scores on the total Short Physical Performance Battery (SPPB) and the chair rise test. Greater intakes of vitamin B-12 and folate were partially correlated with higher scores on the chair rise test and the total SPPB, respectively. 
CONCLUSIONS We determined that vegetable protein, vitamin B-6, folate, and vitamin B-12 intakes are partially correlated with improved functional outcome measurements in Dutch older adults.
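The link between a precision matrix and direct associations described in the METHODS above can be illustrated with a small sketch. It is not the nutriNetwork implementation and does not use a copula; it only assumes a known, hypothetical 3-variable precision matrix and shows that a zero off-diagonal entry means no direct edge, even when the two variables are marginally correlated. The function names and the example matrices are assumptions made for illustration.

```python
import math


def partial_correlations(precision):
    """Return the partial-correlation matrix implied by a precision matrix K.

    For i != j, rho_ij = -K_ij / sqrt(K_ii * K_jj); the diagonal is set to 1.
    """
    p = len(precision)
    pcor = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(p):
            if i != j:
                scale = math.sqrt(precision[i][i] * precision[j][j])
                pcor[i][j] = -precision[i][j] / scale
    return pcor


def direct_edges(precision, tol=1e-12):
    """List pairs (i, j), i < j, whose precision entry is nonzero."""
    p = len(precision)
    return [(i, j) for i in range(p) for j in range(i + 1, p)
            if abs(precision[i][j]) > tol]


# Hypothetical chain 0 - 1 - 2: variables 0 and 2 interact only through 1.
precision = [[2.0, -1.0, 0.0],
             [-1.0, 2.0, -1.0],
             [0.0, -1.0, 2.0]]
# Inverse of the matrix above, i.e. the implied covariance matrix.
covariance = [[0.75, 0.5, 0.25],
              [0.5, 1.0, 0.5],
              [0.25, 0.5, 0.75]]

pcor = partial_correlations(precision)
edges = direct_edges(precision)
print(edges)                      # pairs with a direct association
print(pcor[0][1], pcor[0][2])     # partial correlations for (0,1) and (0,2)
print(covariance[0][2])           # marginal covariance of 0 and 2
```

Here variables 0 and 2 have a nonzero marginal covariance but a zero partial correlation, which is the distinction between marginal and direct association that the abstract draws.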
Affiliation(s)
- Pariya Behrouzi
- Biometris, Mathematical and Statistical Methods, Wageningen University and Research, Wageningen, Netherlands
| | - Pol Grootswagers
- Department of Human Nutrition, Wageningen University and Research, Wageningen, Netherlands
| | - Paul L C Keizer
- Biometris, Mathematical and Statistical Methods, Wageningen University and Research, Wageningen, Netherlands
| | - Ellen T H C Smeets
- Department of Human Nutrition, Wageningen University and Research, Wageningen, Netherlands
| | - Edith J M Feskens
- Department of Human Nutrition, Wageningen University and Research, Wageningen, Netherlands
| | | | - Fred A van Eeuwijk
- Biometris, Mathematical and Statistical Methods, Wageningen University and Research, Wageningen, Netherlands
| |
|
7
|
Whalen A, Ros-Freixedes R, Wilson DL, Gorjanc G, Hickey JM. Hybrid peeling for fast and accurate calling, phasing, and imputation with sequence data of any coverage in pedigrees. Genet Sel Evol 2018; 50:67. [PMID: 30563452 PMCID: PMC6299538 DOI: 10.1186/s12711-018-0438-2] [Citation(s) in RCA: 30] [Impact Index Per Article: 5.0] [Received: 04/03/2018] [Accepted: 12/11/2018] [Indexed: 12/31/2022] Open
Abstract
BACKGROUND In this paper, we extend multi-locus iterative peeling to provide a computationally efficient method for calling, phasing, and imputing sequence data of any coverage in small or large pedigrees. Our method, called hybrid peeling, uses multi-locus iterative peeling to estimate shared chromosome segments between parents and their offspring at a subset of loci, and then uses single-locus iterative peeling to aggregate genomic information across multiple generations at the remaining loci. RESULTS Using a synthetic dataset, we first analysed the performance of hybrid peeling for calling and phasing genotypes in disconnected families, which contained only a focal individual and its parents and grandparents. Second, we analysed the performance of hybrid peeling for calling and phasing genotypes in the context of a full general pedigree. Third, we analysed the performance of hybrid peeling for imputing whole-genome sequence data to non-sequenced individuals in the population. We found that hybrid peeling substantially increased the number of called and phased genotypes by leveraging sequence information on related individuals. The calling rate and accuracy increased when the full pedigree was used compared to a reduced pedigree of just parents and grandparents. Finally, hybrid peeling accurately imputed whole-genome sequence to non-sequenced individuals. CONCLUSIONS We believe that this algorithm will enable the generation of low-cost and high-accuracy whole-genome sequence data in many pedigreed populations. We make this algorithm available as a standalone program called AlphaPeel.
Affiliation(s)
- Andrew Whalen
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
| | - Roger Ros-Freixedes
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
| | - David L. Wilson
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
| | - Gregor Gorjanc
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
| | - John M. Hickey
- The Roslin Institute and Royal (Dick) School of Veterinary Studies, The University of Edinburgh, Midlothian, Scotland, UK
| |
|
8
|
Alarcon F, Planté-Bordeneuve V, Olsson M, Nuel G. Non-parametric estimation of survival in age-dependent genetic disease and application to the transthyretin-related hereditary amyloidosis. PLoS One 2018; 13:e0203860. [PMID: 30252892 PMCID: PMC6155453 DOI: 10.1371/journal.pone.0203860] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.7] [Received: 07/31/2017] [Accepted: 08/29/2018] [Indexed: 11/30/2022] Open
Abstract
In genetic diseases with variable age of onset, estimating the survival function for mutation carriers and the effects of modifying factors is essential for individual risk assessment, both for the management of mutation carriers and for prevention strategies. In practice, this survival function is classically estimated from pedigree data in which most genotypes are unobserved. In this article, we present a unifying Expectation-Maximization (EM) framework that combines probabilistic computations in Bayesian networks with standard statistical survival procedures in order to provide mutation carrier survival estimates. The proposed approach yields previously published parametric estimates (e.g. Weibull survival) as particular cases, as well as more general Kaplan-Meier non-parametric estimates, which is the main contribution. Note that covariates can also be taken into account using a proportional hazard model. The whole methodology is both validated on simulated data and applied to family samples with transthyretin-related hereditary amyloidosis (a rare autosomal dominant disease with highly variable age of onset), showing very promising results.
Affiliation(s)
- Flora Alarcon
- Mathématiques appliquées Paris 5 (MAP5) CNRS: UMR8145 – Université Paris Descartes – Sorbonne Paris Cité, Paris, France
| | - Violaine Planté-Bordeneuve
- Hôpital Universitaire Henri Mondor, Département de Neurologie Créteil, France
- Inserm, U955-E10, Créteil, France
| | - Malin Olsson
- Umea university, Norrlands university hospital, NUS M31, Umea, Sweden
| | - Grégory Nuel
- Institute of Mathematics (INSMI), National Center for French Research (CNRS), Paris, France
- Laboratory of Probability (LPMA), Université Pierre et Marie Curie, Sorbonne Université, Paris, France
| |
|
9
|
Backes M, Berrang P, Humbert M, Wolf V. Simulating the Large-Scale Erosion of Genomic Privacy Over Time. IEEE/ACM Transactions on Computational Biology and Bioinformatics 2018; 15:1405-1412. [PMID: 30047894 DOI: 10.1109/tcbb.2018.2859380] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Indexed: 06/08/2023]
Abstract
The dramatically decreasing costs of DNA sequencing have prompted more than a million people to have their genotypes sequenced. Moreover, these individuals increasingly make their genomic data publicly available, thereby creating privacy threats for themselves and their relatives because of their DNA similarities. More generally, an entity that gains access to a significant fraction of sequenced genotypes might be able to infer even the genomes of unsequenced individuals. In this paper, we propose a simulation-based model for quantifying the impact of continuously sequencing and publicizing personal genomic data on a population's genomic privacy. Our simulation probabilistically models data sharing and takes into account events such as migration and interracial mating. As an example, we instantiate our simulation with a sample population of 1,000 individuals and evaluate the privacy under multiple settings over 6,000 genomic variants and a subset of phenotype-related variants. Our findings demonstrate that an increasing sharing rate in the future entails a substantial negative effect on the privacy of all older generations. Moreover, we find that mixed populations face a less severe erosion of privacy over time than more homogeneous populations. Finally, we demonstrate that genomic-data sharing can be much more detrimental for the privacy of the phenotype-related variants.
|
10
|
Ehrmann C, Bickenbach J, Stucki G. Graphical modelling: a tool for describing and understanding the functioning of people living with a health condition. Eur J Phys Rehabil Med 2017; 55:131-135. [PMID: 29144108 DOI: 10.23736/s1973-9087.17.04970-x] [Citation(s) in RCA: 3] [Impact Index Per Article: 0.4] [Indexed: 11/08/2022]
Abstract
Rehabilitation aims to optimize people's lived experience of health or functioning. A comprehensive understanding of people's functioning is thus fundamental for rehabilitation clinicians and scientists. Over the past ten years it has been shown that graphical modelling is a promising technique for modelling data on people's functioning. It can contribute to our understanding of the complex associations between domains of functioning and the identification of potential targets for rehabilitation interventions both at the level of the person and the environment. The objective of this methodological note is to demonstrate how graphical modelling can be used by rehabilitation clinicians and scientists in the description, understanding and influencing of people's functioning. The application of graphical modelling and the interpretation of results is illustrated using the Spinal Cord Injury Independence Measure - Self Report used in the Swiss Spinal Cord Injury Cohort Study. Finally, we discuss the potential of graphical modelling for the planning of studies that expand our understanding of functioning and for rehabilitation interventions.
Affiliation(s)
- Cristina Ehrmann
- Department of Health Sciences and Health Policy, Faculty of Humanities and Social Sciences, University of Lucerne, Lucerne, Switzerland; Swiss Paraplegic Research (SPF), Nottwil, Switzerland
| | - Jerome Bickenbach
- Department of Health Sciences and Health Policy, Faculty of Humanities and Social Sciences, University of Lucerne, Lucerne, Switzerland; Swiss Paraplegic Research (SPF), Nottwil, Switzerland; ICF Research Branch, a cooperation partner within the WHO Collaborating Center for the Family of International Classifications in Germany (at DIMDI), Nottwil, Switzerland
| | - Gerold Stucki
- Department of Health Sciences and Health Policy, Faculty of Humanities and Social Sciences, University of Lucerne, Lucerne, Switzerland; Swiss Paraplegic Research (SPF), Nottwil, Switzerland; ICF Research Branch, a cooperation partner within the WHO Collaborating Center for the Family of International Classifications in Germany (at DIMDI), Nottwil, Switzerland
| |
|
11
|
Nuel G, Lefebvre A, Bouaziz O. Computing Individual Risks Based on Family History in Genetic Disease in the Presence of Competing Risks. Computational and Mathematical Methods in Medicine 2017; 2017:9193630. [PMID: 29312466 PMCID: PMC5700554 DOI: 10.1155/2017/9193630] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 06/08/2017] [Revised: 08/22/2017] [Accepted: 09/10/2017] [Indexed: 12/01/2022]
Abstract
When considering a genetic disease with variable age at onset (e.g., familial amyloid neuropathy, cancers), computing the individual risk of the disease based on family history (FH) is of critical interest for both clinicians and patients. Such a risk is very challenging to compute because (1) the genotype X of the individual of interest is in general unknown, (2) the posterior distribution ℙ(X∣FH, T > t) changes with t (T is the age at disease onset for the targeted individual), and (3) the competing risk of death is not negligible. In this work, we present modeling of this problem using a Bayesian network mixed with (right-censored) survival outcomes where hazard rates only depend on the genotype of each individual. We explain how belief propagation can be used to obtain posterior distribution of genotypes given the FH and how to obtain a time-dependent posterior hazard rate for any individual in the pedigree. Finally, we use this posterior hazard rate to compute individual risk, with or without the competing risk of death. Our method is illustrated using the Claus-Easton model for breast cancer. The competing risk of death is derived from the national French registry.
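Point (2) of the abstract, that the genotype posterior changes with t, can be seen in a deliberately simplified sketch. It assumes a single individual, a binary carrier status X, and hypothetical constant hazards for carriers and non-carriers; family history is taken to be already folded into the prior. None of the numbers or function names come from the paper, and this is not the Claus-Easton model or the belief-propagation procedure the authors describe.

```python
import math


def carrier_posterior(prior, hazard_carrier, hazard_noncarrier, t):
    """P(X = carrier | T > t) under constant (exponential) hazards.

    Bayes' rule: prior * S_1(t) / (prior * S_1(t) + (1 - prior) * S_0(t)),
    where S_x(t) = exp(-hazard_x * t) is the disease-free survival.
    """
    s1 = math.exp(-hazard_carrier * t)
    s0 = math.exp(-hazard_noncarrier * t)
    return prior * s1 / (prior * s1 + (1.0 - prior) * s0)


def posterior_hazard(prior, hazard_carrier, hazard_noncarrier, t):
    """Hazard at age t given T > t: genotype hazards weighted by the posterior."""
    w = carrier_posterior(prior, hazard_carrier, hazard_noncarrier, t)
    return w * hazard_carrier + (1.0 - w) * hazard_noncarrier


# Hypothetical inputs: 50% prior carrier probability, yearly hazards.
PRIOR, H_CARRIER, H_NONCARRIER = 0.5, 0.05, 0.005
ages = (0, 20, 40, 60)
posteriors = [carrier_posterior(PRIOR, H_CARRIER, H_NONCARRIER, t) for t in ages]
hazards = [posterior_hazard(PRIOR, H_CARRIER, H_NONCARRIER, t) for t in ages]

for t, w, h in zip(ages, posteriors, hazards):
    print(f"age {t:2d}: P(carrier | T > t) = {w:.3f}, posterior hazard = {h:.4f}")
```

The longer the individual remains disease-free, the less likely carrier status becomes, so the posterior hazard drifts from the prior-weighted average toward the non-carrier hazard.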
Affiliation(s)
- Gregory Nuel
- LPMA, UMR CNRS 7599, Paris, France
- UPMC, Sorbonne Universités, Paris, France
| | | | | |
|
12
|
Martínez CA, Khare K, Rahman S, Elzo MA. Gaussian covariance graph models accounting for correlated marker effects in genome-wide prediction. J Anim Breed Genet 2017; 134:412-421. [PMID: 28804930 DOI: 10.1111/jbg.12286] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 12/13/2016] [Accepted: 06/30/2017] [Indexed: 11/26/2022]
Abstract
Several statistical models used in genome-wide prediction assume uncorrelated marker allele substitution effects, but it is known that these effects may be correlated. In statistics, graphical models have been identified as a useful tool for covariance estimation in high-dimensional problems, and the area has recently expanded greatly. In Gaussian covariance graph models (GCovGM), the joint distribution of a set of random variables is assumed to be Gaussian and the pattern of zeros of the covariance matrix is encoded in terms of an undirected graph G. In this study, methods adapting the theory of GCovGM to genome-wide prediction were developed (Bayes GCov, Bayes GCov-KR and Bayes GCov-H). In simulated data sets, improvements were found in the correlation between phenotypes and predicted breeding values and in the accuracies of predicted breeding values. Our models account for correlation of marker effects and can accommodate general structures, as opposed to models proposed in previous studies, which consider spatial correlation only. In addition, they allow biological information to be incorporated in the prediction process through its use when constructing graph G, and their extension to the multi-allelic loci case is straightforward.
Affiliation(s)
- C A Martínez
- Department of Animal Sciences, University of Florida, Gainesville, FL, USA
| | - K Khare
- Department of Statistics, University of Florida, Gainesville, FL, USA
| | - S Rahman
- Department of Statistics, University of Florida, Gainesville, FL, USA
| | - M A Elzo
- Department of Animal Sciences, University of Florida, Gainesville, FL, USA
| |
|
13
|
Green PJ, Mortera J. Paternity testing and other inference about relationships from DNA mixtures. Forensic Sci Int Genet 2017; 28:128-137. [DOI: 10.1016/j.fsigen.2017.02.001] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.4] [Received: 10/07/2016] [Revised: 01/11/2017] [Accepted: 02/02/2017] [Indexed: 01/29/2023]
|
14
|
Madsen T, Hobolth A, Jensen JL, Pedersen JS. Significance evaluation in factor graphs. BMC Bioinformatics 2017; 18:199. [PMID: 28359297 PMCID: PMC5374669 DOI: 10.1186/s12859-017-1614-z] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 09/29/2016] [Accepted: 03/24/2017] [Indexed: 01/05/2023] Open
Abstract
BACKGROUND Factor graphs provide a flexible and general framework for specifying probability distributions. They can capture a range of popular and recent models for analysis of genomics data as well as data from other scientific fields. Owing to the ever larger data sets encountered in genomics and the multiple-testing issues accompanying them, accurate significance evaluation is of great importance. Here we address the problem of evaluating statistical significance of observations from factor graph models. RESULTS Two novel numerical approximations for evaluation of statistical significance are presented. The first uses importance sampling; the second is based on a saddlepoint approximation. We develop algorithms to efficiently compute the approximations and compare them to naive sampling and the normal approximation. The individual merits of the methods are analysed both from a theoretical viewpoint and with simulations. A guideline for choosing between the normal approximation, saddlepoint approximation and importance sampling is also provided. Finally, the applicability of the methods is demonstrated with examples from cancer genomics, motif-analysis and phylogenetics. CONCLUSIONS The applicability of saddlepoint approximation and importance sampling is demonstrated on known models in the factor graph framework. Using the two methods we can substantially reduce computational cost without compromising accuracy. This contribution allows analyses of large datasets in the general factor graph framework.
Affiliation(s)
- Tobias Madsen
- Department of Molecular Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus, Denmark; Bioinformatics Research Center, Aarhus University, C.F. Møllers Allé 8, Aarhus, Denmark
| | - Asger Hobolth
- Bioinformatics Research Center, Aarhus University, C.F. Møllers Allé 8, Aarhus, Denmark
| | - Jens Ledet Jensen
- Department of Mathematics, Aarhus University, Ny Munkegade 118, Aarhus, Denmark
| | - Jakob Skou Pedersen
- Department of Molecular Medicine, Aarhus University, Palle Juul-Jensens Boulevard 99, Aarhus, Denmark; Bioinformatics Research Center, Aarhus University, C.F. Møllers Allé 8, Aarhus, Denmark
| |
|
15
|
Humbert M, Ayday E, Hubaux JP, Telenti A. Quantifying Interdependent Risks in Genomic Privacy. ACM Transactions on Privacy and Security 2017. [DOI: 10.1145/3035538] [Citation(s) in RCA: 7] [Impact Index Per Article: 1.0] [Indexed: 10/20/2022]
Abstract
The rapid progress in human-genome sequencing is leading to a high availability of genomic data. These data are notoriously sensitive, stable over time, and highly correlated among relatives. In this article, we study the implications of these familial correlations on kin genomic privacy. We formalize the problem and detail efficient reconstruction attacks based on graphical models and belief propagation. With our approach, an attacker can infer the genomes of the relatives of an individual whose genome or phenotype is observed, relying notably on Mendel’s Laws, on statistical relationships between the genomic variants, and on relationships between the genome and the phenotype. We evaluate the effect of these dependencies on privacy with respect to the amount of observed variants and the relatives sharing them. We also study how the algorithmic performance evolves when we take these various relationships into account. Furthermore, to quantify the level of genomic privacy as a result of the proposed inference attack, we discuss possible definitions of genomic privacy metrics, and compare their values and evolution. Genomic data reveals Mendelian disorders and the likelihood of developing severe diseases, such as Alzheimer’s. We also introduce the quantification of health privacy, specifically, the measure of how well the predisposition to a disease is concealed from an attacker. We evaluate our approach on actual genomic data from a pedigree and show the extent of the threat by combining data gathered from a genome-sharing website as well as an online social network.
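The role of Mendel's Laws in such an inference can be illustrated with a minimal sketch. It is not the authors' belief-propagation attack: it handles a single biallelic variant and one parent-child pair by brute-force enumeration. The function names, the genotype coding (number of minor alleles: 0, 1 or 2), the Hardy-Weinberg prior and the minor-allele frequency are all assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product

GENOTYPES = (0, 1, 2)  # assumed coding: number of minor alleles carried


def child_genotype_dist(mother, father):
    """P(child genotype | parental genotypes) under Mendelian segregation."""
    def gamete(g):
        # A parent with g minor alleles transmits the minor allele with probability g/2.
        p = Fraction(g, 2)
        return {0: 1 - p, 1: p}

    dist = {g: Fraction(0) for g in GENOTYPES}
    for (a, pa), (b, pb) in product(gamete(mother).items(), gamete(father).items()):
        dist[a + b] += pa * pb
    return dist


def hwe_prior(maf):
    """Hardy-Weinberg genotype prior for an assumed minor-allele frequency."""
    q = Fraction(maf)
    return {0: (1 - q) ** 2, 1: 2 * q * (1 - q), 2: q ** 2}


def parent_posterior(child, maf):
    """P(mother's genotype | observed child genotype), with the father marginalized out."""
    prior = hwe_prior(maf)
    joint = {m: Fraction(0) for m in GENOTYPES}
    for m, f in product(GENOTYPES, repeat=2):
        joint[m] += prior[m] * prior[f] * child_genotype_dist(m, f)[child]
    total = sum(joint.values())
    return {m: p / total for m, p in joint.items()}


if __name__ == "__main__":
    print(child_genotype_dist(1, 1))
    print(parent_posterior(2, Fraction(1, 10)))
```

Observing a child who carries two minor alleles rules out genotype 0 for the mother, which is the kind of leakage between relatives that the abstract refers to.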
|
16
|
Abstract
Bayesian networks are probabilistic models that represent complex distributions in a modular way and have become very popular in many fields. There are many methods to build Bayesian networks from a random sample of independent and identically distributed observations. However, many observational studies are designed using some form of clustered sampling that introduces correlations between observations within the same cluster, and ignoring this correlation typically inflates the rate of false positive associations. We describe a novel parameterization of Bayesian networks that uses random effects to model the correlation within sample units and can be used for structure and parameter learning from correlated data without inflating the Type I error rate. We compare different learning metrics using simulations and illustrate the method in two real examples: an analysis of genetic and non-genetic factors associated with human longevity from a family-based study, and an example of risk factors for complications of sickle cell anemia from a longitudinal study with repeated measures.
|
17
|
Anderson EC, Ng TC. Bayesian pedigree inference with small numbers of single nucleotide polymorphisms via a factor-graph representation. Theor Popul Biol 2015; 107:39-51. [PMID: 26450523 DOI: 10.1016/j.tpb.2015.09.005] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/04/2015] [Revised: 09/23/2015] [Accepted: 09/24/2015] [Indexed: 10/22/2022]
Abstract
We develop a computational framework for addressing pedigree inference problems using small numbers (80-400) of single nucleotide polymorphisms (SNPs). Our approach relaxes the assumptions, which are commonly made, that sampling is complete with respect to the pedigree and that there is no genotyping error. It relies on representing the inferred pedigree as a factor graph and invoking the Sum-Product algorithm to compute and store quantities that allow the joint probability of the data to be rapidly computed under a large class of rearrangements of the pedigree structure. This allows efficient MCMC sampling over the space of pedigrees, and, hence, Bayesian inference of pedigree structure. In this paper we restrict ourselves to inference of pedigrees without loops using SNPs assumed to be unlinked. We present the methodology in general for multigenerational inference, and we illustrate the method by applying it to the inference of full sibling groups in a large sample (n=1157) of Chinook salmon typed at 95 SNPs. The results show that our method provides a better point estimate and estimate of uncertainty than the currently best-available maximum-likelihood sibling reconstruction method. Extensions of this work to more complex scenarios are briefly discussed.
Affiliation(s)
- Eric C Anderson
- Fisheries Ecology Division, Southwest Fisheries Science Center, National Marine Fisheries Service, National Oceanic and Atmospheric Administration, 110 Shaffer Road, Santa Cruz, CA 95060, USA.
| | - Thomas C Ng
- Department of Biomolecular Engineering, University of California, Santa Cruz, CA, USA
|
18
|
Sugaya Y, Shibata R. Probability inheritance algorithm and its implementation. J STAT COMPUT SIM 2015. [DOI: 10.1080/00949655.2014.915032] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/25/2022]
|
19
|
Abstract
We consider the emerging problem of comparing the similarity between (unlabeled) pedigrees. More specifically, we focus on the simplest pedigrees, namely, the 2-generation pedigrees. We show that the isomorphism testing for two 2-generation pedigrees is GI-hard. If the 2-generation pedigrees are monogamous (i.e., each individual at level-1 can mate with exactly one partner) then the isomorphism testing problem can be solved in polynomial time. We then consider the problem by relaxing it into an NP-complete decomposition problem which can be formulated as the Minimum Common Integer Pair Partition (MCIPP) problem, which we show to be FPT by exploiting a property of the optimal solution. While there is still some difficulty to overcome, this lays down a solid foundation for this research.
Affiliation(s)
- Haitao Jiang
- School of Computer Science and Technology, Shandong University, 1500 Shunhua Road, Jinan, Shandong, 250101, China
| | - Guohui Lin
- Department of Computing Science, University of Alberta, Edmonton, Alberta, T2G 2E6, Canada
| | - Weitian Tong
- Department of Computing Science, University of Alberta, Edmonton, Alberta, T2G 2E6, Canada
| | - Daming Zhu
- School of Computer Science and Technology, Shandong University, 1500 Shunhua Road, Jinan, Shandong, 250101, China
| | - Binhai Zhu
- Department of Computer Science, Montana State University, Bozeman, MT, 59717, USA
|
20
|
Improved maximum likelihood reconstruction of complex multi-generational pedigrees. Theor Popul Biol 2014; 97:11-9. [DOI: 10.1016/j.tpb.2014.07.002] [Citation(s) in RCA: 11] [Impact Index Per Article: 1.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 01/26/2014] [Revised: 07/11/2014] [Accepted: 07/16/2014] [Indexed: 11/17/2022]
|
21
|
Corradi F, Ricciardi F. Evaluation of kinship identification systems based on short tandem repeat DNA profiles. J R Stat Soc Ser C Appl Stat 2013. [DOI: 10.1111/rssc.12017] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/25/2022]
|
22
|
Johnson S, Marker L, Mengersen K, Gordon CH, Melzheimer J, Schmidt-Küntzel A, Nghikembua M, Fabiano E, Henghali J, Wachter B. Modeling the viability of the free-ranging cheetah population in Namibia: an object-oriented Bayesian network approach. Ecosphere 2013. [DOI: 10.1890/es12-00357.1] [Citation(s) in RCA: 23] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/18/2022] Open
|
23
|
Cussens J, Bartlett M, Jones EM, Sheehan NA. Maximum Likelihood Pedigree Reconstruction Using Integer Linear Programming. Genet Epidemiol 2012; 37:69-83. [DOI: 10.1002/gepi.21686] [Citation(s) in RCA: 52] [Impact Index Per Article: 4.3] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/02/2012] [Revised: 08/30/2012] [Accepted: 09/07/2012] [Indexed: 11/10/2022]
Affiliation(s)
- James Cussens
- Department of Computer Science; University of York; York; North Yorkshire; United Kingdom
| | - Mark Bartlett
- Department of Computer Science; University of York; York; North Yorkshire; United Kingdom
| | - Elinor M. Jones
- Department of Health Sciences; University of Leicester; Leicester; Leicestershire; United Kingdom
| | - Nuala A. Sheehan
- Department of Health Sciences; University of Leicester; Leicester; Leicestershire; United Kingdom
|
24
|
Kirkpatrick B, Reshef Y, Finucane H, Jiang H, Zhu B, Karp RM. Comparing pedigree graphs. J Comput Biol 2012; 19:998-1014. [PMID: 22897201 DOI: 10.1089/cmb.2011.0254] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/06/2023] Open
Abstract
Pedigree graphs, or family trees, are typically constructed by an expensive process of examining genealogical records to determine which pairs of individuals are parent and child. New methods to automate this process take as input genetic data from a set of extant individuals and reconstruct ancestral individuals. There is a great need to evaluate the quality of these methods by comparing the estimated pedigree to the true pedigree. In this article, we consider two main pedigree comparison problems. The first is the pedigree isomorphism problem, for which we present a linear-time algorithm for leaf-labeled pedigrees. The second is the pedigree edit distance problem, for which we present (1) several algorithms that are fast and exact in various special cases, and (2) a general, randomized heuristic algorithm. In the negative direction, we first prove that the pedigree isomorphism problem is as hard as the general graph isomorphism problem, and that the sub-pedigree isomorphism problem is NP-hard. We then show that the pedigree edit distance problem is APX-hard in general and NP-hard on leaf-labeled pedigrees. We use simulated pedigrees to compare our edit-distance algorithms to each other as well as to a branch-and-bound algorithm that always finds an optimal solution.
Affiliation(s)
- Bonnie Kirkpatrick
- Electrical Engineering and Computer Sciences, University of California, Berkeley, California, USA.
|
25
|
Chang HH, Soderberg K, Skinner JA, Banchereau J, Chaussabel D, Haynes BF, Ramoni M, Letvin NL. Transcriptional network predicts viral set point during acute HIV-1 infection. J Am Med Inform Assoc 2012; 19:1103-9. [PMID: 22700869 DOI: 10.1136/amiajnl-2012-000867] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.9] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/29/2023] Open
Abstract
BACKGROUND HIV-1-infected individuals with higher viral set points progress to AIDS more rapidly than those with lower set points. Predicting viral set point early following infection can contribute to our understanding of early control of HIV-1 replication, to predicting long-term clinical outcomes, and to the choice of optimal therapeutic regimens. METHODS In a longitudinal study of 10 untreated HIV-1-infected patients, we used gene expression profiling of peripheral blood mononuclear cells to identify transcriptional networks for viral set point prediction. At each sampling time, a statistical analysis inferred the optimal transcriptional network that best predicted viral set point. We then assessed the accuracy of this transcriptional model by predicting viral set point in an independent cohort of 10 untreated HIV-1-infected patients from Malawi. RESULTS The gene network inferred at time of enrollment predicted viral set point 24 weeks later in the independent Malawian cohort with an accuracy of 87.5%. As expected, the predictive accuracy of the networks inferred at later time points was even greater, exceeding 90% after week 4. The composition of the inferred networks was largely conserved between time points. The 12 genes comprising this dynamic signature of viral set point implicated the involvement of two major canonical pathways: interferon signaling (p<0.0003) and membrane fraction (p<0.02). An in silico knockout study showed that HLA-DRB1 and C4BPA may contribute to restricting HIV-1 replication. CONCLUSIONS Longitudinal gene expression profiling of peripheral blood mononuclear cells from patients with acute HIV-1 infection can be used to create transcriptional network models that predict viral set point early with a high degree of accuracy.
Affiliation(s)
- Hsun-Hsien Chang
- Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, Massachusetts, USA.
|
26
|
Bayesian networks for evaluating forensic DNA profiling evidence: A review and guide to literature. Forensic Sci Int Genet 2012; 6:147-57. [DOI: 10.1016/j.fsigen.2011.06.009] [Citation(s) in RCA: 57] [Impact Index Per Article: 4.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 02/22/2011] [Revised: 06/07/2011] [Accepted: 06/24/2011] [Indexed: 11/17/2022]
|
27
|
Abel HJ, Thomas A. Case-control association testing by graphical modeling for the Genetic Analysis Workshop 17 mini-exome sequence data. BMC Proc 2011; 5 Suppl 9:S62. [PMID: 22373360 PMCID: PMC3287901 DOI: 10.1186/1753-6561-5-s9-s62] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/10/2022] Open
Abstract
We generalize recent work on graphical models for linkage disequilibrium to estimate the conditional independence structure between all variables for individuals in the Genetic Analysis Workshop 17 unrelated individuals data set. Using a stepwise approach for computational efficiency and an extension of our previously described methods, we estimate a model that describes the relationships between the disease trait, all quantitative variables, all covariates, ethnic origin, and the loci most strongly associated with these variables. We performed our analysis for the first 50 replicate data sets. We found that our approach was able to describe the relationships between the outcomes and covariates and that it could correctly detect associations of disease with several loci and with a reasonable false-positive detection rate.
Affiliation(s)
- Haley J Abel
- Division of Genetic Epidemiology, University of Utah, 391 Chipeta Way, Salt Lake City, UT 84105, USA.
|
28
|
Zhou J, Wang D, Schlegel V, Zempleni J. Development of an internet based system for modeling biotin metabolism using Bayesian networks. COMPUTER METHODS AND PROGRAMS IN BIOMEDICINE 2011; 104:254-9. [PMID: 21356565 PMCID: PMC3132571 DOI: 10.1016/j.cmpb.2011.02.004] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Subscribe] [Scholar Register] [Received: 10/14/2010] [Accepted: 02/04/2011] [Indexed: 05/30/2023]
Abstract
Biotin is an essential water-soluble vitamin crucial for maintaining normal body functions. The importance of biotin for human health has been under-appreciated but there is plenty of opportunity for future research with great importance for human health. Currently, carrying out predictions of biotin metabolism involves tedious manual manipulations. In this paper, we report the development of BiotinNet, an internet based program that uses Bayesian networks to integrate published data on various aspects of biotin metabolism. Users can provide a combination of values on the levels of biotin related metabolites to obtain the predictions on other metabolites that are not specified. As an inherent feature of Bayesian networks, the uncertainty of the prediction is also quantified and reported to the user. This program enables convenient in silico experiments regarding biotin metabolism, which can help researchers design future experiments while new data can be continuously incorporated.
Affiliation(s)
- Jinglei Zhou
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
| | - Dong Wang
- Department of Statistics, University of Nebraska-Lincoln, Lincoln, NE 68583, USA, Tel: +1 4024724921 fax: +1 4024720736
| | - Vicki Schlegel
- Department of Food Science and Technology, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
| | - Janos Zempleni
- Department of Nutrition and Health Sciences, University of Nebraska-Lincoln, Lincoln, NE 68583, USA
|
29
|
Kirkpatrick B, Li SC, Karp RM, Halperin E. Pedigree reconstruction using identity by descent. J Comput Biol 2011; 18:1481-93. [PMID: 22035331 DOI: 10.1089/cmb.2011.0156] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/04/2023] Open
Abstract
Can we find the family trees, or pedigrees, that relate the haplotypes of a group of individuals? Collecting the genealogical information for how individuals are related is a very time-consuming and expensive process. Methods for automating the construction of pedigrees could streamline this process. While constructing single-generation families is relatively easy given whole genome data, reconstructing multi-generational, possibly inbred, pedigrees is much more challenging. This article addresses the important question of reconstructing monogamous, regular pedigrees, where pedigrees are regular when individuals mate only with other individuals at the same generation. This article introduces two multi-generational pedigree reconstruction methods: one for inbreeding relationships and one for outbreeding relationships. In contrast to previous methods that focused on the independent estimation of relationship distances between every pair of typed individuals, here we present methods that aim at the reconstruction of the entire pedigree. We show that both our methods outperform the state-of-the-art and that the outbreeding method is capable of reconstructing pedigrees at least six generations back in time with high accuracy. The two programs are available at http://cop.icsi.berkeley.edu/cop/.
Affiliation(s)
- Bonnie Kirkpatrick
- Electrical Engineering and Computer Sciences, University of California, Berkeley, California 94720, USA.
|
30
|
Jin L, Zhu W, Guo J. Genome-wide association studies using haplotype clustering with a new haplotype similarity. Genet Epidemiol 2011; 34:633-41. [PMID: 20718046 DOI: 10.1002/gepi.20521] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/08/2023]
Abstract
Association analysis, which investigates genetic variation in order to detect associations with observable traits, has played an increasing part in understanding the genetic basis of diseases. Among these methods, haplotype-based association studies are believed to possess prominent advantages, especially for rare diseases in case-control studies. However, modeling these haplotypes is subject to statistical problems caused by rare haplotypes. Haplotype clustering offers an appealing solution. In this research, we have developed a new haplotype similarity for the "affinity propagation" clustering algorithm, which accounts well for rare haplotypes and so controls the degrees of freedom. The new similarity can incorporate haplotype structure information, which is believed to enhance the power and provide high resolution for identifying associations between genetic variants and disease. Our simulation studies show that the proposed approach offers merits in detecting disease-marker associations in comparison with the cladistic haplotype clustering method CLADHC. We also illustrate an application of our method to cystic fibrosis, which shows quite accurate estimates during fine mapping.
Affiliation(s)
- Lina Jin
- Key Laboratory for Applied Statistics of MOE, Northeast Normal University, Changchun, Jilin, China
|
31
|
Kirkpatrick B, Li SC, Karp RM, Halperin E. Pedigree Reconstruction Using Identity by Descent. LECTURE NOTES IN COMPUTER SCIENCE 2011. [DOI: 10.1007/978-3-642-20036-6_15] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 02/04/2023]
|
32
|
Chang HH, Dreyfuss JM, Ramoni MF. A transcriptional network signature characterizes lung cancer subtypes. Cancer 2010; 117:353-60. [PMID: 20839314 DOI: 10.1002/cncr.25592] [Citation(s) in RCA: 25] [Impact Index Per Article: 1.8] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2010] [Revised: 07/20/2010] [Accepted: 07/20/2010] [Indexed: 11/07/2022]
Abstract
BACKGROUND Transcriptional networks play a central role in cancer development. The authors described a systems biology approach to cancer classification based on the reverse engineering of the transcriptional network surrounding the 2 most common types of lung cancer: adenocarcinoma (AC) and squamous cell carcinoma (SCC). METHODS A transcriptional network classifier was inferred from the molecular profiles of 111 human lung carcinomas. The authors tested its classification accuracy in 7 independent cohorts, for a total of 422 subjects of Caucasian, African, and Asian descent. RESULTS The model for distinguishing AC from SCC was a 25-gene network signature. Its performance on the 7 independent cohorts achieved 95.2% classification accuracy. Even more surprisingly, 95% of this accuracy was explained by the interplay of 3 genes (KRT6A, KRT6B, KRT6C) on a narrow cytoband of chromosome 12. The role of this chromosomal region in distinguishing AC and SCC was further confirmed by the analysis of another group of 28 independent subjects assayed by DNA copy number changes. The copy number variations of bands 12q12, 12q13, and 12q12-13 discriminated these samples with 84% accuracy. CONCLUSIONS These results suggest the existence of a robust signature localized in a relatively small area of the genome, and show the clinical potential of reverse engineering transcriptional networks from molecular profiles.
Affiliation(s)
- Hsun-Hsien Chang
- Children's Hospital Informatics Program, Harvard-Massachusetts Institute of Technology Division of Health Sciences and Technology, Harvard Medical School, Boston, Massachusetts 02115, USA.
|
33
|
Kirkpatrick B, Halperin E, Karp RM. Haplotype Inference in Complex Pedigrees. J Comput Biol 2010; 17:269-80. [DOI: 10.1089/cmb.2009.0174] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/13/2022] Open
Affiliation(s)
- Bonnie Kirkpatrick
- Computer Science Department, University of California, Berkeley, and the International Computer Science Institute, Berkeley, California
| | - Eran Halperin
- School of Computer Science and the Department of Biotechnology, Tel-Aviv University, and the International Computer Science Institute, Berkeley, California
| | - Richard M. Karp
- Computer Science Department, University of California, Berkeley, and the International Computer Science Institute, Berkeley, California
|
34
|
Totir LR, Fernando RL, Abraham J. An efficient algorithm to compute marginal posterior genotype probabilities for every member of a pedigree with loops. Genet Sel Evol 2009; 41:52. [PMID: 19958551 PMCID: PMC2801663 DOI: 10.1186/1297-9686-41-52] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 04/22/2009] [Accepted: 12/03/2009] [Indexed: 11/10/2022] Open
Abstract
Background Marginal posterior genotype probabilities need to be computed for genetic analyses such as genetic counseling in humans and selective breeding in animal and plant species. Methods In this paper, we describe a peeling-based, deterministic, exact algorithm to efficiently compute genotype probabilities for every member of a pedigree with loops without recourse to junction-tree methods from graph theory. The efficiency in computing the likelihood by peeling comes from storing intermediate results in multidimensional tables called cutsets. Computing marginal genotype probabilities for individual i requires recomputing the likelihood for each of the possible genotypes of individual i. This can be done efficiently by storing intermediate results in two types of cutsets called anterior and posterior cutsets and reusing these intermediate results to compute the likelihood. Examples A small example is used to illustrate the theoretical concepts discussed in this paper, and marginal genotype probabilities are computed at a monogenic disease locus for every member in a real cattle pedigree.
Affiliation(s)
- Liviu R Totir
- Department of Animal Science and Center for Integrated Animal Genomics, Iowa State University, Ames, Iowa 50011, USA.
|
35
|
Cowell RG. Efficient maximum likelihood pedigree reconstruction. Theor Popul Biol 2009; 76:285-91. [PMID: 19781561 DOI: 10.1016/j.tpb.2009.09.002] [Citation(s) in RCA: 24] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 05/08/2009] [Revised: 09/10/2009] [Accepted: 09/16/2009] [Indexed: 11/24/2022]
Abstract
A simple and efficient algorithm is presented for finding a maximum likelihood pedigree using microsatellite (STR) genotype information on a complete sample of related individuals. The computational complexity of the algorithm is at worst O(n^3 2^n), where n is the number of individuals. Thus it is possible to exhaustively search the space of all pedigrees of up to thirty individuals for one that maximizes the likelihood. A priori age and sex information can be used if available, but is not essential. The algorithm is applied in a simulation study, and to some real data on humans.
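To give a feel for why about thirty individuals is a practical ceiling, the short sketch below evaluates the leading term n^3 * 2^n of the worst-case bound stated in the abstract. It assumes the exponents read n cubed times two to the n, ignores constant factors, and uses a hypothetical helper name; it says nothing about the algorithm's actual running time.

```python
def worst_case_term(n: int) -> int:
    """Leading term n^3 * 2^n of the stated worst-case bound (constants ignored)."""
    return n ** 3 * 2 ** n


if __name__ == "__main__":
    for n in (10, 20, 30, 40):
        print(f"n={n:2d}  n^3 * 2^n = {worst_case_term(n):.3e}")
```

Each additional individual roughly doubles the term, so growth beyond n = 30 quickly becomes prohibitive.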
Affiliation(s)
- Robert G Cowell
- Faculty of Actuarial Science and Insurance, Cass Business School, 106 Bunhill Row, London EC1Y 8TZ, UK.
|
36
|
Abstract
Background Gene interactions play a central role in transcriptional networks. Many studies have performed genome-wide expression analysis to reconstruct regulatory networks to investigate disease processes. Since biological processes are outcomes of regulatory gene interactions, this paper develops a system biology approach to infer function-dependent transcriptional networks modulating phenotypic traits, which serve as a classifier to identify tissue states. Because gene interactions are taken into account in the analysis, we can achieve higher classification accuracy than existing methods. Results Our system biology approach is carried out within the Bayesian networks framework. The algorithm consists of two steps: gene filtering by Bayes factor followed by collinearity elimination via network learning. We validate our approach with two clinical data sets. In the study of lung cancer subtype discrimination, we obtain a 25-gene classifier from 111 training samples, and the test on 422 independent samples achieves 95% classification accuracy. In the study of thoracic aortic aneurysm (TAA) diagnosis, 61 samples determine a 34-gene classifier, whose diagnostic accuracy on 33 independent samples reaches 82%. The performance comparisons with three other popular methods, PCA/LDA, PAM, and Weighted Voting, confirm that our approach yields superior classification accuracy and a more compact signature. Conclusions The system biology approach presented in this paper is able to infer function-dependent transcriptional networks, which in turn can classify biological samples with high accuracy. The validation of our classifier using clinical data demonstrates the promising value of our proposed approach for disease diagnosis.
Affiliation(s)
- Hsun-Hsien Chang
- Children's Hospital Informatics Program, Harvard-MIT Division of Health Sciences and Technology, Harvard Medical School, Boston, Massachusetts, USA.
|
37
|
Driss A, Asare K, Hibbert J, Gee B, Adamkiewicz T, Stiles J. Sickle Cell Disease in the Post Genomic Era: A Monogenic Disease with a Polygenic Phenotype. GENOMICS INSIGHTS 2009. [DOI: 10.4137/gei.s2626] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.4] [Reference Citation Analysis] [Abstract] [Track Full Text] [Subscribe] [Scholar Register] [Indexed: 11/05/2022]
Abstract
More than half a century after the discovery of the molecular basis of Sickle Cell Disease (SCD), the causes of the phenotypic heterogeneity of the disease remain unclear. This heterogeneity manifests with different clinical outcomes such as stroke, vaso-occlusive episodes, acute chest syndrome, avascular necrosis, leg ulcers, priapism and retinopathy. These outcomes cannot be explained by the single mutation in the beta-globin gene alone but may be attributed to genetic modifiers and environmental effects. Recent advances in the post human genome sequence era have opened the door for the identification of novel genetic modifiers in SCD. Studies are showing that phenotypes of SCD seem to be modulated by polymorphisms in genes that are involved in inflammation, cell–cell interaction and modulators of oxidant injury and nitric oxide biology. The discovery of genes implicated in different phenotypes will help understanding of the physiopathology of the disease and aid in establishing targeted cures. However, caution is needed in asserting that genetic modifiers are the cause of all SCD phenotypes, because there are other factors such as genetic background of the population, environmental components, socio-economics and psychology that can play significant roles in the clinical heterogeneity.
Affiliation(s)
- A. Driss
- Department of Microbiology, Biochemistry and Immunology, Morehouse School of Medicine, Atlanta, Georgia, USA
| | - K.O. Asare
- Department of Microbiology, Biochemistry and Immunology, Morehouse School of Medicine, Atlanta, Georgia, USA
| | - J.M. Hibbert
- Department of Microbiology, Biochemistry and Immunology, Morehouse School of Medicine, Atlanta, Georgia, USA
| | - B.E. Gee
- Department of Clinical Pediatrics, Morehouse School of Medicine, Atlanta, Georgia, USA
| | - T.V. Adamkiewicz
- Department of Family Medicine, Morehouse School of Medicine, Atlanta, Georgia, USA
| | - J.K. Stiles
- Department of Microbiology, Biochemistry and Immunology, Morehouse School of Medicine, Atlanta, Georgia, USA
|
38
|
Ramoni RB, Saccone NL, Hatsukami DK, Bierut LJ, Ramoni MF. A testable prognostic model of nicotine dependence. J Neurogenet 2009; 23:283-92. [PMID: 19184766 DOI: 10.1080/01677060802572911] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 02/05/2023]
Abstract
Individuals' dependence on nicotine, primarily through cigarette smoking, is a major source of morbidity and mortality worldwide. Many smokers attempt but fail to quit smoking, motivating researchers to identify the origins of this dependence. Because of the known heritability of nicotine-dependence phenotypes, considerable interest has been focused on discovering the genetic factors underpinning the trait. This goal, however, is not easily attained: no single factor is likely to explain any great proportion of dependence because nicotine dependence is thought to be a complex trait (i.e., the result of many interacting factors). Genomewide association studies are powerful tools in the search for the genomic bases of complex traits, and in this context, novel candidate genes have been identified through single nucleotide polymorphism (SNP) association analyses. Beyond association, however, genetic data can be used to generate predictive models of nicotine dependence. As expected in the context of a complex trait, individual SNPs fail to accurately predict nicotine dependence, demanding the use of multivariate models. Standard approaches, such as logistic regression, are unable to consider large numbers of SNPs given existing sample sizes. However, using Bayesian networks, one can overcome these limitations to generate a multivariate predictive model, which has markedly enhanced predictive accuracy on fitted values relative to that of individual SNPs. This approach, combined with the data being generated by genomewide association studies, promises to shed new light on the common, complex trait nicotine dependence.
Affiliation(s)
- Rachel Badovinac Ramoni
- Department of Developmental Biology, Harvard School of Dental Medicine, Boston, Massachusetts, USA
|
39
|
|
40
|
Albers CA, Stankovich J, Thomson R, Bahlo M, Kappen HJ. Multipoint approximations of identity-by-descent probabilities for accurate linkage analysis of distantly related individuals. Am J Hum Genet 2008; 82:607-22. [PMID: 18319071 PMCID: PMC2427226 DOI: 10.1016/j.ajhg.2007.12.016] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Received: 08/15/2007] [Revised: 10/22/2007] [Accepted: 12/11/2007] [Indexed: 12/22/2022] Open
Abstract
We propose an analytical approximation method for the estimation of multipoint identity by descent (IBD) probabilities in pedigrees containing a moderate number of distantly related individuals. We show that in large pedigrees where cases are related through untyped ancestors only, it is possible to formulate the hidden Markov model of the Lander-Green algorithm in terms of the IBD configurations of the cases. We use a first-order Markov approximation to model the changes in this IBD-configuration variable along the chromosome. In simulated and real data sets, we demonstrate that estimates of parametric and nonparametric linkage statistics based on the first-order Markov approximation are accurate. The computation time is exponential in the number of cases instead of in the number of meioses separating the cases. We have implemented our approach in the computer program ALADIN (accurate linkage analysis of distantly related individuals). ALADIN can be applied to general pedigrees and marker types and has the ability to model marker-marker linkage disequilibrium with a clustered-markers approach. Using ALADIN is straightforward: It requires no parameters to be specified and accepts standard input files.
Affiliation(s)
- Cornelis A Albers
- Department of Biophysics, Institute for Computing and Information Sciences, Radboud University, 6525 EZ Nijmegen, The Netherlands.
|
41
|
|
42
|
Thomas A. Towards linkage analysis with markers in linkage disequilibrium by graphical modelling. Hum Hered 2007; 64:16-26. [PMID: 17483593 DOI: 10.1159/000101419] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.7] [Indexed: 11/19/2022] Open
Abstract
We review recent developments of MCMC integration methods for computations on graphical models for two applications in statistical genetics: modelling allelic association and pedigree based linkage analysis. We discuss and illustrate estimation of graphical models from haploid and diploid genotypes, and the importance of MCMC updating schemes beyond what is strictly necessary for irreducibility. We then outline an approach combining these methods to compute linkage statistics when alleles at the marker loci are in linkage disequilibrium. Other extensions suitable for analysis of SNP genotype data in pedigrees are also discussed and programs that implement these methods, and which are available from the author's web site, are described. We conclude with a discussion of how this still experimental approach might be further developed.
Affiliation(s)
- Alun Thomas
- Department of Biomedical Informatics, Genetic Epidemiology, University of Utah, Salt Lake City, Utah 84108, USA.
|
43
|
Funke BH, Brown AC, Ramoni MF, Regan ME, Baglieri C, Finn CT, Babcock M, Shprintzen RJ, Morrow BE, Kucherlapati R. A Novel, Single Nucleotide Polymorphism-Based Assay to Detect 22q11 Deletions. Genet Test 2007; 11:91-100. [PMID: 17394398 DOI: 10.1089/gte.2006.0507] [Citation(s) in RCA: 5] [Impact Index Per Article: 0.3] [Indexed: 12/06/2022]
Abstract
Velocardiofacial syndrome, DiGeorge syndrome, and conotruncal anomaly face syndrome, now collectively referred to as 22q11 deletion syndrome (22q11DS), are caused by microdeletions on chromosome 22q11. The great majority (approximately 90%) of these deletions are 3 Mb in size. The remaining patients with deletions have nested breakpoints resulting in overlapping regions of hemizygosity. Diagnostic testing for the disorder is traditionally done by fluorescent in situ hybridization (FISH) using probes located in the proximal half of the region common to all deletions. We developed a novel, high-resolution single-nucleotide polymorphism (SNP) genotyping assay to detect 22q11 deletions. We validated this assay using DNA from 110 nondeleted controls and 77 patients with 22q11DS who had previously been tested by FISH. The assay was 100% sensitive (all deletions were correctly identified). Our assay was also able to detect a case of segmental uniparental disomy at 22q11 that was not detected by the FISH assay. We used Bayesian networks to identify a set of 17 SNPs that are sufficient to ascertain unambiguously the deletion status of 22q11DS patients. Our SNP-based assay is a highly accurate, sensitive, and specific method for the diagnosis of 22q11 deletion syndrome.
Affiliation(s)
- Birgit H Funke
- Harvard Medical School-Partners Healthcare Center for Genetics and Genomics, Cambridge, MA 02139, USA.
|
44
|
Katki HA. Incorporating medical interventions into carrier probability estimation for genetic counseling. BMC Med Genet 2007; 8:13. [PMID: 17378937 PMCID: PMC1847675 DOI: 10.1186/1471-2350-8-13] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Received: 08/30/2006] [Accepted: 03/22/2007] [Indexed: 01/07/2023]
Abstract
Background: Mendelian models for predicting who may carry an inherited deleterious mutation of known disease genes based on family history are used in a variety of clinical and research activities. People presenting for genetic counseling are increasingly reporting risk-reducing medical interventions in their family histories because, recently, a slew of prophylactic interventions have become available for certain diseases. For example, oophorectomy reduces risk of breast and ovarian cancers, and is now increasingly being offered to women with family histories of breast and ovarian cancer. Mendelian models should account for medical interventions because interventions modify mutation penetrances and thus affect the carrier probability estimate.
Methods: We extend Mendelian models to account for medical interventions by accounting for post-intervention disease history through an extra factor that can be estimated from published studies of the effects of interventions. We apply our methods to incorporate oophorectomy into the BRCAPRO model, which predicts a woman's risk of carrying mutations in BRCA1 and BRCA2 based on her family history of breast and ovarian cancer. This new BRCAPRO is available for clinical use.
Results: We show that accounting for interventions undergone by family members can seriously affect the mutation carrier probability estimate, especially if the family member has lived many years post-intervention. We show that interventions have more impact on the carrier probability as the benefits of intervention differ more between carriers and non-carriers.
Conclusion: These findings imply that carrier probability estimates that do not account for medical interventions may be seriously misleading and could affect a clinician's recommendation about offering genetic testing. The BayesMendel software, which allows one to implement any Mendelian carrier probability model, has been extended to allow medical interventions, so future Mendelian models can easily account for interventions.
Affiliation(s)
- Hormuzd A Katki
- Division of Cancer Epidemiology and Genetics, National Cancer Institute, NIH, DHHS, 6120 Executive Blvd, Room 8044 Rockville, MD 20852, USA.
|
45
|
Sheehan NA, Egeland T. Structured Incorporation of Prior Information in Relationship Identification Problems. Ann Hum Genet 2007; 71:501-18. [PMID: 17233753 DOI: 10.1111/j.1469-1809.2006.00345.x] [Citation(s) in RCA: 16] [Impact Index Per Article: 0.9] [Indexed: 11/25/2022]
Abstract
The objective of this paper is to show how various sources of information can be modelled and integrated to address relationship identification problems. Applications come from areas as diverse as evolution and conservation research, genealogical research in human, plant and animal populations, and forensic problems including paternity cases, identification following disasters, family reunions and immigration issues. We propose assigning a prior probability distribution to the sample space of pedigrees, calculating the likelihood based on DNA data using available software and posterior probabilities using Bayes' Theorem. Our emphasis here is on the modelling of this prior information in a formal and consistent manner. We introduce the distinction between local and global prior information, whereby local information usually applies to particular components of the pedigree and global prior information refers to more general features. When it is difficult to decide on a prior distribution, robustness to various choices should be studied. When suitable prior information is not available, a flat prior can be used which will then correspond to a strict likelihood approach. In practice, prior information is often considered for these problems, but in a generally ad hoc manner. This paper offers a consistent alternative. We emphasise that many practical problems can be addressed using freely available software.
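The prior-times-likelihood step described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' software: the function name `pedigree_posterior`, the three candidate relationships, and all numbers are hypothetical, and the log-likelihoods are assumed to have been computed beforehand by some external program.

```python
from math import exp, log


def pedigree_posterior(log_likelihoods, priors):
    """Posterior probability of each candidate pedigree via Bayes' theorem.

    log_likelihoods: dict mapping pedigree name -> log P(DNA data | pedigree)
    priors:          dict mapping pedigree name -> prior probability
    (Hypothetical helper for illustration only.)
    """
    if set(log_likelihoods) != set(priors):
        raise ValueError("likelihoods and priors must cover the same pedigrees")
    # Work in log space; pedigrees with zero prior get zero posterior.
    log_joint = {
        name: log_likelihoods[name] + log(priors[name])
        for name in priors
        if priors[name] > 0
    }
    peak = max(log_joint.values())  # subtract the maximum to avoid underflow
    weights = {name: exp(value - peak) for name, value in log_joint.items()}
    total = sum(weights.values())
    return {name: weights.get(name, 0.0) / total for name in priors}


# Made-up log-likelihoods for three candidate relationships.
log_lik = {"parent-child": -120.0, "full siblings": -121.0, "unrelated": -135.0}

# A flat prior reduces the comparison to a pure likelihood comparison.
flat_prior = {name: 1.0 / len(log_lik) for name in log_lik}
# An assumed informative prior favouring a sibling relationship.
informative_prior = {"parent-child": 0.1, "full siblings": 0.8, "unrelated": 0.1}

post_flat = pedigree_posterior(log_lik, flat_prior)
post_informative = pedigree_posterior(log_lik, informative_prior)
```

With these illustrative inputs, the flat prior leaves the ranking to the likelihood alone, while the informative prior is strong enough to change which pedigree is most probable. Comparing the two outputs is one simple way to examine robustness to the choice of prior.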
Affiliation(s)
- N A Sheehan
- Department of Health Sciences, University of Leicester, University Road, Leicester LE1 7RH, UK.
|
46
|
Wermuth N. Statistics for Studies of Human Welfare. Int Stat Rev 2007. [DOI: 10.1111/j.1751-5823.2005.tb00289.x] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/27/2022]
|
47
|
A simple method to approximate gene content in large pedigree populations: application to the myostatin gene in dual-purpose Belgian Blue cattle. Animal 2007; 1:21-8. [DOI: 10.1017/s1751731107392628] [Citation(s) in RCA: 106] [Impact Index Per Article: 6.2] [Indexed: 11/06/2022] Open
|
48
|
Abstract
People with familial history of disease often consult with genetic counselors about their chance of carrying mutations that increase disease risk. To aid them, genetic counselors use Mendelian models that predict whether the person carries deleterious mutations based on their reported family history. Such models rely on accurate reporting of each member's diagnosis and age of diagnosis, but this information may be inaccurate. Commonly encountered errors in family history can significantly distort predictions, and thus can alter the clinical management of people undergoing counseling, screening, or genetic testing. We derive general results about the distortion in the carrier probability estimate caused by misreported diagnoses in relatives. We show that the Bayes factor that channels all family history information has a convenient and intuitive interpretation. We focus on the ratio of the carrier odds given correct diagnosis versus given misreported diagnosis to measure the impact of errors. We derive the general form of this ratio and approximate it in realistic cases. Misreported age of diagnosis usually causes less distortion than misreported diagnosis. This is the first systematic quantitative assessment of the effect of misreported family history on mutation prediction. We apply the results to the BRCAPRO model, which predicts the risk of carrying a mutation in the breast and ovarian cancer genes BRCA1 and BRCA2.
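The odds-ratio measure described in this abstract can be illustrated with a short Python sketch. It assumes only the standard relation that posterior carrier odds equal prior odds times a Bayes factor summarizing the family history; the function names and all numerical values are hypothetical and are not taken from BRCAPRO or the paper.

```python
def carrier_probability(prior_prob, bayes_factor):
    """Posterior carrier probability from a prior probability and a Bayes factor.

    Posterior odds = prior odds * Bayes factor, where the Bayes factor is
    assumed to summarize the evidence from the reported family history.
    """
    prior_odds = prior_prob / (1.0 - prior_prob)
    posterior_odds = prior_odds * bayes_factor
    return posterior_odds / (1.0 + posterior_odds)


def odds_distortion(bf_correct, bf_misreported):
    """Ratio of carrier odds given the correct versus the misreported history.

    With the same prior odds in both cases, the prior cancels and the ratio
    reduces to the ratio of the two Bayes factors.
    """
    return bf_correct / bf_misreported


# Made-up inputs: a 1% prior carrier probability, and Bayes factors of 20
# (correct family history) and 5 (history with a misreported diagnosis).
prior = 0.01
p_correct = carrier_probability(prior, 20.0)
p_misreported = carrier_probability(prior, 5.0)
distortion = odds_distortion(20.0, 5.0)
```

In this made-up case the misreport lowers the carrier odds by a factor of four, which moves the carrier probability from roughly 17% to roughly 5%.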
Affiliation(s)
- Hormuzd A Katki
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, Maryland 21205, USA.
|
49
|
Pirinen M, Gasbarra D. Finding consistent gene transmission patterns on large and complex pedigrees. IEEE/ACM Trans Comput Biol Bioinform 2006; 3:252-62. [PMID: 17048463 DOI: 10.1109/tcbb.2006.36] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 05/12/2023]
Abstract
A heuristic algorithm for finding gene transmission patterns on large and complex pedigrees with partially observed genotype data is proposed. The method can be used to generate an initial point for a Markov chain Monte Carlo simulation or to check that the given pedigree and the genotype data are consistent. In small pedigrees, the algorithm is exact because it exhaustively enumerates all possibilities, but in large pedigrees with a considerable amount of unknown data, only a subset of promising configurations can actually be checked. For that purpose, the configurations are ordered by combining the approximate conditional probability distribution of the unknown genotypes with the information on the relationships between individuals. We also introduce a way to divide the task into subparts, which has been shown to be useful in large pedigrees. The algorithm has been implemented in a program called APE (Allelic Path Explorer) and tested in three different settings with good results.
Affiliation(s)
- Matti Pirinen
- Department of Mathematics and Statistics, PO Box 68, University of Helsinki, Finland.
|
50
|
|