1
Burris C, Nikolaev A, Zhong S, Bian L. Network effects in influenza spread: The impact of mobility and socio-economic factors. SOCIO-ECONOMIC PLANNING SCIENCES 2021; 78:101081. [PMID: 35812715 PMCID: PMC9264374 DOI: 10.1016/j.seps.2021.101081] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Indexed: 06/01/2023]
Abstract
This paper introduces new methods of modeling and analyzing social networks that emerge in the context of disease spread. Four methods of constructing informative networks are presented, two of which use static data and two of which use temporal data, namely individual citizen mobility observations taken over an extensive period of time. We show how the built networks can be analyzed, and how the numerical results can be interpreted, using network permutation-based surprise analysis. In doing so, we explain the relationship of surprise analysis with conventional network hypothesis testing and Quadratic Assignment Procedure regression. Surprise analysis is more comprehensive and can be performed without limitation on any form of network subgraph, including those with multiple nodal attributes, weighted links, and temporal features. To illustrate our methodological work in application, we use these methods to interpret networks constructed from data collected over one year in an observational study in Buffalo and Erie counties in New York state during the 2016-2017 influenza season. Even with the limitations in the data size, our methods are able to reveal the global (city- and season-wide) patterns in the spread of influenza, taking into account population mobility and socio-economic factors.
Affiliation(s)
- Courtney Burris
- Department of Industrial and Systems Engineering, University at Buffalo, USA
- Alexander Nikolaev
- Department of Industrial and Systems Engineering, University at Buffalo, USA
- Shiran Zhong
- Department of Geography, University at Buffalo, USA
- Ling Bian
- Department of Geography, University at Buffalo, USA
2
Li Q, Schwender H, Louis TA, Fallin MD, Ruczinski I. Efficient simulation of epistatic interactions in case-parent trios. Hum Hered 2013; 75:12-22. [PMID: 23548797 DOI: 10.1159/000348789] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 07/24/2012] [Accepted: 02/11/2013] [Indexed: 12/26/2022] Open
Abstract
Statistical approaches to evaluate interactions between single nucleotide polymorphisms (SNPs) and SNP-environment interactions are of great importance in genetic association studies, as susceptibility to complex disease might be related to the interaction of multiple SNPs and/or environmental factors. With these methods under active development, algorithms to simulate genomic data sets are needed to ensure proper type I error control of newly proposed methods and to compare power with existing methods. In this paper we propose an efficient method for a haplotype-based simulation of case-parent trios when the disease risk is thought to depend on possibly higher-order epistatic interactions or gene-environment interactions with binary exposures.
Affiliation(s)
- Qing Li
- Statistical Genetics Section, National Human Genome Research Institute, National Institutes of Health, Baltimore, MD, USA
3
Lane HY, Tsai GE, Lin E. Assessing Gene-Gene Interactions in Pharmacogenomics. Mol Diagn Ther 2012; 16:15-27. [DOI: 10.1007/bf03256426] [Citation(s) in RCA: 37] [Impact Index Per Article: 3.1] [Indexed: 01/08/2023]
4
Bridges M, Heron EA, O'Dushlaine C, Segurado R, Morris D, Corvin A, Gill M, Pinto C. Genetic classification of populations using supervised learning. PLoS One 2011; 6:e14802. [PMID: 21589856 PMCID: PMC3093382 DOI: 10.1371/journal.pone.0014802] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.2] [Received: 04/13/2010] [Accepted: 12/01/2010] [Indexed: 11/18/2022] Open
Abstract
There are many instances in genetics in which we wish to determine whether two
candidate populations are distinguishable on the basis of their genetic
structure. Examples include populations which are geographically separated,
case–control studies and quality control (when participants in a study
have been genotyped at different laboratories). This latter application is of
particular importance in the era of large scale genome wide association studies,
when collections of individuals genotyped at different locations are being
merged to provide increased power. The traditional method for detecting
structure within a population is some form of exploratory technique such as
principal components analysis. Such methods, which do not utilise our prior
knowledge of the membership of the candidate populations, are termed
unsupervised. Supervised methods, on the other hand, are
able to utilise this prior knowledge when it is available. In this paper we demonstrate that in such cases modern supervised approaches are
a more appropriate tool for detecting genetic differences between populations.
We apply two such methods (neural networks and support vector machines) to the
classification of three populations (two from Scotland and one from Bulgaria).
The sensitivity exhibited by both these methods is considerably higher than that
attained by principal components analysis and in fact comfortably exceeds a
recently conjectured theoretical limit on the sensitivity of unsupervised
methods. In particular, our methods can distinguish between the two Scottish
populations, where principal components analysis cannot. We suggest, on the
basis of our results, that a supervised learning approach should be the method of
choice when classifying individuals into pre-defined populations, particularly
in quality control for large scale genome wide association studies.
Affiliation(s)
- Michael Bridges
- Astrophysics Group, Cavendish Laboratory, Cambridge, United Kingdom
- Elizabeth A. Heron
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Colm O'Dushlaine
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Ricardo Segurado
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Derek Morris
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Aiden Corvin
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Michael Gill
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
- Carlos Pinto
- Neuropsychiatric Genetics Research Group, Department of Psychiatry, Trinity College, Dublin, Ireland
5
Schwender H, Bowers K, Fallin MD, Ruczinski I. Importance measures for epistatic interactions in case-parent trios. Ann Hum Genet 2010; 75:122-32. [PMID: 21118192 DOI: 10.1111/j.1469-1809.2010.00623.x] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.9] [Indexed: 12/20/2022]
Abstract
Ensemble methods (such as Bagging and Random Forests) take advantage of unstable base learners (such as decision trees) to improve predictions, and offer measures of variable importance useful for variable selection. LogicFS has been proposed as such an ensemble learner for case-control studies when interactions of single nucleotide polymorphisms (SNPs) are of particular interest. LogicFS uses bootstrap samples of the data and employs the Boolean trees derived in logic regression as base learners to create ensembles of models that allow for the quantification of the contributions of epistatic interactions to the disease risk. In this article, we propose an extension of logicFS suitable for case-parent trio data, and derive an additional importance measure that is much less influenced by linkage disequilibrium between SNPs than the measure originally used in logicFS. We illustrate the performance of the novel procedure in simulation studies and in a case study of 461 case-parent trios with autistic children.
Affiliation(s)
- Holger Schwender
- Department of Biostatistics, Johns Hopkins Bloomberg School of Public Health, Baltimore, MD 21218, USA
6
A multifactorial analysis of obesity as CVD risk factor: use of neural network based methods in a nutrigenetics context. BMC Bioinformatics 2010; 11:453. [PMID: 20825661 PMCID: PMC2941694 DOI: 10.1186/1471-2105-11-453] [Citation(s) in RCA: 20] [Impact Index Per Article: 1.4] [Received: 12/15/2009] [Accepted: 09/08/2010] [Indexed: 01/17/2023] Open
Abstract
Background: Obesity is a multifactorial trait that constitutes an independent risk factor for cardiovascular disease (CVD). The aim of the current work is to study the complex etiology underlying obesity and to identify genetic variations and/or nutrition-related factors that contribute to its variability. To this end, a set of more than 2300 white subjects who participated in a nutrigenetics study was used. For each subject a total of 63 factors were measured, describing genetic variants related to CVD (24 in total), gender, and nutrition (38 in total), e.g. average daily intake of calories and cholesterol. Each subject was categorized according to body mass index (BMI) as normal (BMI ≤ 25) or overweight (BMI > 25). Two artificial neural network (ANN) based methods were designed and used to analyze the available data: i) a multi-layer feed-forward ANN combined with a parameter decreasing method (PDM-ANN), and ii) a multi-layer feed-forward ANN trained by a hybrid method (GA-ANN) which combines genetic algorithms and the popular back-propagation training algorithm. Results: PDM-ANN and GA-ANN were comparatively assessed in terms of their ability to identify, among the initial 63 variables describing genetic variations, nutrition and gender, the most important factors able to classify a subject into one of the BMI-related classes: normal and overweight. The methods were designed and evaluated using training and testing sets provided by 3-fold Cross Validation (3-CV) resampling. Classification accuracy, sensitivity, specificity and area under the receiver operating characteristic curve were used to evaluate the resulting predictive ANN models. The most parsimonious set of factors was obtained by the GA-ANN method and included gender, six genetic variations and 18 nutrition-related variables. The corresponding predictive model was characterized by a mean accuracy of 61.46% in the 3-CV testing sets.
Conclusions: The ANN based methods revealed factors that interactively contribute to the obesity trait and provided predictive models with a promising generalization ability. In general, results showed that ANNs and their hybrids can provide useful tools for the study of complex traits in the context of nutrigenetics.
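The evaluation scheme named in this abstract (3-fold cross validation scored by accuracy, sensitivity and specificity) can be sketched as follows. This is only an illustration under stated assumptions: the data are synthetic, the single `score` feature is hypothetical, and a midpoint threshold classifier stands in for the ANN models, which the abstract does not specify in enough detail to reproduce.

```python
import random


def three_fold_indices(n, seed=0):
    """Shuffle indices 0..n-1 and deal them into three disjoint folds."""
    order = list(range(n))
    random.Random(seed).shuffle(order)
    return [order[k::3] for k in range(3)]


def classification_metrics(y_true, y_pred):
    """Return (accuracy, sensitivity, specificity) for binary 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    accuracy = (tp + tn) / len(y_true)
    sensitivity = tp / (tp + fn) if tp + fn else float("nan")
    specificity = tn / (tn + fp) if tn + fp else float("nan")
    return accuracy, sensitivity, specificity


def fit_threshold(xs, ys):
    """Stand-in model (NOT an ANN): midpoint between the two class means."""
    pos = [x for x, y in zip(xs, ys) if y == 1]
    neg = [x for x, y in zip(xs, ys) if y == 0]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


# Synthetic data (assumption): 1 = overweight, 0 = normal, one noisy feature.
rng = random.Random(42)
labels = [rng.randint(0, 1) for _ in range(90)]
score = [rng.gauss(1.0 if y else 0.0, 1.0) for y in labels]

folds = three_fold_indices(len(labels))
fold_results = []
for k, test_idx in enumerate(folds):
    # Train on the two folds that are not held out.
    train_idx = [i for j, fold in enumerate(folds) if j != k for i in fold]
    threshold = fit_threshold([score[i] for i in train_idx],
                              [labels[i] for i in train_idx])
    predictions = [1 if score[i] > threshold else 0 for i in test_idx]
    fold_results.append(
        classification_metrics([labels[i] for i in test_idx], predictions))

mean_accuracy = sum(r[0] for r in fold_results) / len(fold_results)
print("mean accuracy over the 3 test folds:", round(mean_accuracy, 3))
```

Averaging the per-fold test accuracy is what a figure such as "mean accuracy in the 3-CV testing sets" refers to; each subject is scored exactly once, by a model that never saw it during training.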
7
Curtis D, Vine AE, Knight J. A simple method for assessing the strength of evidence for association at the level of the whole gene. Adv Appl Bioinform Chem 2008; 1:115-20. [PMID: 21918610 PMCID: PMC3169937 DOI: 10.2147/aabc.s4095] [Citation(s) in RCA: 12] [Impact Index Per Article: 0.8] [Indexed: 11/23/2022] Open
Abstract
Introduction: It is expected that different markers may show different patterns of association with different pathogenic variants within a given gene. It would be helpful to combine the evidence implicating association at the level of the whole gene rather than just for individual markers or haplotypes. Doing this is complicated by the fact that different markers do not represent independent sources of information. Method: We propose combining the p values from all single locus and/or multilocus analyses of different markers according to the formula of Fisher, X = ∑_i (−2 ln p_i), and then assessing the empirical significance of this statistic using permutation testing. We present an example application to 19 markers around the HTRA2 gene in a case-control study of Parkinson’s disease. Results: Applying our approach shows that, although some individual tests produce low p values, overall association at the level of the gene is not supported. Discussion: Approaches such as this should be more widely used in assimilating the overall evidence supporting involvement of a gene in a particular disease. Information can be combined from biallelic and multiallelic markers and from single markers along with multimarker analyses. Single genes can be tested or results from groups of genes involved in the same pathway could be combined in order to test biologically relevant hypotheses. The approach has been implemented in a computer program called COMBASSOC which is made available for downloading.
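The combination step described in this abstract, Fisher's statistic followed by a permutation test, can be sketched as below. This is a minimal illustration and not the COMBASSOC program: the genotype data are synthetic, and the single-marker test (`marker_p_value`, a two-sided z-test on mean allele count) is a hypothetical stand-in for whichever single locus or multilocus tests are actually used. Permuting the case/control labels while leaving each subject's genotypes intact keeps the correlation between markers, which is why the empirical p value remains valid even though the markers are not independent.

```python
import math
import random


def fisher_statistic(p_values):
    """Fisher's combined statistic X = sum over i of -2 ln(p_i)."""
    return sum(-2.0 * math.log(p) for p in p_values)


def marker_p_value(genotypes, labels):
    """Stand-in single-marker test: two-sided z-test on mean allele count."""
    cases = [g for g, y in zip(genotypes, labels) if y == 1]
    controls = [g for g, y in zip(genotypes, labels) if y == 0]
    diff = sum(cases) / len(cases) - sum(controls) / len(controls)
    mean = sum(genotypes) / len(genotypes)
    var = sum((g - mean) ** 2 for g in genotypes) / (len(genotypes) - 1)
    se = math.sqrt(var * (1 / len(cases) + 1 / len(controls)))
    if se == 0:
        return 1.0
    # Guard against a p value of exactly 0, which would break the logarithm.
    return max(math.erfc(abs(diff / se) / math.sqrt(2)), 1e-300)


def gene_level_p(markers, labels, n_perm=999, seed=0):
    """Return (observed X, empirical p value) from label permutations."""
    rng = random.Random(seed)
    observed = fisher_statistic([marker_p_value(m, labels) for m in markers])
    shuffled = list(labels)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permute labels, genotypes stay with subjects
        stat = fisher_statistic(
            [marker_p_value(m, shuffled) for m in markers])
        if stat >= observed:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Synthetic example (assumption): 50 cases, 50 controls, three markers,
# the second deliberately correlated with the first.
rng = random.Random(1)
labels = [1] * 50 + [0] * 50
base = [rng.choice([0, 1, 2]) for _ in labels]
correlated = [min(2, max(0, g + rng.choice([-1, 0, 0, 1]))) for g in base]
independent = [rng.choice([0, 1, 2]) for _ in labels]

observed, p_emp = gene_level_p([base, correlated, independent], labels,
                               n_perm=199)
print("Fisher X:", round(observed, 3), "empirical p:", p_emp)
```

The `(hits + 1) / (n_perm + 1)` form counts the observed data as one of the permutations, so the empirical p value can never be reported as exactly zero.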
Affiliation(s)
- David Curtis
- Centre for Psychiatry, Queen Mary's School of Medicine and Dentistry, London E1 1BB, UK
8
Motsinger-Reif AA, Dudek SM, Hahn LW, Ritchie MD. Comparison of approaches for machine-learning optimization of neural networks for detecting gene-gene interactions in genetic epidemiology. Genet Epidemiol 2008; 32:325-40. [PMID: 18265411 DOI: 10.1002/gepi.20307] [Citation(s) in RCA: 57] [Impact Index Per Article: 3.6] [Indexed: 01/17/2023]
Abstract
The detection of genotypes that predict common, complex disease is a challenge for human geneticists. The phenomenon of epistasis, or gene-gene interactions, is particularly problematic for traditional statistical techniques. Additionally, the explosion of genetic information makes exhaustive searches of multilocus combinations computationally infeasible. To address these challenges, neural networks (NN), a pattern recognition method, have been used. One limitation of the NN approach is that its success is dependent on the architecture of the network. To solve this, machine-learning approaches have been suggested to evolve the best NN architecture for a particular data set. In this study we provide a detailed technical description of the use of grammatical evolution to optimize neural networks (GENN) for use in genetic association studies. We compare the performance of GENN to that of a previous machine-learning NN application, genetic programming neural networks, in both simulated and real data. We show that GENN greatly outperforms genetic programming neural networks in data sets with a large number of single nucleotide polymorphisms. Additionally, we demonstrate that GENN has high power to detect disease-risk loci in a range of high-order epistatic models. Finally, we demonstrate the scalability of the GENN method with increasing numbers of variables, up to 500,000 single nucleotide polymorphisms.
Affiliation(s)
- Alison A Motsinger-Reif
- Bioinformatics Research Center, North Carolina State University, Raleigh, North Carolina, USA
9
Neural networks for genetic epidemiology: past, present, and future. BioData Min 2008; 1:3. [PMID: 18822147 PMCID: PMC2553772 DOI: 10.1186/1756-0381-1-3] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 02/21/2008] [Accepted: 07/17/2008] [Indexed: 01/17/2023] Open
Abstract
During the past two decades, the field of human genetics has experienced an information explosion. The completion of the human genome project and the development of high throughput SNP technologies have created a wealth of data; however, the analysis and interpretation of these data have created a research bottleneck. While technology facilitates the measurement of hundreds or thousands of genes, statistical and computational methodologies are lacking for the analysis of these data. New statistical methods and variable selection strategies must be explored for identifying disease susceptibility genes for common, complex diseases. Neural networks (NN) are a class of pattern recognition methods that have been successfully implemented for data mining and prediction in a variety of fields. The application of NN for statistical genetics studies is an active area of research. Neural networks have been applied in both linkage and association analysis for the identification of disease susceptibility genes. In the current review, we consider how NN have been used for both linkage and association analyses in genetic epidemiology. We discuss both the successes of these initial NN applications, and the questions that arose during the previous studies. Finally, we introduce evolutionary computing strategies, Genetic Programming Neural Networks (GPNN) and Grammatical Evolution Neural Networks (GENN), for using NN in association studies of complex human diseases that address some of the caveats illuminated by previous work.
10
Penco S, Buscema M, Patrosso MC, Marocchi A, Grossi E. New application of intelligent agents in sporadic amyotrophic lateral sclerosis identifies unexpected specific genetic background. BMC Bioinformatics 2008; 9:254. [PMID: 18513389 PMCID: PMC2443147 DOI: 10.1186/1471-2105-9-254] [Citation(s) in RCA: 16] [Impact Index Per Article: 1.0] [Received: 01/09/2008] [Accepted: 05/30/2008] [Indexed: 02/08/2023] Open
Abstract
BACKGROUND: Few genetic factors predisposing to the sporadic form of amyotrophic lateral sclerosis (ALS) have been identified, but the pathology itself seems to be a true multifactorial disease in which complex interactions between environmental and genetic susceptibility factors take place. The purpose of this study was to approach genetic data with an innovative statistical method, artificial neural networks, to identify a possible genetic background predisposing to the disease. A DNA multiarray panel was applied to genotype more than 60 polymorphisms within 35 genes selected from pathways of lipid and homocysteine metabolism, regulation of blood pressure, coagulation, inflammation, cellular adhesion and matrix integrity, in 54 sporadic ALS patients and 208 controls. Advanced intelligent systems based on novel coupling of artificial neural networks and evolutionary algorithms have been applied. The results obtained have been compared with those derived from the use of standard neural networks and classical statistical analysis. RESULTS: Analysis of the DNA multiarray panel data with advanced artificial neural networks unexpectedly revealed a strong genetic background in sporadic ALS. The predictive accuracy obtained with Linear Discriminant Analysis and Standard Artificial Neural Networks ranged from 70% to 79% (average 75.31%) and from 69.1% to 86.2% (average 76.6%) respectively. The corresponding value obtained with Advanced Intelligent Systems reached an average of 96.0% (range 94.4% to 97.6%).
This latter approach allowed the identification of seven genetic variants essential to differentiate cases from controls: apolipoprotein E arg158cys; hepatic lipase -480 C/T; endothelial nitric oxide synthase 690 C/T and glu298asp; vitamin K-dependent coagulation factor seven arg353glu; glycoprotein Ia/IIa 873 G/A; and E-selectin ser128arg. CONCLUSION: This study provides an alternative and reliable method to approach complex diseases. Indeed, the application of a novel artificial intelligence-based method offers a new insight into genetic markers of sporadic ALS, pointing out the existence of a strong genetic background.
Affiliation(s)
- Silvana Penco
- Medical Genetics, Clinical Chemistry and Clinical Pathology Laboratory, Niguarda Ca' Granda Hospital P.za Ospedale Maggiore 3, 20100 Milan, Italy
- Maria Cristina Patrosso
- Medical Genetics, Clinical Chemistry and Clinical Pathology Laboratory, Niguarda Ca' Granda Hospital P.za Ospedale Maggiore 3, 20100 Milan, Italy
- Alessandro Marocchi
- Medical Genetics, Clinical Chemistry and Clinical Pathology Laboratory, Niguarda Ca' Granda Hospital P.za Ospedale Maggiore 3, 20100 Milan, Italy
- Enzo Grossi
- Bracco SpA Medical Department Via E. Folli 50, 20134 Milan, Italy
11
Curtis D. Comparison of artificial neural network analysis with other multimarker methods for detecting genetic association. BMC Genet 2007; 8:49. [PMID: 17640352 PMCID: PMC1940019 DOI: 10.1186/1471-2156-8-49] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.8] [Received: 02/21/2007] [Accepted: 07/18/2007] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND Debate remains as to the optimal method for utilising genotype data obtained from multiple markers in case-control association studies. I and colleagues have previously described a method of association analysis using artificial neural networks (ANNs), whose performance compared favourably to single-marker methods. Here, the performance of ANN analysis is compared with other multi-marker methods, comprising different haplotype-based analyses and locus-based analyses. RESULTS Of several methods studied and applied to simulated SNP datasets, heterogeneity testing of estimated haplotype frequencies using asymptotic p values rather than permutation testing had the lowest power of the methods studied and ANN analysis had the highest power. The difference in power to detect association between these two methods was statistically significant (p = 0.001) but other comparisons between methods were not significant. The raw t statistic obtained from ANN analysis correlated highly with the empirical statistical significance obtained from permutation testing of the ANN results and with the p value obtained from the heterogeneity test. CONCLUSION Although ANN analysis was more powerful than the standard haplotype-based test it is unlikely to be taken up widely. The permutation testing necessary to obtain a valid p value makes it slow to perform and it is not underpinned by a theoretical model relating marker genotypes to disease phenotype. Nevertheless, the superior performance of this method does imply that the widely-used haplotype-based methods for detecting association with multiple markers are not optimal and efforts could be made to improve upon them. The fact that the t statistic obtained from ANN analysis is highly correlated with the statistical significance does suggest a possibility to use ANN analysis in situations where large numbers of markers have been genotyped, since the t value could be used as a proxy for the p value in preliminary analyses.
Affiliation(s)
- David Curtis
- Academic Centre for Psychiatry, St Bartholomew's and Royal London School of Medicine and Dentistry, Royal London Hospital, Whitechapel, London, UK.
12
Lin E, Hwang Y, Liang KH, Chen EY. Pattern-recognition techniques with haplotype analysis in pharmacogenomics. Pharmacogenomics 2007; 8:75-83. [PMID: 17187511 DOI: 10.2217/14622416.8.1.75] [Citation(s) in RCA: 31] [Impact Index Per Article: 1.8] [Indexed: 01/17/2023] Open
Abstract
Single nucleotide polymorphisms (SNPs) can be used in clinical association studies to determine the contribution of genes to drug efficacy. However, it would be extremely inefficient to test all the 10 million common SNPs for an association study. Here we review haplotype analysis and pattern-recognition techniques to systematically select candidate SNPs for candidate-gene association studies in pharmacogenomics. First, we survey linkage disequilibrium methods to identify tag SNPs and explore the use of haplotypes as genetic markers that are correlated and associated with drug efficacy. Secondly, we investigate pattern-recognition algorithms and statistical analyses to assess drug efficacy based on SNPs and other factors. Finally, we study pattern-recognition approaches to evaluate the epistasis among genes and SNPs. These techniques may provide tools for clinical association studies and help find genes/SNPs involved in responses to therapeutic drugs or adverse drug reactions.
Affiliation(s)
- Eugene Lin
- Vita Genomics, Inc., Floor 7, Number 6, Section 1, Jung-Shing Road, Wugu Shiang, Taipei, Taiwan.
13
North BV, Sham PC, Knight J, Martin ER, Curtis D. Investigation of the ability of haplotype association and logistic regression to identify associated susceptibility loci. Ann Hum Genet 2006; 70:893-906. [PMID: 17044864 DOI: 10.1111/j.1469-1809.2006.00301.x] [Citation(s) in RCA: 9] [Impact Index Per Article: 0.5] [Indexed: 11/27/2022]
Abstract
While finely spaced markers are increasingly being used in case-control association studies in attempts to identify susceptibility loci, not enough is yet known as to the optimal spacing of such markers, their likely power to detect association, the relative merits of single marker versus multimarker analysis, or which methods of analysis may be optimal. Some investigations of these issues have used markers simulated under different theoretical models of population evolution. However the HapMap project and other sources provide real datasets which can be used to obtain a more realistic view of the performance of these approaches. SNPs around APOE and from two HapMap regions were used to obtain information regarding linkage disequilibrium (LD) relationships between polymorphisms, and these real patterns of LD were used to simulate datasets such as would be obtained in case-control studies were these SNPs to influence susceptibility to disease. The datasets obtained were analysed using tests for heterogeneity of estimated haplotype frequencies and using logistic regression analyses in which only main effects from each marker were considered. All markers surrounding the putative susceptibility locus were analysed, using sets of either 1, 2, 3 or 4 markers at a time. Some markers within 150 kb of the susceptibility locus were able to detect association. At distances less than 100 kb there was no correlation between the distance from the susceptibility locus and the strength of evidence for association. When the average inter-locus spacing is 25 kb many loci would not be detected, while when the spacing is as low as 2 kb one can be fairly confident that at least one marker will be in strong enough LD with the susceptibility locus to enable association to be detected, if the susceptibility locus has a strong enough effect relative to the sample size. 
With an inter-locus spacing of 4 kb some susceptibility loci did not have a marker locus in strong LD, potentially undermining the ability to detect association. There was little difference in the performance of haplotype-based analysis compared with logistic regression considering effects of each marker as separate. Multimarker analysis on occasion produced results which were much more highly significant than single marker analysis, but only very rarely. Our results support the view that if markers are randomly selected then a spacing as low as 2 kb is desirable. Multimarker analysis can sometimes be more powerful than single marker analysis so both should be performed. However, because it is rare for multimarker analysis to be much more highly significant than single marker analysis one should strongly suspect that when such results occur they may be due to mistakes in genotyping or through some other artefact. Haplotype analysis may be more prone to such problems than logistic regression, suggesting that the latter method might be preferred.
Affiliation(s)
- B V North
- Institute of Cancer Research, 15 Cotswold Road, Belmont, Sutton, Surrey SM2 5NG, UK
14
Motsinger AA, Reif DM, Dudek SM, Ritchie MD. Understanding the Evolutionary Process of Grammatical Evolution Neural Networks for Feature Selection in Genetic Epidemiology. PROCEEDINGS OF THE ... IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY : CIBCB. IEEE SYMPOSIUM ON COMPUTATIONAL INTELLIGENCE IN BIOINFORMATICS AND COMPUTATIONAL BIOLOGY 2006; 2006:1-8. [PMID: 20634919 PMCID: PMC2903766 DOI: 10.1109/cibcb.2006.330945] [Citation(s) in RCA: 10] [Impact Index Per Article: 0.6] [Indexed: 01/17/2023]
Abstract
The identification of genetic factors/features that predict complex diseases is an important goal of human genetics. The commonality of gene-gene interactions in the underlying genetic architecture of common diseases presents a daunting analytical challenge. Previously, we introduced a grammatical evolution neural network (GENN) approach that has high power to detect such interactions in the absence of any marginal main effects. While the success of this method is encouraging, it elicits questions regarding the evolutionary process of the algorithm itself and the feasibility of scaling the method to account for the immense dimensionality of datasets with enormous numbers of features. When the features of interest show no main effects, how is GENN able to build correct models? How and when should evolutionary parameters be adjusted according to the scale of a particular dataset? In the current study, we monitor the performance of GENN during its evolutionary process using different population sizes and numbers of generations. We also compare the evolutionary characteristics of GENN to those of a random search neural network strategy to better understand the benefits provided by the evolutionary learning process, including advantages with respect to chromosome size and the representation of functional versus non-functional features within the models generated by the two approaches. Finally, we apply lessons from the characterization of GENN to analyses of datasets containing increasing numbers of features to demonstrate the scalability of the method.
15
Heidema AG, Boer JMA, Nagelkerke N, Mariman ECM, van der A DL, Feskens EJM. The challenge for genetic epidemiologists: how to analyze large numbers of SNPs in relation to complex diseases. BMC Genet 2006; 7:23. [PMID: 16630340 PMCID: PMC1479365 DOI: 10.1186/1471-2156-7-23] [Citation(s) in RCA: 116] [Impact Index Per Article: 6.4] [Received: 01/11/2006] [Accepted: 04/21/2006] [Indexed: 12/31/2022] Open
Abstract
Genetic epidemiologists have taken up the challenge of identifying genetic polymorphisms involved in the development of diseases. Many have collected data on large numbers of genetic markers but are not familiar with available methods to assess their association with complex diseases. Statistical methods have been developed for analyzing the relation between large numbers of genetic and environmental predictors and disease or disease-related variables in genetic association studies. In this commentary we discuss logistic regression analysis; neural networks, including the parameter decreasing method (PDM) and genetic programming optimized neural networks (GPNN); and several non-parametric methods, which include the set association approach, combinatorial partitioning method (CPM), restricted partitioning method (RPM), multifactor dimensionality reduction (MDR) method and the random forests approach. The relative strengths and weaknesses of these methods are highlighted. Logistic regression and neural networks can handle only a limited number of predictor variables, depending on the number of observations in the dataset. Therefore, they are less useful than the non-parametric methods for association studies with large numbers of predictor variables. GPNN, on the other hand, may be a useful approach to select and model important predictors, but its ability to select the important effects in the presence of large numbers of predictors needs to be examined. Both the set association approach and the random forests approach are able to handle a large number of predictors and are useful in reducing these predictors to a subset with an important contribution to disease. The combinatorial methods give more insight into combination patterns for sets of genetic and/or environmental predictor variables that may be related to the outcome variable.
As the non-parametric methods have different strengths and weaknesses, we conclude that, for genetic association studies using the case-control design, applying a combination of several methods, including the set association approach, MDR and the random forests approach, will likely be a useful strategy to find the important genes and interaction patterns involved in complex diseases.
Affiliation(s)
- A Geert Heidema
- Centre for Nutrition and Health, National Institute for Public Health and the Environment, PO Box 1 3720 BA Bilthoven, The Netherlands
- Division of Human Nutrition, Wageningen University and Research Centre, PO Box 8129 6700 EV Wageningen, The Netherlands
| | - Jolanda MA Boer
- Centre for Nutrition and Health, National Institute for Public Health and the Environment, PO Box 1 3720 BA Bilthoven, The Netherlands
| | - Nico Nagelkerke
- Department of Community Medicine, United Arab Emirates University, PO Box 17172 Al Ain, UAE
| | - Edwin CM Mariman
- Functional Genomics, Maastricht University, PO Box 616 6200 MD Maastricht, The Netherlands
| | - Daphne L van der A
- Centre for Nutrition and Health, National Institute for Public Health and the Environment, PO Box 1 3720 BA Bilthoven, The Netherlands
| | - Edith JM Feskens
- Centre for Nutrition and Health, National Institute for Public Health and the Environment, PO Box 1 3720 BA Bilthoven, The Netherlands
- Division of Human Nutrition, Wageningen University and Research Centre, PO Box 8129 6700 EV Wageningen, The Netherlands
|
16
|
Sabbagh A, Darlu P. SNP selection at the NAT2 locus for an accurate prediction of the acetylation phenotype. Genet Med 2006; 8:76-85. [PMID: 16481889 DOI: 10.1097/01.gim.0000200951.54346.d6] [Citation(s) in RCA: 18] [Impact Index Per Article: 1.0] [Indexed: 11/26/2022]
Abstract
PURPOSE Genetic polymorphisms in the N-acetyltransferase 2 gene determine the individual acetylator status, which influences both the toxicity and efficacy profile of acetylated drugs. Determining an individual's acetylation phenotype through DNA-based tests before therapy begins should make it possible to improve therapy response and reduce adverse events. However, due to extensive linkage disequilibrium between markers within NAT2, the genotyping of closely spaced markers yields highly redundant data: testing them all is expensive and often unnecessary. The objective of this study is to establish the optimal strategy to define, in the genetic context of a given ethnic group, the most informative set of single-nucleotide polymorphisms that best enables accurate prediction of acetylation phenotype. METHODS Three classification methods were investigated (classification trees, artificial neural networks and the multifactor dimensionality reduction method) in order to find the optimal set of single-nucleotide polymorphisms enabling the most efficient classification of individuals into rapid and slow acetylators. RESULTS Our results show that, in almost all population samples, only one or two single-nucleotide polymorphisms would be enough to obtain a good predictive capacity with no or only a modest reduction in power relative to direct assays of all common markers. In contrast, in Black African populations, where lower levels of linkage disequilibrium are observed at NAT2, a larger number of single-nucleotide polymorphisms are required to predict acetylation phenotype. CONCLUSION The results of this study will be helpful for the design of time- and cost-effective pharmacogenetic tests (adapted to specific populations) that could be used as routine tools in clinical practice.
Affiliation(s)
- Audrey Sabbagh
- Unité de Recherche en Génétique Epidémiologique et Structure des Populations Humaines, INSERM U535, Villejuif, France
|
17
|
Sabbagh A, Darlu P. Data-Mining Methods as Useful Tools for Predicting Individual Drug Response: Application to CYP2D6 Data. Hum Hered 2006; 62:119-34. [PMID: 17057402 DOI: 10.1159/000096416] [Citation(s) in RCA: 15] [Impact Index Per Article: 0.8] [Received: 06/08/2006] [Accepted: 08/22/2006] [Indexed: 11/19/2022]
Abstract
OBJECTIVES Selecting a maximally informative subset of polymorphisms to predict a clinical outcome, such as drug response, requires appropriate search methods due to the increased dimensionality associated with looking at multiple genotypes. In this study, we investigated the ability of several pattern recognition methods to identify the most informative markers in the CYP2D6 gene for the prediction of CYP2D6 metabolizer status. METHODS Four data-mining tools were explored: decision trees, random forests, artificial neural networks, and the multifactor dimensionality reduction (MDR) method. Marker selection was performed separately in eight population samples of different ethnic origin to evaluate to what extent the most informative markers differ across ethnic groups. RESULTS Our results show that the number of polymorphisms required to predict CYP2D6 metabolic phenotype with high accuracy can be dramatically reduced owing to the strong haplotype block structure observed at CYP2D6. MDR and neural networks provided nearly identical results and performed the best. CONCLUSION Data-mining methods, such as MDR and neural networks, appear to be promising tools for improving the efficiency of genotyping tests in pharmacogenetics, with the ultimate goal of pre-screening patients for individual therapy selection with minimum genotyping effort.
Affiliation(s)
- Audrey Sabbagh
- Unité de Recherche en Génétique Epidémiologique et Structure des Populations Humaines, INSERM U535, Villejuif, France.
|
18
|
Forero DA, Arboleda G, Yunis JJ, Pardo R, Arboleda H. Association study of polymorphisms in LRP1, tau and 5-HTT genes and Alzheimer’s disease in a sample of Colombian patients. J Neural Transm (Vienna) 2005; 113:1253-62. [PMID: 16362633 DOI: 10.1007/s00702-005-0388-z] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.7] [Received: 04/13/2005] [Accepted: 09/10/2005] [Indexed: 10/25/2022]
Abstract
Analysis of genetic susceptibility factors for Alzheimer's disease (AD) in populations with different genetic and environmental backgrounds may be useful to understand AD etiology. There are few genetic association studies of AD in Latin America. In the present work, we analyzed polymorphisms in three candidate genes (the LDL receptor related protein-1, the microtubule-associated protein Tau and the serotonin transporter genes) in a sample of 106 Colombian AD patients and 97 control subjects. We did not find a significant allelic or genotypic association with any of the three polymorphisms analyzed using different statistical analyses, including a neural network model and different sample stratifications. To date, APOE polymorphisms are the only genetic risk factors identified for AD in the Colombian population. It is feasible that a future combination of high-throughput genotyping platforms and multivariate analysis models may lead to the identification of other genetic susceptibility factors for AD in the Colombian population.
Affiliation(s)
- D A Forero
- Grupo de Neurociencias, Facultad de Medicina e Instituto de Genética, Universidad Nacional de Colombia, Bogotá, Colombia
|
19
|
Penco S, Grossi E, Cheng S, Intraligi M, Maurelli G, Patrosso MC, Marocchi A, Buscema M. Assessment of the role of genetic polymorphism in venous thrombosis through artificial neural networks. Ann Hum Genet 2005; 69:693-706. [PMID: 16266408 DOI: 10.1111/j.1529-8817.2005.00206.x] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.2] [Indexed: 02/05/2023]
Abstract
PURPOSE To assess the role of genetic polymorphisms in venous thrombosis events (VTE) using Artificial Neural Networks (ANNs), a model for solving the non-linear problems frequently associated with complex biological systems, which arise from interactions between biological, genetic and environmental factors. METHODS A database was generated from a case-control study of venous thrombosis, using 238 patients and 211 controls. The database of 64 variables included age, gender and a panel of 62 genetic variants. Three different ANNs were compared with logistic regression for accuracy in predicting cases and controls. RESULTS ANNs yielded a better performance than the logistic regression algorithm. Indeed, through ANN models, the 62 variables related to genetic variants were first reduced to a set of 9, and then to a set of 3 (MTHFR 677 C/T, FV arg506gln, ICAM1 gly214arg). CONCLUSIONS The findings of this study illustrate the power of ANNs in evaluating multifactorial data, and show that the different sensitivities of the models of elaboration are related to the characteristics of the data. This may contribute to a better understanding of the role played by genetic polymorphisms in VTE, and help to define, if possible, a test panel of genetic variants to estimate an individual's probability of developing the disease.
Affiliation(s)
- S Penco
- Medical Genetics, Clinical Chemistry and Clinical Pathology Laboratory, Niguarda Ca' Granda Hospital, Piazza Ospedale Maggiore 3, 20100 Milan, Italy.
|
20
|
Abstract
Calpain-10 (CAPN10) is the first diabetes gene to be identified through a genome scan. Many investigators, but not all, have subsequently found associations between CAPN10 polymorphism and type 2 diabetes (T2D) as well as insulin action, insulin secretion, aspects of adipocyte biology and microvascular function. However, this has not always been with the same single nucleotide polymorphism (SNP) or haplotype or the same phenotype, suggesting that there might be more than one disease-associated CAPN10 variant and that these might vary between ethnic groups and the phenotype under study. Our understanding of calpain-10 physiological action has also been greatly augmented by our knowledge of the calpain family domain structure and function, and the relationship between calpain-10 and other calpains is discussed here. Both genetic and functional data indicate that calpain-10 has an important role in insulin resistance and intermediate phenotypes, including those associated with the adipocyte. In this regard, emerging evidence suggests that calpain-10 facilitates GLUT4 translocation and acts in reorganization of the cytoskeleton. Calpain-10 is also an important molecule in the beta-cell. It is likely to be a determinant of fuel sensing and insulin exocytosis, with actions at the mitochondria and plasma membrane respectively. We postulate that the multiple actions of calpain-10 may relate to its different protein isoforms. In conclusion, the discovery of calpain-10 by a genetic approach has identified it as a molecule of importance to insulin signaling and secretion that may have relevance to the future development of novel therapeutic targets for the treatment of T2D.
Affiliation(s)
- Mark D Turner
- Centre for Diabetes and Metabolic Medicine, Institute of Cell and Molecular Science, Barts and The London Queen Mary's School of Medicine and Dentistry, University of London, London, E1 2AT United Kingdom.
|
21
|
Di Luca M, Grossi E, Borroni B, Zimmermann M, Marcello E, Colciaghi F, Gardoni F, Intraligi M, Padovani A, Buscema M. Artificial neural networks allow the use of simultaneous measurements of Alzheimer disease markers for early detection of the disease. J Transl Med 2005; 3:30. [PMID: 16048651 PMCID: PMC1198261 DOI: 10.1186/1479-5876-3-30] [Citation(s) in RCA: 17] [Impact Index Per Article: 0.9] [Received: 05/12/2005] [Accepted: 07/27/2005] [Indexed: 02/08/2023]
Abstract
BACKGROUND Previous studies have shown that in platelets of mild Alzheimer Disease (AD) patients there are alterations of specific APP forms, paralleled by alterations in the expression levels of both ADAM10 and BACE when compared to control subjects. Because the relation between each key element of the beta-amyloid cascade and the target diagnosis is poorly linear, systems able to handle non-linear tasks, like artificial neural networks (ANNs), should offer better discriminating capacity than classical statistics. OBJECTIVE To evaluate the accuracy of ANNs in AD diagnosis. METHODS 37 mild-AD patients and 25 control subjects were enrolled, and APP, ADAM10 and BACE measurements were performed. Fifteen different models of feed-forward and complex-recurrent ANNs (provided by Semeion Research Centre), based on different learning laws (back propagation, sine-net, bi-modal), were compared with linear discriminant analysis (LDA). RESULTS The best ANN model correctly identified mild AD patients in 94% of cases and control subjects in 92%. The corresponding diagnostic performance obtained with LDA was 90% and 73%. CONCLUSION This preliminary study suggests that processing biochemical tests related to the beta-amyloid cascade with ANNs allows very good discrimination of AD in early stages, better than that obtainable with classical statistical methods.
Affiliation(s)
- Monica Di Luca
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
| | - Enzo Grossi
- Medical Department, Bracco Spa, Milan, Italy
| | - Barbara Borroni
- Department of Neurological Sciences, University of Brescia, Italy
| | - Martina Zimmermann
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
| | - Elena Marcello
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
| | - Francesca Colciaghi
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
| | - Fabrizio Gardoni
- Centre of Excellence for Neurodegenerative Disorders and Department of Pharmacological Sciences, University of Milan, Italy
|
22
|
Palsson A, Gibson G. Association between nucleotide variation in Egfr and wing shape in Drosophila melanogaster. Genetics 2005; 167:1187-98. [PMID: 15280234 PMCID: PMC1470961 DOI: 10.1534/genetics.103.021766] [Citation(s) in RCA: 52] [Impact Index Per Article: 2.7] [Indexed: 11/18/2022]
Abstract
As part of an effort to dissect quantitative trait locus effects to the nucleotide level, association was assessed between 238 single-nucleotide and 20 indel polymorphisms spread over 11 kb of the Drosophila melanogaster Egfr locus and nine relative warp measures of wing shape. One SNP in a conserved potential regulatory site for a GAGA factor in the promoter of alternate first exon 2 approaches conservative experiment-wise significance (P < 0.00003) in the sample of 207 lines for association with the location of the crossveins in the central region of the wing. Several other sites indicate marginal association with one or more other aspects of shape. No strong effects of sex or population of origin were detected with measures of shape, but two different sites were strongly associated with overall wing size in interaction with these fixed factors. Whole-gene sequencing in very large samples, rather than selective genotyping, would appear to be the only strategy likely to be successful for detecting subtle associations in species with high polymorphism and little haplotype structure. However, these features severely limit the ability of linkage disequilibrium mapping in Drosophila to resolve quantitative effects to single nucleotides.
Affiliation(s)
- Arnar Palsson
- Department of Genetics, North Carolina State University, Raleigh, North Carolina 27695, USA
|
23
|
Serretti A, Smeraldi E. Neural network analysis in pharmacogenetics of mood disorders. BMC MEDICAL GENETICS 2004; 5:27. [PMID: 15588300 PMCID: PMC539307 DOI: 10.1186/1471-2350-5-27] [Citation(s) in RCA: 40] [Impact Index Per Article: 2.0] [Received: 07/23/2004] [Accepted: 12/09/2004] [Indexed: 01/17/2023]
Abstract
Background The increasing number of available genotypes for genetic studies in humans requires more advanced techniques of analysis. We previously reported significant univariate associations between gene polymorphisms and antidepressant response in mood disorders. However, the combined analysis of multiple gene polymorphisms and clinical variables requires the use of non-linear methods. Methods In the present study we tested a neural network strategy for a combined analysis of two gene polymorphisms. A multilayer perceptron model showed the best performance and was therefore selected over the other networks. One hundred and twenty-one depressed inpatients treated with fluvoxamine in the context of previously reported pharmacogenetic studies were included. The polymorphisms in the transcriptional control region upstream of the 5HTT coding sequence (SERTPR) and in the Tryptophan Hydroxylase (TPH) gene were analysed simultaneously. Results A multilayer perceptron network composed of 1 hidden layer with 7 nodes was chosen. 77.5% of responders and 51.2% of non-responders were correctly classified (ROC area = 0.731; empirical p value = 0.0082). Finally, we performed a comparison with traditional techniques. A discriminant function analysis correctly classified 34.1% of responders and 68.1% of non-responders (F = 8.16, p = 0.0005). Conclusions Overall, our findings suggest that neural networks may be a valid technique for the analysis of gene polymorphisms in pharmacogenetic studies. The complex interactions modelled through neural networks may eventually be applied at the clinical level for individualized therapy.
Affiliation(s)
- Alessandro Serretti
- Istituto Scientifico Universitario Ospedale San Raffaele, Department of Neuropsychiatric Sciences, Milano, Italy
- Università Vita-Salute San Raffaele, School of Medicine, Milano, Italy
| | - Enrico Smeraldi
- Istituto Scientifico Universitario Ospedale San Raffaele, Department of Neuropsychiatric Sciences, Milano, Italy
- Università Vita-Salute San Raffaele, School of Medicine, Milano, Italy
|
24
|
Neale BM, Sham PC. The future of association studies: gene-based analysis and replication. Am J Hum Genet 2004; 75:353-62. [PMID: 15272419 PMCID: PMC1182015 DOI: 10.1086/423901] [Citation(s) in RCA: 473] [Impact Index Per Article: 23.7] [Received: 04/19/2004] [Accepted: 06/21/2004] [Indexed: 11/03/2022]
Abstract
Historically, association tests were limited to single variants, so that the allele was considered the basic unit for association testing. As marker density increases and indirect approaches are used to assess association through linkage disequilibrium, association is now frequently considered at the haplotypic level. We suggest that there are difficulties in replicating association findings at the single-nucleotide-polymorphism (SNP) or the haplotype level, and we propose a shift toward a gene-based approach in which all common variation within a candidate gene is considered jointly. Inconsistencies arising from population differences are more readily resolved by use of a gene-based approach rather than either a SNP-based or a haplotype-based approach. A gene-based approach captures all of the potential risk-conferring variations; thus, negative findings are subject only to the issue of power. In addition, chance findings due to multiple testing can be readily accounted for by use of a genewide-significance level. Meta-analysis procedures can be formalized for gene-based methods through the combination of P values. It is only a matter of time before all variation within genes is mapped, at which point the gene-based approach will become the natural end point for association analysis and will inform our search for functional variants relevant to disease etiology.
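The abstract above mentions formalizing meta-analysis through the combination of P values but does not name a procedure. As a minimal sketch, assuming Fisher's method is an acceptable stand-in (the function name `fisher_combined_p` is hypothetical and not taken from the paper), independent gene-level P values from several studies could be combined as follows:

```python
import math


def fisher_combined_p(p_values):
    """Combine independent P values with Fisher's method (an assumed choice).

    The statistic X = -2 * sum(ln p_i) follows a chi-square distribution
    with 2k degrees of freedom under the null, where k = len(p_values).
    For even degrees of freedom the upper tail has the closed form
    exp(-X/2) * sum_{j=0}^{k-1} (X/2)^j / j!, so no external library is needed.
    """
    if not p_values:
        raise ValueError("at least one P value is required")
    if any(not (0.0 < p <= 1.0) for p in p_values):
        raise ValueError("P values must lie in (0, 1]")

    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # this is X / 2

    # Accumulate the finite series sum_{j<k} (X/2)^j / j! term by term.
    term = 1.0
    series = 1.0
    for j in range(1, k):
        term *= half_stat / j
        series += term
    return math.exp(-half_stat) * series


# Hypothetical gene-level P values from three independent studies.
combined = fisher_combined_p([0.04, 0.10, 0.20])
print(f"combined P = {combined:.4f}")
```

Fisher's method assumes the combined tests are independent, which suits separate study samples; it would not be appropriate for combining correlated SNP-level tests within one gene without adjustment.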
Affiliation(s)
- Benjamin M Neale
- Social, Genetic, and Developmental Psychiatry Centre, Institute of Psychiatry, King's College London, London, United Kingdom
|
25
|
Kang H, Qin ZS, Niu T, Liu JS. Incorporating genotyping uncertainty in haplotype inference for single-nucleotide polymorphisms. Am J Hum Genet 2004; 74:495-510. [PMID: 14966673 PMCID: PMC1182263 DOI: 10.1086/382284] [Citation(s) in RCA: 33] [Impact Index Per Article: 1.7] [Received: 05/01/2003] [Accepted: 12/22/2003] [Indexed: 12/29/2022]
Abstract
The accuracy of the vast amount of genotypic information generated by high-throughput genotyping technologies is crucial in haplotype analyses and linkage-disequilibrium mapping for complex diseases. To date, most automated programs lack quality measures for the allele calls; therefore, human interventions, which are both labor intensive and error prone, have to be performed. Here, we propose a novel genotype clustering algorithm, GeneScore, based on a bivariate t-mixture model, which assigns a set of probabilities for each data point belonging to the candidate genotype clusters. Furthermore, we describe an expectation-maximization (EM) algorithm for haplotype phasing, GenoSpectrum (GS)-EM, which can use probabilistic multilocus genotype matrices (called "GenoSpectrum") as inputs. Combining these two model-based algorithms, we can perform haplotype inference directly on raw readouts from a genotyping machine, such as the TaqMan assay. By using both simulated and real data sets, we demonstrate the advantages of our probabilistic approach over the current genotype scoring methods, in terms of both the accuracy of haplotype inference and the statistical power of haplotype-based association analyses.
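To illustrate the general idea of EM-based haplotype phasing referred to above, here is a deliberately simplified sketch. It is not the GS-EM algorithm: it assumes two biallelic SNPs and hard genotype calls (0, 1 or 2 copies of the minor allele), whereas the paper works with probabilistic genotype matrices. All names are hypothetical.

```python
from itertools import product

# The four possible haplotypes over two biallelic SNPs: (allele at SNP 1, allele at SNP 2).
HAPLOTYPES = [(0, 0), (0, 1), (1, 0), (1, 1)]


def compatible_pairs(genotype):
    """Ordered haplotype pairs whose allele counts add up to the genotype."""
    return [
        (h1, h2)
        for h1, h2 in product(HAPLOTYPES, repeat=2)
        if all(a + b == g for a, b, g in zip(h1, h2, genotype))
    ]


def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies from unphased two-SNP genotypes via EM."""
    if not genotypes:
        raise ValueError("at least one genotype is required")
    freqs = {h: 1.0 / len(HAPLOTYPES) for h in HAPLOTYPES}  # uniform start

    for _ in range(n_iter):
        counts = {h: 0.0 for h in HAPLOTYPES}
        for genotype in genotypes:
            pairs = compatible_pairs(genotype)
            # E-step: posterior weight of each phasing under current frequencies.
            weights = [freqs[h1] * freqs[h2] for h1, h2 in pairs]
            total = sum(weights)
            if total == 0.0:
                continue  # genotype has no support under current estimates
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: renormalize expected haplotype counts into frequencies.
        n = sum(counts.values())
        freqs = {h: c / n for h, c in counts.items()}
    return freqs


# Toy data: two unambiguous homozygotes plus one ambiguous double heterozygote.
estimates = em_haplotype_frequencies([(0, 0), (2, 2), (1, 1)])
for hap, freq in estimates.items():
    print(hap, round(freq, 4))
```

Only the double heterozygote `(1, 1)` is ambiguous in this toy data set; the two homozygotes pull its phasing toward the haplotypes already observed. Extending the E-step to weight each candidate genotype by a call probability would move the sketch closer to the probabilistic inputs the abstract describes.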
Affiliation(s)
- Hosung Kang
- Department of Statistics, Harvard University, Cambridge, MA; Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor; and Division of Preventive Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston
| | - Zhaohui S. Qin
- Department of Statistics, Harvard University, Cambridge, MA; Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor; and Division of Preventive Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston
| | - Tianhua Niu
- Department of Statistics, Harvard University, Cambridge, MA; Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor; and Division of Preventive Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston
| | - Jun S. Liu
- Department of Statistics, Harvard University, Cambridge, MA; Department of Biostatistics, School of Public Health, University of Michigan, Ann Arbor; and Division of Preventive Medicine, Department of Medicine, Brigham and Women’s Hospital, Harvard Medical School, Boston
|