1
Song C, Zhang H. TARV: tree-based analysis of rare variants identifying risk modifying variants in CTNNA2 and CNTNAP2 for alcohol addiction. Genet Epidemiol 2014; 38:552-9. [PMID: 25041903 DOI: 10.1002/gepi.21843] [Citation(s) in RCA: 13] [Impact Index Per Article: 1.3] [Received: 04/07/2014] [Revised: 06/02/2014] [Accepted: 06/16/2014] [Indexed: 12/18/2022]
Abstract
Since the development of next generation sequencing (NGS) technology, researchers have been extending their efforts in genome-wide association studies (GWAS) from common variants to rare variants to find the missing heritability. Although various statistical methods have been proposed to analyze rare variants data, they generally face difficulties with complex disease models involving multiple genes. In this paper, we propose a tree-based analysis of rare variants (TARV) that adopts a nonparametric disease model and is capable of exploring gene-gene interactions. We found that TARV outperforms the sequence kernel association test (SKAT) in most of our simulation scenarios, and by notable margins in some cases. By applying TARV to the Study of Addiction: Genetics and Environment (SAGE) data, we successfully detected gene CTNNA2 and its 43 specific variants that increase the risk of alcoholism in women, with an odds ratio (OR) of 1.94. This gene had not previously been detected in the SAGE data. A post hoc literature search also supports the role of CTNNA2 as a likely risk gene for alcohol addiction. In addition, we detected a plausible protective gene, CNTNAP2, whose 97 rare variants can reduce the risk of alcoholism in women, with an OR of 0.55. These findings suggest that TARV can be effective in dissecting genetic variants for complex diseases using rare variants data.
Affiliation(s)
- Chi Song
- Department of Biostatistics, School of Public Health, Yale University, New Haven, Connecticut, United States of America
2
Amos C, George V, Bailey-Wilson J, Demenais F. George Bonney (1947-2013) Remembered. Genet Epidemiol 2013. [DOI: 10.1002/gepi.21780] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/09/2022]
Affiliation(s)
- Christopher Amos
- Department of Community and Family Medicine, Dartmouth College, Hanover, New Hampshire, United States of America
- Varghese George
- Department of Biostatistics & Epidemiology, Georgia Regents University, Augusta, Georgia, United States of America
- Joan Bailey-Wilson
- Inherited Disease Research Branch, National Human Genome Research Institute, National Institutes of Health, Baltimore, Maryland, United States of America
- Florence Demenais
- INSERM, U946, Genetic Variation and Human Diseases Unit, Paris, France
- Université Paris Diderot, Sorbonne Paris Cité, Institut Universitaire d'Hématologie, Paris, France
3
Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet 2004. [PMID: 15588316 DOI: 10.1186/1471-2156-5-32] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
RESULTS: Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
CONCLUSIONS: In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
Affiliation(s)
- Kathryn L Lunetta
- Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA
4
Lunetta KL, Hayward LB, Segal J, Van Eerdewegh P. Screening large-scale association study data: exploiting interactions using random forests. BMC Genet 2004; 5:32. [PMID: 15588316 PMCID: PMC545646 DOI: 10.1186/1471-2156-5-32] [Citation(s) in RCA: 264] [Impact Index Per Article: 13.2] [Received: 06/25/2004] [Accepted: 12/10/2004] [Indexed: 11/10/2022] Open
Abstract
BACKGROUND: Genome-wide association studies for complex diseases will produce genotypes on hundreds of thousands of single nucleotide polymorphisms (SNPs). A logical first approach to dealing with massive numbers of SNPs is to use some test to screen the SNPs, retaining only those that meet some criterion for further study. For example, SNPs can be ranked by p-value, and those with the lowest p-values retained. When SNPs have large interaction effects but small marginal effects in a population, they are unlikely to be retained when univariate tests are used for screening. However, model-based screens that pre-specify interactions are impractical for data sets with thousands of SNPs. Random forest analysis is an alternative method that produces a single measure of importance for each predictor variable that takes into account interactions among variables without requiring model specification. Interactions increase the importance for the individual interacting variables, making them more likely to be given high importance relative to other variables. We test the performance of random forests as a screening procedure to identify small numbers of risk-associated SNPs from among large numbers of unassociated SNPs using complex disease models with up to 32 loci, incorporating both genetic heterogeneity and multi-locus interaction.
RESULTS: Keeping other factors constant, if risk SNPs interact, the random forest importance measure significantly outperforms the Fisher Exact test as a screening tool. As the number of interacting SNPs increases, the improvement in performance of random forest analysis relative to Fisher Exact test for screening also increases. Random forests perform similarly to the univariate Fisher Exact test as a screening tool when SNPs in the analysis do not interact.
CONCLUSIONS: In the context of large-scale genetic association studies where unknown interactions exist among true risk-associated SNPs or SNPs and environmental covariates, screening SNPs using random forest analyses can significantly reduce the number of SNPs that need to be retained for further study compared to standard univariate screening methods.
Affiliation(s)
- Kathryn L Lunetta
- Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA
- Department of Biostatistics, Boston University School of Public Health, Boston, Massachusetts, USA
- L Brooke Hayward
- Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA
- Jonathan Segal
- Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA
- Genizon BioSciences Inc., Montreal, Quebec, Canada
- Paul Van Eerdewegh
- Oscient Pharmaceuticals, Inc. (formerly Genome Therapeutics Corporation), Waltham, Massachusetts, USA
- Genizon BioSciences Inc., Montreal, Quebec, Canada
- Department of Psychiatry, Harvard Medical School, Boston, Massachusetts, USA
5
Costello TJ, Falk CT, Ye KQ. Data mining and computationally intensive methods: summary of Group 7 contributions to Genetic Analysis Workshop 13. Genet Epidemiol 2004; 25 Suppl 1:S57-63. [PMID: 14635170 DOI: 10.1002/gepi.10285] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.3] [Indexed: 11/07/2022]
Abstract
The Framingham Heart Study data, as well as a related simulated data set, were generously provided to the participants of the Genetic Analysis Workshop 13 in order that newly developed and emerging statistical methodologies could be tested on that well-characterized data set. The impetus driving the development of novel methods is to elucidate the contributions of genes, environment, and interactions between and among them, as well as to allow comparison between and validation of methods. The seven papers that comprise this group used data-mining methodologies (tree-based methods, neural networks, discriminant analysis, and Bayesian variable selection) in an attempt to identify the underlying genetics of cardiovascular disease and related traits in the presence of environmental and genetic covariates. Data-mining strategies are gaining popularity because they are extremely flexible and may have greater efficiency and potential in identifying the factors involved in complex disorders. While the methods grouped together here constitute a diverse collection, some papers asked similar questions with very different methods, while others used the same underlying methodology to ask very different questions. This paper briefly describes the data-mining methodologies applied to the Genetic Analysis Workshop 13 data sets and the results of those investigations.
Affiliation(s)
- Tracy J Costello
- Department of Epidemiology, University of Texas M.D. Anderson Cancer Center, Houston, USA
6
Ghazalpour A, Doss S, Yang X, Aten J, Toomey EM, Van Nas A, Wang S, Drake TA, Lusis AJ. Thematic review series: The pathogenesis of atherosclerosis. Toward a biological network for atherosclerosis. J Lipid Res 2004; 45:1793-805. [PMID: 15292376 DOI: 10.1194/jlr.r400006-jlr200] [Citation(s) in RCA: 36] [Impact Index Per Article: 1.8] [Indexed: 01/14/2023] Open
Abstract
The goal of systems biology is to define all of the elements present in a given system and to create an interaction network between these components so that the behavior of the system, as a whole and in parts, can be explained under specified conditions. The elements constituting the network that influences the development of atherosclerosis could be genes, pathways, transcript levels, proteins, or physiologic traits. In this review, we discuss how the integration of genetics and technologies such as transcriptomics and proteomics, combined with mathematical modeling, may lead to an understanding of such networks.
Affiliation(s)
- Anatole Ghazalpour
- Department of Human Genetics, Molecular Biology Institute, University of California-Los Angeles, Los Angeles, CA 90095-1679, USA
7
Bureau A, Dupuis J, Hayward B, Falls K, Van Eerdewegh P. Mapping complex traits using Random Forests. BMC Genet 2003. [PMID: 14975132 DOI: 10.1186/1471-2156-4-s1-s64] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.1] [Indexed: 04/03/2023] Open
Abstract
Random Forest is a prediction technique based on growing trees on bootstrap samples of data, in conjunction with a random selection of explanatory variables to define the best split at each node. In the case of a quantitative outcome, the tree predictor takes on a numerical value. We applied Random Forest to the first replicate of the Genetic Analysis Workshop 13 simulated data set, with the sibling pairs as our units of analysis and identity by descent (IBD) at selected loci as our explanatory variables. With the knowledge of the true model, we performed two sets of analyses on three phenotypes: HDL, triglycerides, and glucose. The goal was to approach the mapping of complex traits from a multivariate perspective. The first set of analyses mimics a candidate gene approach with a high proportion of true genes among the predictors while the second set represents a genome scan analysis using microsatellite markers. Random Forest was able to identify a few of the major genes influencing the phenotypes, such as baseline HDL and triglycerides, but failed to identify the major genes regulating baseline glucose levels.
Affiliation(s)
- Alexandre Bureau
- Genome Therapeutics Corporation, Waltham, Massachusetts 02453, USA.
8
Abstract
Random Forest is a prediction technique based on growing trees on bootstrap samples of data, in conjunction with a random selection of explanatory variables to define the best split at each node. In the case of a quantitative outcome, the tree predictor takes on a numerical value. We applied Random Forest to the first replicate of the Genetic Analysis Workshop 13 simulated data set, with the sibling pairs as our units of analysis and identity by descent (IBD) at selected loci as our explanatory variables. With the knowledge of the true model, we performed two sets of analyses on three phenotypes: HDL, triglycerides, and glucose. The goal was to approach the mapping of complex traits from a multivariate perspective. The first set of analyses mimics a candidate gene approach with a high proportion of true genes among the predictors while the second set represents a genome scan analysis using microsatellite markers. Random Forest was able to identify a few of the major genes influencing the phenotypes, such as baseline HDL and triglycerides, but failed to identify the major genes regulating baseline glucose levels.
MESH Headings
- Chromosome Mapping/statistics & numerical data
- Chromosomes, Human, Pair 1/genetics
- Chromosomes, Human, Pair 12/genetics
- Chromosomes, Human, Pair 17/genetics
- Chromosomes, Human, Pair 19/genetics
- Chromosomes, Human, Pair 9/genetics
- Computer Simulation/statistics & numerical data
- Genetic Markers/genetics
- Genome, Human
- Humans
- Matched-Pair Analysis
- Microsatellite Repeats/genetics
- Multifactorial Inheritance/genetics
- Multivariate Analysis
- Pedigree
- Phenotype
- Predictive Value of Tests
- Quantitative Trait Loci/genetics
- Quantitative Trait, Heritable
- Siblings
- Software/statistics & numerical data
Affiliation(s)
- Alexandre Bureau
- Genome Therapeutics Corporation, Waltham, Massachusetts, 02453, USA
- Current address: School of Health Sciences, University of Lethbridge, Lethbridge, Alberta, T1K 3M4, Canada
- Josée Dupuis
- Genome Therapeutics Corporation, Waltham, Massachusetts, 02453, USA
- Current address: Department of Biostatistics, Boston University, Boston, Massachusetts, 02215, USA
- Brooke Hayward
- Genome Therapeutics Corporation, Waltham, Massachusetts, 02453, USA
- Kathleen Falls
- Genome Therapeutics Corporation, Waltham, Massachusetts, 02453, USA
- Paul Van Eerdewegh
- Genome Therapeutics Corporation, Waltham, Massachusetts, 02453, USA
- Department of Psychiatry, Harvard Medical School, Boston, Massachusetts, 02115, USA
9
Abstract
Statistical analysis methods for gene mapping originated in counting recombinant and non-recombinant offspring, but have now progressed to sophisticated approaches for the mapping of complex trait genes. Here, we outline new statistical methods that capture the simultaneous effects of multiple gene loci and thereby achieve a more global view of gene action and interaction than is possible by traditional gene-by-gene analysis. We aim to show that the work of statisticians goes far beyond the running of computer programs.
Affiliation(s)
- Josephine Hoh
- Laboratory of Statistical Genetics, Rockefeller University, New York 10021, USA
10
Yeh CB, Leckman JF, Wan FJ, Shiah IS, Lu RB. Characteristics of acute stress symptoms and nitric oxide concentration in young rescue workers in Taiwan. Psychiatry Res 2002; 112:59-68. [PMID: 12379451 DOI: 10.1016/s0165-1781(02)00179-8] [Citation(s) in RCA: 22] [Impact Index Per Article: 1.0] [Indexed: 11/18/2022]
Abstract
Disaster workers as well as victims are at increased risk for acute stress disorder (ASD). The present study was undertaken to study the course of the stress response in a group of 187 young, male military personnel who served as rescue workers for 3 days after an earthquake in central Taiwan. A control group of 83 young, male military personnel who remained on the base was also studied. The initial evaluation took place within 16 days of the earthquake. Participants were interviewed using the Mini International Neuropsychiatric Interview. Thirty-one individuals met DSM-IV criteria for ASD at the initial evaluation. These 31 individuals were interviewed a second time 1 month after the earthquake. Plasma samples were also collected and assayed for nitric oxide (NO). The point prevalence rates of ASD 2 weeks after the earthquake in the initial evaluation were 9% and 16% in the rescue worker and control groups, respectively. At 1 month, the prevalence was substantially lower, in the range of 2-3%. Significant inverse correlations were observed between severity of stress symptoms and the plasma concentration of NO in the rescue worker group (r=-0.36 to -0.64, n=17, P<0.05). We conclude that young military personnel without formal training in rescue operations are at risk for ASD, but their risk appears to be no higher than that in a similarly composed control group of young military personnel. Longitudinal studies with plasma measures of NO are needed to clarify its potential role in the development and course of ASD and related syndromes.
Affiliation(s)
- Chin-bin Yeh
- Department of Psychiatry, Tri-Service General Hospital, Taipei, Taiwan
11
Zhang H, Leckman JF, Pauls DL, Tsai CP, Kidd KK, Campos MR. Genomewide scan of hoarding in sib pairs in which both sibs have Gilles de la Tourette syndrome. Am J Hum Genet 2002; 70:896-904. [PMID: 11840360 PMCID: PMC379118 DOI: 10.1086/339520] [Citation(s) in RCA: 136] [Impact Index Per Article: 6.2] [Received: 10/30/2001] [Accepted: 01/11/2002] [Indexed: 11/04/2022] Open
Abstract
A genome scan of the hoarding phenotype (a component of obsessive-compulsive disorder) was conducted on 77 sib pairs collected by the Tourette Syndrome Association International Consortium for Genetics (TSAICG). All sib pairs were concordant for a diagnosis of Gilles de la Tourette syndrome (GTS). However, the analyses reported here were conducted for hoarding as both a dichotomous trait and a quantitative trait. Not all sib pairs in the sample were concordant for hoarding. Standard linkage analyses were performed using GENEHUNTER and Haseman-Elston methods. In addition, novel analyses with a recursive-partitioning technique were employed. Significant allele sharing was observed for both the dichotomous and the quantitative hoarding phenotypes for markers at 4q34-35 (P=.0007), by use of GENEHUNTER, and at 5q35.2-35.3 (P=.000002) and 17q25 (P=.00002), by use of the revisited Haseman-Elston method. The 4q site is in proximity to D4S1625, which was identified by the TSAICG as a region linked to the GTS phenotype. The recursive-partitioning technique examined multiple markers simultaneously. Results suggest joint effects of specific loci on 5q and 4q, with an overall P value of .000003. Although P values were not adjusted for multiple comparison, nearly all were much smaller than the customary significance level of .0001 for genomewide scans.
MESH Headings
- Alleles
- Behavioral Symptoms/complications
- Behavioral Symptoms/genetics
- Chromosomes, Human/genetics
- Chromosomes, Human, Pair 17/genetics
- Chromosomes, Human, Pair 4/genetics
- Chromosomes, Human, Pair 5/genetics
- Gene Frequency
- Genetic Linkage/genetics
- Genetic Markers/genetics
- Genome, Human
- Humans
- Matched-Pair Analysis
- Nuclear Family
- Obsessive-Compulsive Disorder/complications
- Obsessive-Compulsive Disorder/genetics
- Phenotype
- Quantitative Trait, Heritable
- Software
- Statistics, Nonparametric
- Tourette Syndrome/complications
- Tourette Syndrome/genetics
Affiliation(s)
- Heping Zhang
- Department of Epidemiology and Public Health, Yale Child Study Center, Yale University School of Medicine, 60 College Street, New Haven, CT 06520-8034, USA.