1
|
Multi-trait GWAS for diverse ancestries: mapping the knowledge gap. BMC Genomics 2024; 25:375. [PMID: 38627641 PMCID: PMC11022331 DOI: 10.1186/s12864-024-10293-3] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/13/2023] [Accepted: 04/09/2024] [Indexed: 04/19/2024]
Abstract
BACKGROUND Approximately 95% of samples analyzed in univariate genome-wide association studies (GWAS) are of European ancestry. This bias toward European ancestry populations in association screening also exists for other analyses and methods, which are often developed and tested on European ancestry only. However, existing data in non-European populations, which are often of modest sample size, could benefit from innovative approaches, as recently illustrated in the context of polygenic risk scores. METHODS Here, we extend and assess the potential limitations and gains of our multi-trait GWAS pipeline, JASS (Joint Analysis of Summary Statistics), for the analysis of non-European ancestries. To this end, we conducted the joint GWAS of 19 hematological traits and glycemic traits across five ancestries (European (EUR), admixed American (AMR), African (AFR), East Asian (EAS), and South-East Asian (SAS)). RESULTS We detected 367 new genome-wide significant associations in non-European populations (15 in admixed American (AMR), 72 in African (AFR) and 280 in East Asian (EAS)). New associations detected represent 5%, 17% and 13% of associations in the AFR, AMR and EAS populations, respectively. Overall, multi-trait testing increases the replication of European associated loci in non-European ancestries by 15%. Pleiotropic effects were highly similar at significant loci across ancestries (e.g. the mean correlation between multi-trait genetic effects of EUR and EAS ancestries was 0.88). For hematological traits, strong discrepancies in multi-trait genetic effects are tied to known evolutionary divergences, for example at the ACKR1 locus, which is adaptive against P. vivax-induced malaria. CONCLUSIONS Multi-trait GWAS can be a valuable tool to narrow the genetic knowledge gap between European and non-European populations.
|
2
|
Analyzing microbial evolution through gene and genome phylogenies. Biostatistics 2023:kxad025. [PMID: 37897441 DOI: 10.1093/biostatistics/kxad025] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 02/10/2023] [Revised: 08/15/2023] [Accepted: 08/27/2023] [Indexed: 10/30/2023]
Abstract
Microbiome scientists critically need modern tools to explore and analyze microbial evolution. Often this involves studying the evolution of microbial genomes as a whole. However, different genes in a single genome can be subject to different evolutionary pressures, which can result in distinct gene-level evolutionary histories. To address this challenge, we propose to treat estimated gene-level phylogenies as data objects, and present an interactive method for the analysis of a collection of gene phylogenies. We use a local linear approximation of phylogenetic tree space to visualize estimated gene trees as points in low-dimensional Euclidean space, and address important practical limitations of existing related approaches, allowing an intuitive visualization of complex data objects. We demonstrate the utility of our proposed approach through microbial data analyses, including by identifying outlying gene histories in strains of Prevotella, and by contrasting Streptococcus phylogenies estimated using different gene sets. Our method is available as an open-source R package, and assists with estimating, visualizing, and interacting with a collection of bacterial gene phylogenies.
|
3
|
Priors, population sizes, and power in genome-wide hypothesis tests. BMC Bioinformatics 2023; 24:170. [PMID: 37101120 PMCID: PMC10134629 DOI: 10.1186/s12859-023-05261-9] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 09/14/2022] [Accepted: 03/27/2023] [Indexed: 04/28/2023]
Abstract
BACKGROUND Genome-wide tests, including genome-wide association studies (GWAS) of germ-line genetic variants, driver tests of cancer somatic mutations, and transcriptome-wide association tests of RNAseq data, carry a high multiple testing burden. This burden can be overcome by enrolling larger cohorts or alleviated by using prior biological knowledge to favor some hypotheses over others. Here we compare these two methods in terms of their abilities to boost the power of hypothesis testing. RESULTS We provide a quantitative estimate for progress in cohort sizes and present a theoretical analysis of the power of oracular hard priors: priors that select a subset of hypotheses for testing, with an oracular guarantee that all true positives are within the tested subset. This theory demonstrates that for GWAS, strong priors that limit testing to 100-1000 genes provide less power than typical annual 20-40% increases in cohort sizes. Furthermore, non-oracular priors that exclude even a small fraction of true positives from the tested set can perform worse than not using a prior at all. CONCLUSION Our results provide a theoretical explanation for the continued dominance of simple, unbiased univariate hypothesis tests for GWAS: if a statistical question can be answered by larger cohort sizes, it should be answered by larger cohort sizes rather than by more complicated biased methods involving priors. We suggest that priors are better suited for non-statistical aspects of biology, such as pathway structure and causality, that are not yet easily captured by standard hypothesis tests.
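The trade-off described in this abstract can be explored with a small power calculation. The sketch below is not the paper's analysis: it assumes a two-sided z-test with a Bonferroni threshold, and the effect size, cohort sizes and numbers of tests are hypothetical values chosen only for illustration. It compares shrinking the tested set (a hard prior) against enlarging the cohort by 30%.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def power_two_sided(effect, n, n_tests, alpha=0.05):
    """Power of a two-sided z-test at the Bonferroni threshold alpha / n_tests.

    effect is a hypothetical standardized per-sample effect, so the
    noncentrality parameter is effect * sqrt(n).
    """
    # Critical value for a two-sided test at level alpha / n_tests.
    z_crit = -_STD_NORMAL.inv_cdf(alpha / (2 * n_tests))
    ncp = effect * n ** 0.5
    # Probability that |Z| exceeds z_crit when Z ~ Normal(ncp, 1).
    return _STD_NORMAL.cdf(ncp - z_crit) + _STD_NORMAL.cdf(-ncp - z_crit)


# Toy scenarios (assumed numbers, not taken from the paper).
baseline = power_two_sided(0.02, n=40_000, n_tests=1_000_000)
with_prior = power_two_sided(0.02, n=40_000, n_tests=50_000)
bigger_cohort = power_two_sided(0.02, n=52_000, n_tests=1_000_000)

print(f"baseline={baseline:.3f}")
print(f"hard prior (20x fewer tests)={with_prior:.3f}")
print(f"30% larger cohort={bigger_cohort:.3f}")
```

The prior only enters through the logarithmically slow change in the critical value, whereas cohort size enters through the square root of n in the noncentrality parameter. Varying the toy numbers shows how the two levers compare under other assumptions.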
|
4
|
A role for worm cutl-24 in background- and parent-of-origin-dependent ER stress resistance. BMC Genomics 2022; 23:842. [PMID: 36539699 PMCID: PMC9764823 DOI: 10.1186/s12864-022-09063-w] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 05/16/2022] [Accepted: 12/03/2022] [Indexed: 12/24/2022]
Abstract
BACKGROUND Organisms in the wild can acquire disease- and stress-resistance traits that outstrip the programs endogenous to humans. Finding the molecular basis of such natural resistance characters is a key goal of evolutionary genetics. Standard statistical-genetic methods toward this end can perform poorly in organismal systems that lack high rates of meiotic recombination, like Caenorhabditis worms. RESULTS Here we discovered unique ER stress resistance in a wild Kenyan C. elegans isolate, which in inter-strain crosses was passed by hermaphrodite mothers to hybrid offspring. We developed an unbiased version of the reciprocal hemizygosity test, RH-seq, to explore the genetics of this parent-of-origin-dependent phenotype. Among top-scoring gene candidates from a partial-coverage RH-seq screen, we focused on the neuronally-expressed, cuticlin-like gene cutl-24 for validation. In gene-disruption and controlled crossing experiments, we found that cutl-24 was required in Kenyan hermaphrodite mothers for ER stress tolerance in their inter-strain hybrid offspring; cutl-24 was also a contributor to the trait in purebred backgrounds. CONCLUSIONS These data establish the Kenyan strain allele of cutl-24 as a determinant of a natural stress-resistant state, and they set a precedent for the dissection of natural trait diversity in invertebrate animals without the need for a panel of meiotic recombinants.
|
5
|
Abstract
Genetic studies of human traits have revolutionized our understanding of the variation between individuals, and yet, the genetics of most traits is still poorly understood. In this review, we highlight the major open problems that need to be solved, and by discussing these challenges provide a primer to the field. We cover general issues such as population structure, epistasis and gene-environment interactions, data-related issues such as ancestry diversity and rare genetic variants, and specific challenges related to heritability estimates, genetic association studies, and polygenic risk scores. We emphasize the interconnectedness of these problems and suggest promising avenues to address them.
|
6
|
Abstract
AIMS Coronary artery disease (CAD) has a strong genetic predisposition. However, despite substantial discoveries made by genome-wide association studies (GWAS), a large proportion of heritability awaits identification. Non-additive genetic effects might be responsible for part of the unaccounted genetic variance. Here, we attempted a proof-of-concept study to identify non-additive genetic effects, namely epistatic interactions, associated with CAD. METHODS AND RESULTS We tested for epistatic interactions in 10 CAD case-control studies and UK Biobank, with a focus on 8068 SNPs at 56 loci with known associations with CAD risk. We identified a SNP pair located in cis at the LPA locus, rs1800769 and rs9458001, to be jointly associated with risk for CAD [odds ratio (OR) = 1.37, P = 1.07 × 10^-11], peripheral arterial disease (OR = 1.22, P = 2.32 × 10^-4), aortic stenosis (OR = 1.47, P = 6.95 × 10^-7), hepatic lipoprotein(a) (Lp(a)) transcript levels (beta = 0.39, P = 1.41 × 10^-8), and Lp(a) serum levels (beta = 0.58, P = 8.7 × 10^-32), while individual SNPs displayed no association. Further exploration of the LPA locus revealed a strong dependency of these associations on a rare variant, rs140570886, that was previously associated with Lp(a) levels. We confirmed increased CAD risk for individuals heterozygous (relative OR = 1.46, P = 9.97 × 10^-32) and homozygous (relative OR = 1.77, P = 0.09) for the minor allele of rs140570886. Using forward model selection, we also show that epistatic interactions between rs140570886, rs9458001, and rs1800769 modulate the effects of the rs140570886 risk allele. CONCLUSIONS These results demonstrate the feasibility of a large-scale knowledge-based epistasis scan and provide rare evidence of an epistatic interaction in a complex human disease. We were directed to a variant (rs140570886) influencing risk through additive genetic as well as epistatic effects.
In summary, this study provides deeper insights into the genetic architecture of a locus important for cardiovascular diseases.
|
7
|
Compression for population genetic data through finite-state entropy. J Bioinform Comput Biol 2021; 19:2150026. [PMID: 34590992 DOI: 10.1142/s0219720021500268] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Indexed: 11/18/2022]
Abstract
We improve the efficiency of population genetic file formats and GWAS computation by leveraging the distribution of samples in population-level genetic data. We identify conditional exchangeability of these data, recommending finite state entropy algorithms as an arithmetic code naturally suited for compression of population genetic data. We show between [Formula: see text] and [Formula: see text] speed and size improvements over modern dictionary compression methods that are often used for population genetic data such as Zstd and Zlib in computation and decompression tasks. We provide open source prototype software for multi-phenotype GWAS with finite state entropy compression demonstrating significant space saving and speed comparable to the state-of-the-art.
|
8
|
Modern simulation utilities for genetic analysis. BMC Bioinformatics 2021; 22:228. [PMID: 33941078 PMCID: PMC8091532 DOI: 10.1186/s12859-021-04086-8] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 07/15/2020] [Accepted: 03/17/2021] [Indexed: 11/10/2022]
Abstract
BACKGROUND Statistical geneticists employ simulation to estimate the power of proposed studies, test new analysis tools, and evaluate properties of causal models. Although there are existing trait simulators, there is ample room for modernization. For example, most phenotype simulators are limited to Gaussian traits or traits transformable to normality, while ignoring qualitative traits and realistic, non-normal trait distributions. Also, modern computer languages, such as Julia, that accommodate parallelization and cloud-based computing are now mainstream but rarely used in older applications. To meet the challenges of contemporary big studies, it is important for geneticists to adopt new computational tools. RESULTS We present TraitSimulation, an open-source Julia package that makes it trivial to quickly simulate phenotypes under a variety of genetic architectures. This package is integrated into our OpenMendel suite for easy downstream analyses. Julia was purpose-built for scientific programming and provides tremendous speed and memory efficiency, easy access to multi-CPU and GPU hardware, and to distributed and cloud-based parallelization. TraitSimulation is designed to encourage flexible trait simulation, including via the standard devices of applied statistics, generalized linear models (GLMs) and generalized linear mixed models (GLMMs). TraitSimulation also accommodates many study designs: unrelated individuals, sibships, pedigrees, or a mixture of all three. (Of course, for data with pedigrees or cryptic relationships, the simulation process must include the genetic dependencies among the individuals.) We consider an assortment of trait models and study designs to illustrate integrated simulation and analysis pipelines. Step-by-step instructions for these analyses are available in our electronic Jupyter notebooks on GitHub. These interactive notebooks are ideal for reproducible research. CONCLUSION The TraitSimulation package has three main advantages.
(1) It leverages the computational efficiency and ease of use of Julia to provide extremely fast, straightforward simulation of even the most complex genetic models, including GLMs and GLMMs. (2) It can be operated entirely within, but is not limited to, the integrated analysis pipeline of OpenMendel. And finally (3), by allowing a wider range of more realistic phenotype models, TraitSimulation brings power calculations and diagnostic tools closer to what investigators might see in real-world analyses.
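To make the idea of GLM-based trait simulation concrete, here is a minimal sketch in Python rather than Julia. It does not use or reproduce the TraitSimulation API. The function name, allele frequency and effect sizes are hypothetical, and only unrelated individuals are handled, so no pedigree dependence is modeled.

```python
import math
import random


def simulate_binary_trait(genotypes, intercept, beta, rng):
    """Draw one Bernoulli phenotype per individual under a logistic GLM.

    genotypes holds minor-allele counts (0, 1 or 2) at one hypothetical variant.
    """
    phenotypes = []
    for g in genotypes:
        # Inverse logit link maps the linear predictor to a probability.
        prob = 1.0 / (1.0 + math.exp(-(intercept + beta * g)))
        phenotypes.append(1 if rng.random() < prob else 0)
    return phenotypes


rng = random.Random(2021)  # fixed seed so the run is reproducible
maf = 0.3  # hypothetical minor-allele frequency
# Each genotype is the sum of two independent allele draws (unrelated individuals).
genotypes = [sum(rng.random() < maf for _ in range(2)) for _ in range(1000)]
phenotypes = simulate_binary_trait(genotypes, intercept=-1.0, beta=0.5, rng=rng)
print("simulated prevalence:", sum(phenotypes) / len(phenotypes))
```

Swapping the link function and the sampling distribution (for example, a log link with Poisson draws) gives other non-Gaussian traits in the same way.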
|
9
|
Notes on Three Decades of Methodology Workshops. Behav Genet 2021; 51:170-180. [PMID: 33585974 DOI: 10.1007/s10519-021-10049-9] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.7] [Received: 08/05/2020] [Accepted: 01/27/2021] [Indexed: 01/20/2023]
Abstract
Since 1987, a group of behavior geneticists has been teaching an annual methodology workshop on how to use state-of-the-art methods to analyze genetically informative data. In the early years, the focus was on analyzing twin and family data, using information on their known genetic relatedness to infer the role of genetic and environmental factors in phenotypic variation. With the rapid evolution of genotyping and sequencing technology and the availability of measured genetic data, new methods to detect genetic variants associated with human traits were developed and became the focus of workshop teaching in alternate years. Over the years, many of the methodological advances in the field of statistical genetics have been direct outgrowths of the workshop, as evidenced by the software and methodological publications authored by workshop faculty. We provide data and demographics of workshop attendees and evaluate the impact of the methodology workshops on scientific output in the field by evaluating the number of papers applying specific statistical genetic methodologies authored by individuals who have attended workshops.
|
10
|
Modeling Parent-Specific Genetic Nurture in Families with Missing Parental Genotypes: Application to Birthweight and BMI. Behav Genet 2021; 51:289-300. [PMID: 33454873 DOI: 10.1007/s10519-020-10040-w] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Received: 08/07/2020] [Revised: 11/15/2020] [Accepted: 12/15/2020] [Indexed: 12/25/2022]
Abstract
Disaggregation and estimation of genetic effects from offspring and parents has long been of interest to statistical geneticists. Recently, technical and methodological advances have made the genome-wide and locus-specific estimation of direct offspring and parental genetic nurture effects more feasible. However, unbiased estimation using these methods requires datasets where both parents and at least one child have been genotyped, which are relatively scarce. Our group has recently developed a method and accompanying software (IMPISH; Hwang et al. in PLoS Genet 16:e1009154, 2020) which is able to impute missing parental genotypes from observed data on sibships and estimate their effects on an offspring phenotype conditional on the effects of genetic transmission. However, this method is unable to disentangle maternal and paternal effects, which may differ in magnitude and direction. Here, we introduce an extension to the original IMPISH routine which takes advantage of all available nuclear families to impute parent-specific missing genotypes and obtain asymptotically unbiased estimates of genetic effects on offspring phenotypes. We apply this method to data from related individuals in the UK Biobank, showing concordance with previous estimates of maternal genetic effects on offspring birthweight. We also conduct the first GWAS jointly estimating offspring-, maternal-, and paternal-specific genetic effects on body-mass index.
|
11
|
Abstract
Rice is the most salt-sensitive cereal, suffering yield losses above 50% with soil salinity of 6 dS/m. Thus, understanding the mechanisms of rice salinity tolerance is key to addressing food security. In this chapter, we provide guidelines to assess rice salinity tolerance using a high-throughput phenotyping (HTP) platform with digital imaging at the seedling/early tillering stage and suggest improved analysis methods using stress indices. The protocols described here also include computer scripts for users to improve their experimental design, run genome-wide association studies (GWAS), perform multiple-testing corrections, and obtain the Manhattan plots, enabling the identification of loci associated with salinity tolerance. Notably, the computer scripts provided here can be used for any stress or GWAS experiment and independently of HTP.
|
12
|
Abstract
The identification of genetic variation that directly impacts susceptibility to SARS-CoV-2 infection and the severity of COVID-19 is an important step towards risk stratification, personalized treatment plans, and therapeutic and vaccine development and deployment. Given the importance of study design in infectious disease genetic epidemiology, we use simulation and draw on current estimates of exposure, infectivity, and test accuracy of COVID-19 to demonstrate the feasibility of detecting host genetic factors associated with susceptibility and severity in published COVID-19 study designs. We demonstrate that limited phenotypic data and exposure/infection information in the early stages of the pandemic significantly impact the ability to detect most genetic variants with moderate effect sizes, especially when studying susceptibility to SARS-CoV-2 infection. Our insights can aid in the interpretation of genetic findings emerging in the literature and guide the design of future host genetic studies.
|
13
|
Shannon diversity index: a call to replace the original Shannon's formula with unbiased estimator in the population genetics studies. PeerJ 2020; 8:e9391. [PMID: 32655992 PMCID: PMC7331625 DOI: 10.7717/peerj.9391] [Citation(s) in RCA: 33] [Impact Index Per Article: 8.3] [Received: 12/17/2019] [Accepted: 05/29/2020] [Indexed: 01/25/2023]
Abstract
BACKGROUND The Shannon diversity index has been widely used in population genetics studies. Recently, it was proposed as a unifying measure of diversity at different levels, from genes and populations to whole species and ecosystems. The index, however, was proven to be negatively biased at small sample sizes. Modifications to the original Shannon's formula have been proposed to obtain an unbiased estimator. METHODS In this study, the performance of four different estimators of the Shannon index (the original Shannon's formula and those of Zahl, Chao and Shen, and Chao et al.) was tested on simulated microsatellite data. Both the simulation and the analysis of the results were performed in the R language environment. A new R function was created for the calculation of all four indices from the genind data format. RESULTS Sample size dependence was detected in all the estimators analysed; however, the deviation from parametric values was substantially smaller in the derived measures than in the original Shannon's formula. Error rate was negatively associated with population heterozygosity. Comparisons among loci showed that fast-mutating loci were less affected by the error, except for the original Shannon's estimator which, in the smallest sample, was more strongly affected by loci with a higher number of alleles. The Zahl and Chao et al. estimators performed notably better than the original Shannon's formula. CONCLUSION The results of this study show that the original Shannon index should no longer be used as a measure of genetic diversity and should be replaced by Zahl's unbiased estimator.
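The negative bias of the plug-in formula, and one way to reduce it, can be illustrated with a short sketch. The code below is written in Python rather than R and is not the function described in the abstract. It implements the plug-in Shannon index and a first-order jackknife correction, the general approach usually attributed to Zahl. Whether it matches the paper's implementation in every detail is an assumption, and the allele counts are made up.

```python
import math


def shannon_plugin(counts):
    """Plug-in (maximum-likelihood) Shannon index using natural logarithms."""
    n = sum(counts)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)


def shannon_jackknife(counts):
    """First-order jackknife bias correction of the plug-in estimate."""
    n = sum(counts)
    h_full = shannon_plugin(counts)
    # Removing any one of the c copies of an allele gives the same
    # leave-one-out estimate, so each estimate is weighted by c.
    loo_sum = 0.0
    for i, c in enumerate(counts):
        if c == 0:
            continue
        reduced = list(counts)
        reduced[i] -= 1
        loo_sum += c * shannon_plugin(reduced)
    return n * h_full - (n - 1) * loo_sum / n


allele_counts = [5, 3, 2]  # hypothetical allele counts at one locus
print("plug-in:  ", round(shannon_plugin(allele_counts), 4))
print("jackknife:", round(shannon_jackknife(allele_counts), 4))
```

For this small sample the jackknife value is larger than the plug-in value, consistent with the plug-in estimator underestimating diversity when few alleles are sampled.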
|
14
|
Mathematical Properties of Linkage Disequilibrium Statistics Defined by Normalization of the Coefficient D = p_AB - p_A p_B. Hum Hered 2020; 84:127-143. [PMID: 32045910 DOI: 10.1159/000504171] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.5] [Received: 07/16/2019] [Accepted: 10/10/2019] [Indexed: 11/19/2022]
Abstract
BACKGROUND Many statistics for measuring linkage disequilibrium (LD) take the form of a normalization of the LD coefficient D. Different normalizations produce statistics with different ranges, interpretations, and arguments favoring their use. METHODS Here, to compare the mathematical properties of these normalizations, we consider 5 of these normalized statistics, describing their upper bounds, the mean values of their maxima over the set of possible allele frequency pairs, and the size of the allele frequency regions accessible given specified values of the statistics. RESULTS We produce detailed characterizations of these properties for the statistics d and ρ, analogous to computations previously performed for r^2. We examine the relationships among the statistics, uncovering conditions under which some of them have close connections. CONCLUSION The results contribute insight into LD measurement, particularly the understanding of differences in the features of different LD measures when computed on the same data.
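As a concrete reference point, the following sketch computes the coefficient D from the title together with two widely used normalizations, |D'| and r^2, from haplotype and allele frequencies. The statistics d and ρ studied in the paper are not implemented here, since their definitions are not given in the abstract. The function name and the example frequencies are hypothetical.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (D, |D'|, r^2) for two biallelic loci.

    p_ab is the frequency of the haplotype carrying alleles A and B;
    p_a and p_b are the allele frequencies and must lie strictly in (0, 1).
    """
    d = p_ab - p_a * p_b
    # Largest magnitude D can reach given the allele frequencies and its sign.
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    d_prime = abs(d) / d_max
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    return d, d_prime, r2


# Perfect positive association between two loci with equal allele frequencies.
print(ld_statistics(0.5, 0.5, 0.5))
# A pair where |D'| is 1 but r^2 is well below 1 (unequal allele frequencies).
print(ld_statistics(0.1, 0.1, 0.5))
```

The second call shows why the choice of normalization matters: the same D can be maximal under one normalization and modest under another when allele frequencies differ.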
|
15
|
Age dependent association of inbreeding with risk for schizophrenia in Egypt. Schizophr Res 2020; 216:450-459. [PMID: 31928911 PMCID: PMC8054776 DOI: 10.1016/j.schres.2019.10.039] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.3] [Received: 07/25/2019] [Revised: 10/13/2019] [Accepted: 10/14/2019] [Indexed: 12/27/2022]
Abstract
BACKGROUND Self-reported consanguinity is associated with risk for schizophrenia (SZ) in several inbred populations, but estimates using DNA-based coefficients of inbreeding are unavailable. Further, it is not known whether recessively inherited risk mutations can be identified through homozygosity by descent (HBD) mapping. METHODS We studied self-reported and DNA-based estimates of inbreeding among Egyptian patients with SZ (n = 421, DSM-IV criteria) and adult controls without psychosis (n = 301), who were evaluated using semi-structured diagnostic interview schedules and genotyped using the Illumina Infinium PsychArray. Following quality control checks, coefficients of inbreeding (F) and regions of homozygosity (ROH) were estimated using PLINK software for HBD analysis. Exome sequencing was conducted in selected cases. RESULTS Inbreeding was associated with schizophrenia based on self-reported consanguinity (χ^2 = 4.506, 1 df, p = 0.034) and DNA-based estimates for inbreeding (F); the latter with a significant F × age interaction (β = 32.34, p = 0.0047). The association was most notable among patients older than 40 years. Eleven ROH were over-represented in cases on chromosomes 1, 3, 6, 11, and 14; all but one of these regions are novel for schizophrenia risk. Exome sequencing identified six recessively acting genes in ROH with loss-of-function variants, one of which causes primary hereditary microcephaly. CONCLUSIONS We propose consanguinity as an age-dependent risk factor for SZ in Egypt. HBD mapping is feasible for SZ in adequately powered samples.
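For readers unfamiliar with DNA-based inbreeding coefficients, the sketch below shows a common method-of-moments estimate of F from counts of homozygous genotypes, F = (O - E) / (N - E). This is the general form of the estimate reported by heterozygosity-based tools such as PLINK, but the code is an illustration under simplifying assumptions (Hardy-Weinberg expectations without small-sample corrections), not the study's pipeline. The function names and numbers are hypothetical.

```python
def expected_homozygous(allele_freqs):
    """Expected number of homozygous genotypes under Hardy-Weinberg proportions.

    allele_freqs holds one allele frequency per genotyped biallelic SNP.
    """
    return sum(1.0 - 2.0 * p * (1.0 - p) for p in allele_freqs)


def inbreeding_f(observed_hom, expected_hom, n_genotyped):
    """Method-of-moments inbreeding coefficient F = (O - E) / (N - E)."""
    return (observed_hom - expected_hom) / (n_genotyped - expected_hom)


# Hypothetical individual: 1000 SNPs, each with allele frequency 0.3.
freqs = [0.3] * 1000
expected = expected_homozygous(freqs)  # 580 homozygous calls expected
print(inbreeding_f(observed_hom=622, expected_hom=expected, n_genotyped=len(freqs)))
```

An individual with exactly the expected number of homozygous calls gets F = 0, and excess homozygosity pushes F towards 1.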
|
16
|
Abstract
Testing hypotheses in human populations, then translating such findings into an experimental paradigm to test for causality, can accelerate the rate of therapeutic discovery for many aging-related diseases. Integration of human genomics data has become much more accessible to molecular biologists in recent years due to the explosion of data availability and the wealth of bioinformatic resources, tools, and methods that work together to minimize barriers related to its use. There are specific skill sets that can promote integration of human data into the work of molecular biologists, which include the ability to download, organize, store, and analyze human genomics data. In this chapter, key considerations and resources are presented, focusing on approaches that might be unfamiliar to molecular biologists, with regard to human subjects protection guidelines, heterogeneity in human genetics, data security and storage, programming languages, and training for data analysis.
|
17
|
An evaluation of machine-learning for predicting phenotype: studies in yeast, rice, and wheat. Mach Learn 2019; 109:251-277. [PMID: 32174648 PMCID: PMC7048706 DOI: 10.1007/s10994-019-05848-5] [Citation(s) in RCA: 47] [Impact Index Per Article: 9.4] [Received: 10/28/2015] [Revised: 09/17/2019] [Accepted: 09/19/2019] [Indexed: 11/01/2022]
Abstract
In phenotype prediction, the physical characteristics of an organism are predicted from knowledge of its genotype and environment. Such studies, often called genome-wide association studies, are of the highest societal importance, as they are central to medicine, crop-breeding, etc. We investigated three phenotype prediction problems: one simple and clean (yeast), and the other two complex and real-world (rice and wheat). We compared standard machine learning methods (elastic net, ridge regression, lasso regression, random forest, gradient boosting machines (GBM), and support vector machines (SVM)) with two state-of-the-art classical statistical genetics methods: genomic BLUP and a two-step sequential method based on linear regression. Additionally, using the clean yeast data, we investigated how performance varied with the complexity of the biological mechanism, the amount of observational noise, the number of examples, the amount of missing data, and the use of different data representations. We found that for almost all the phenotypes considered, standard machine learning methods outperformed the methods from classical statistical genetics. On the yeast problem, the most successful method was GBM, followed by lasso regression and the two statistical genetics methods; with greater mechanistic complexity GBM was best, while in simpler cases lasso was superior. In the wheat and rice studies the best two methods were SVM and BLUP. The most robust method in the presence of noise, missing data, etc. was random forests. The classical statistical genetics method of genomic BLUP was found to perform well on problems where there was population structure. This suggests that standard machine learning methods need to be refined to include population structure information when this is present.
We conclude that the application of machine learning methods to phenotype prediction problems holds great promise, but that determining which methods are likely to perform well on any given problem is elusive and non-trivial.
|
18
|
Better-than-chance classification for signal detection. Biostatistics 2019; 22:365-380. [PMID: 31612223 DOI: 10.1093/biostatistics/kxz035] [Citation(s) in RCA: 5] [Impact Index Per Article: 1.0] [Received: 11/09/2018] [Revised: 08/09/2019] [Accepted: 08/14/2019] [Indexed: 11/13/2022]
Abstract
The estimated accuracy of a classifier is a random quantity with variability. A common practice in supervised machine learning is thus to test whether the estimated accuracy is significantly better than chance level. This method of signal detection is particularly popular in neuroimaging and genetics. We provide evidence that using a classifier's accuracy as a test statistic can be an underpowered strategy for finding differences between populations, compared to a bona fide statistical test. It is also computationally more demanding than a statistical test. Via simulation, we compare test statistics that are based on classification accuracy to others based on multivariate test statistics. We find that the probability of detecting differences between two distributions is lower for accuracy-based statistics. We examine several candidate causes for the low power of accuracy-tests. These causes include the discrete nature of the accuracy-test statistic, the type of signal accuracy-tests are designed to detect, their inefficient use of the data, and their suboptimal regularization. When the purpose of the analysis is the evaluation of a particular classifier, not signal detection, we suggest several improvements to increase power. In particular, we suggest replacing V-fold cross-validation with the Leave-One-Out Bootstrap.
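One of the candidate causes listed above, the discreteness of the accuracy statistic, is easy to see in a toy calculation. The sketch below computes an exact one-sided binomial p-value for an observed accuracy against chance. It treats held-out predictions as independent Bernoulli trials, a simplification that cross-validated predictions do not strictly satisfy, and it is not the testing procedure used in the paper. The function name and sample sizes are hypothetical.

```python
from math import comb


def accuracy_p_value(n_correct, n_total, chance=0.5):
    """Exact one-sided binomial p-value: P(X >= n_correct) under chance-level accuracy."""
    return sum(
        comb(n_total, k) * chance ** k * (1 - chance) ** (n_total - k)
        for k in range(n_correct, n_total + 1)
    )


n_test = 20  # hypothetical number of held-out samples
attainable = [accuracy_p_value(k, n_test) for k in range(n_test + 1)]
# With 20 test samples only 21 distinct p-values can ever occur.
print(len(set(attainable)))
# Smallest number of correct predictions that reaches p <= 0.05.
print(min(k for k in range(n_test + 1) if attainable[k] <= 0.05))
```

Because the statistic can only jump between a handful of values, the achieved significance level is generally below the nominal one, which costs power relative to a continuous test statistic.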
|
19
|
Abstract
Genetics provides two major opportunities for understanding human disease: as a transformative line of etiological inquiry and as a biomarker for heritable diseases. In psychiatry, biomarkers are very much needed for both research and treatment, given the heterogeneous populations identified by current phenomenologically based diagnostic systems. To date, however, useful and valid biomarkers have been scant owing to the inaccessibility and complexity of human brain tissue and the consequent lack of insight into disease mechanisms. Genetic biomarkers are therefore especially promising for psychiatric disorders. Genome-wide association studies of common diseases have matured over the last decade, generating the knowledge base for increasingly informative individual-level genetic risk prediction. In this review, we discuss fundamental concepts involved in computing genetic risk with current methods, strengths and weaknesses of various approaches, assessments of utility, and applications to various psychiatric disorders and related traits. Although genetic risk prediction has become increasingly straightforward to apply and common in published studies, there are important pitfalls to avoid. At present, the clinical utility of genetic risk prediction is still low; however, there is significant promise for future clinical applications as the ancestral diversity and sample sizes of genome-wide association studies increase. We discuss emerging data and methods aimed at improving the value of genetic risk prediction for disentangling disease mechanisms and stratifying subjects for epidemiological and clinical studies. For all applications, it is absolutely critical that polygenic risk prediction is applied with appropriate methodology and control for confounding to avoid repeating some mistakes of the candidate gene era.
|
20
|
The causal influence of brain size on human intelligence: Evidence from within-family phenotypic associations and GWAS modeling. Intelligence 2019; 75:48-58. [PMID: 32831433 DOI: 10.1016/j.intell.2019.01.011] [Citation(s) in RCA: 25] [Impact Index Per Article: 5.0] [Indexed: 10/26/2022]
Abstract
There exists a moderate correlation between MRI-measured brain size and the general factor of IQ performance (g), but the question of whether the association reflects a theoretically important causal relationship or spurious confounding remains somewhat open. Previous small studies (n < 100) looking for the persistence of this correlation within families failed to find a tendency for the sibling with the larger brain to obtain a higher test score. We studied the within-family relationship between brain volume and intelligence in the much larger sample provided by the Human Connectome Project (n = 1,022) and found a highly significant correlation (disattenuated ρ = 0.18, p < .001). We replicated this result in the Minnesota Center for Twin and Family Research (n = 2,698), finding a highly significant within-family correlation between head circumference and intelligence (disattenuated ρ = 0.19, p < .001). We also employed novel methods of causal inference relying on summary statistics from genome-wide association studies (GWAS) of head size (n ≈ 10,000) and measures of cognition (257,000 < n < 767,000). Using bivariate LD Score regression, we found a genetic correlation between intracranial volume (ICV) and years of education (EduYears) of 0.41 (p < .001). Using the Latent Causal Variable method, we found a genetic causality proportion of 0.72 (p < .001); thus the genetic correlation arises from an asymmetric pattern, extending to sub-significant loci, of genetic variants associated with ICV also being associated with EduYears but many genetic variants associated with EduYears not being associated with ICV. This is the pattern of genetic results expected from a causal effect of brain size on intelligence. These findings give reason to take up the hypothesis that the dramatic increase in brain volume over the course of human evolution has been the result of natural selection favoring general intelligence.
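The abstract reports disattenuated correlations without stating which correction was applied. A common choice, assumed here purely for illustration, is Spearman's correction for attenuation, which divides the observed correlation by the square root of the product of the two measurement reliabilities. The function name and the example numbers below are hypothetical and are not taken from the study.

```python
import math

def disattenuate(r_observed, reliability_x, reliability_y):
    """Spearman's correction for attenuation: r / sqrt(rel_x * rel_y)."""
    if not (0 < reliability_x <= 1 and 0 < reliability_y <= 1):
        raise ValueError("reliabilities must lie in (0, 1]")
    return r_observed / math.sqrt(reliability_x * reliability_y)

# Hypothetical inputs: an observed within-family correlation and two
# assumed reliabilities (none of these values come from the study).
print(disattenuate(0.15, 0.90, 0.80))
```

Because the divisor is at most 1, the corrected estimate is never smaller in magnitude than the observed one; lower reliabilities produce a larger upward adjustment.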
|
21
|
Modelling strategies for assessing and increasing the effectiveness of new phenotyping techniques in plant breeding. Plant Science: An International Journal of Experimental Plant Biology 2019; 282:23-39. [PMID: 31003609 DOI: 10.1016/j.plantsci.2018.06.018] [Citation(s) in RCA: 73] [Impact Index Per Article: 14.6] [Received: 12/06/2017] [Revised: 06/05/2018] [Accepted: 06/19/2018] [Indexed: 05/18/2023]
Abstract
New types of phenotyping tools generate large amounts of data on many aspects of plant physiology and morphology with high spatial and temporal resolution. These new phenotyping data are potentially useful to improve understanding and prediction of complex traits, like yield, that are characterized by strong environmental context dependencies, i.e., genotype by environment interactions. For an evaluation of the utility of new phenotyping information, we will look at how this information can be incorporated in different classes of genotype-to-phenotype (G2P) models. G2P models predict phenotypic traits as functions of genotypic and environmental inputs. In the last decade, access to high-density single nucleotide polymorphism markers (SNPs) and sequence information has boosted the development of a class of G2P models called genomic prediction models, which predict phenotypes from genome-wide marker profiles. The challenge now is to build G2P models that simultaneously incorporate extensive genomic information and new phenotypic information. Beyond the modification of existing G2P models, new G2P paradigms are required. We present candidate G2P models for the integration of genomic and new phenotyping information and illustrate their use in examples. Special attention will be given to the modelling of genotype by environment interactions. The G2P models provide a framework for model-based phenotyping and the evaluation of the utility of phenotyping information in the context of breeding programs.
|
22
|
Transforming Summary Statistics from Logistic Regression to the Liability Scale: Application to Genetic and Environmental Risk Scores. Hum Hered 2019; 83:210-224. [PMID: 30865946 DOI: 10.1159/000495697] [Citation(s) in RCA: 18] [Impact Index Per Article: 3.6] [Received: 07/28/2018] [Accepted: 11/21/2018] [Indexed: 11/19/2022] Open
Abstract
OBJECTIVE Stratified medicine requires models of disease risk incorporating genetic and environmental factors. These may combine estimates from different studies, and the models must be easily updatable when new estimates become available. The logit scale is often used in genetic and environmental association studies, whereas the liability scale is used for polygenic risk scores and measures of heritability; combining parameters across studies therefore requires a common scale for the estimates. METHODS We present equations to approximate the relationship between univariate effect size estimates on the logit scale and the liability scale, allowing model parameters to be translated between scales. RESULTS These equations are used to build a risk score on the liability scale, using effect size estimates originally estimated on the logit scale. Such a score can then be used in a joint effects model to estimate the risk of disease, and this is demonstrated for schizophrenia using a polygenic risk score and environmental risk factors. CONCLUSION This straightforward method allows the conversion of model parameters between the logit and liability scales and may be a key tool to integrate risk estimates into a comprehensive risk model, particularly for joint models with environmental and genetic risk factors.
|
23
|
Learning the optimal scale for GWAS through hierarchical SNP aggregation. BMC Bioinformatics 2018; 19:459. [PMID: 30497371 PMCID: PMC6267789 DOI: 10.1186/s12859-018-2475-9] [Citation(s) in RCA: 10] [Impact Index Per Article: 1.7] [Received: 09/29/2017] [Accepted: 11/09/2018] [Indexed: 11/16/2022] Open
Abstract
Background Genome-Wide Association Studies (GWAS) seek to identify causal genomic variants associated with rare human diseases. The classical statistical approach for detecting these variants is based on univariate hypothesis testing, with healthy individuals being tested against affected individuals at each locus. Given that an individual’s genotype is characterized by up to one million SNPs, this approach lacks precision, since it may yield a large number of false positives that can lead to erroneous conclusions about genetic associations with the disease. One way to improve the detection of true genetic associations is to reduce the number of hypotheses to be tested by grouping SNPs. Results We propose a dimension-reduction approach which can be applied in the context of GWAS by making use of the haplotype structure of the human genome. We compare our method with standard univariate and group-based approaches on both synthetic and real GWAS data. Conclusion We show that reducing the dimension of the predictor matrix by aggregating SNPs gives a greater precision in the detection of associations between the phenotype and genomic regions.
|
24
|
Contribution of Inbred Singletons to Variance Component Estimation of Heritability and Linkage. Hum Hered 2018; 83:92-99. [PMID: 30391948 DOI: 10.1159/000492830] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Received: 01/31/2018] [Accepted: 08/11/2018] [Indexed: 11/19/2022] Open
Abstract
OBJECTIVES An interesting consequence of consanguinity is that the inbred singleton becomes informative for genetic variance. We determine the contribution of an inbred singleton to variance component analysis of heritability and linkage. METHODS Statistical theory for the power of variance component analysis of quantitative traits is used to determine the expected contribution of an inbred singleton to likelihood-ratio tests of heritability and linkage. RESULTS In variance component models, an inbred singleton contributes relatively little to a test of heritability but can contribute substantively to a test of linkage. For small-to-moderate quantitative trait locus (QTL) effects and a level of inbreeding comparable to matings between first cousins (the preferred form of union in many human populations), an inbred singleton can carry nearly 25% of the information of a non-inbred sib pair. In more highly inbred contexts available with experimental animal populations, nonhuman primate colonies, and some human subpopulations, the contribution of an inbred singleton relative to a sib pair can exceed 50%. CONCLUSIONS Inbred individuals, even in isolation from other members of a sample, can contribute to variance component estimation and tests of heritability and linkage. Under certain conditions, the informativeness of the inbred singleton can approach that of a non-inbred sib pair.
|
25
|
Abstract
Recent developments in human genome genotyping and sequencing technologies, such as genome-wide association studies and whole-genome sequencing analyses, have successfully identified several risk genes of rheumatic diseases. Fine-mapping studies using the HLA imputation method revealed that classical and non-classical HLA genes contribute to the risk of rheumatic diseases. Integration of human disease genomics with biological, medical, and clinical databases should contribute to the elucidation of disease pathogenicity and novel drug discovery. Disease risk genes identified by large-scale genetic studies are considered to be promising resources for novel drug discovery, including drug repositioning and biomarker microRNA screening for rheumatoid arthritis.
|
26
|
Simultaneous detection and estimation of trait associations with genomic phenotypes. Biostatistics 2016; 18:147-164. [PMID: 27496912 DOI: 10.1093/biostatistics/kxw033] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Received: 01/14/2016] [Revised: 05/17/2016] [Accepted: 06/08/2016] [Indexed: 01/09/2023] Open
Abstract
Genomic phenotypes, such as DNA methylation and chromatin accessibility, can be used to characterize the transcriptional and regulatory activity of DNA within a cell. Recent technological advances have made it possible to measure such phenotypes very densely. This density often results in spatial structure, in the sense that measurements at nearby sites are very similar. In this article, we consider the task of comparing genomic phenotypes across experimental conditions, cell types, or disease subgroups. We propose a new method, Joint Adaptive Differential Estimation (JADE), which leverages the spatial structure inherent to genomic phenotypes. JADE simultaneously estimates smooth underlying group average genomic phenotype profiles and detects regions in which the average profile differs between groups. We evaluate JADE's performance in several biologically plausible simulation settings. We also consider an application to the detection of regions with differential methylation between mature skeletal muscle cells, myotubes, and myoblasts.
|
27
|
Abstract
Development of free/libre open source software is usually done by a community of people with an interest in the tool. For scientific software, however, this is less often the case. Most scientific software is written by only a few authors, often a student working on a thesis. Once the paper describing the tool has been published, the tool is no longer developed further and is left to its own devices. Here we describe the broad, multidisciplinary community we formed around a set of tools for statistical genomics. The GenABEL project for statistical omics actively promotes open interdisciplinary development of statistical methodology and its implementation in efficient and user-friendly software under an open source licence. The software tools developed within the project collectively make up the GenABEL suite, which currently consists of eleven tools. The open framework of the project actively encourages involvement of the community in all stages, from formulation of methodological ideas to application of software to specific data sets. A web forum is used to channel user questions and discussions, further promoting the use of the GenABEL suite. Developer discussions take place on a dedicated mailing list, and development is further supported by robust development practices including use of public version control, code review and continuous integration. Use of this open science model attracts contributions from users and developers outside the “core team”, facilitating agile statistical omics methodology development and fast dissemination.
|
28
|
Uncovering the Genetic Architectures of Quantitative Traits. Comput Struct Biotechnol J 2015; 14:28-34. [PMID: 27076877 PMCID: PMC4816193 DOI: 10.1016/j.csbj.2015.10.002] [Citation(s) in RCA: 7] [Impact Index Per Article: 0.8] [Received: 08/10/2015] [Revised: 10/16/2015] [Accepted: 10/23/2015] [Indexed: 01/08/2023] Open
Abstract
The aim of a genome-wide association study (GWAS) is to identify loci in the human genome affecting a phenotype of interest. This review summarizes some recent work on conceptual and methodological aspects of GWAS. The average effect of gene substitution at a given causal site in the genome is the key estimand in GWAS, and we argue for its fundamental importance. Implicit in the definition of average effect is a linear model relating genotype to phenotype. The fraction of the phenotypic variance ascribable to polymorphic sites with nonzero average effects in this linear model is called the heritability, and we describe methods for estimating this quantity from GWAS data. Finally, we show that the theory of compressed sensing can be used to provide a sharp estimate of the sample size required to identify essentially all sites contributing to the heritability of a given phenotype.
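The two quantities defined in this abstract, the average effect at a site and the heritability under a linear genotype-to-phenotype model, can be illustrated with a small simulation. This is a sketch under stated assumptions rather than the review's method: independent sites, normally distributed effects, and all function and parameter names (`simulate_trait`, `h2`, and so on) are hypothetical.

```python
import numpy as np

def simulate_trait(n=5000, m=200, h2=0.5, seed=0):
    """Simulate allele counts and a phenotype from a linear additive model."""
    rng = np.random.default_rng(seed)
    freqs = rng.uniform(0.1, 0.9, size=m)             # allele frequencies
    genotypes = rng.binomial(2, freqs, size=(n, m)).astype(float)
    effects = rng.normal(size=m)                      # per-site effects
    genetic = genotypes @ effects                     # additive genetic value
    # Scale the noise so the genetic share of variance is close to h2.
    noise_sd = np.sqrt(genetic.var() * (1 - h2) / h2)
    phenotype = genetic + rng.normal(scale=noise_sd, size=n)
    return genotypes, effects, genetic, phenotype

def realised_heritability(genetic, phenotype):
    """Fraction of phenotypic variance explained by the genetic value."""
    return genetic.var() / phenotype.var()

def average_effect(genotype_column, phenotype):
    """Least-squares slope of phenotype on allele count at one site."""
    g = genotype_column - genotype_column.mean()
    return (g @ (phenotype - phenotype.mean())) / (g @ g)

genotypes, effects, genetic, phenotype = simulate_trait()
print("realised heritability:", realised_heritability(genetic, phenotype))
print("simulated effect at site 0:", effects[0])
print("estimated average effect at site 0:", average_effect(genotypes[:, 0], phenotype))
```

In real data the genetic value is not observed, which is why heritability has to be estimated from GWAS data; the simulation only makes the definition concrete. The single-site slope is a noisy estimate of the simulated effect because every other site contributes to the residual variance.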
|
29
|
Pathway analysis for RNA-Seq data using a score-based approach. Biometrics 2015; 72:165-74. [PMID: 26259845 DOI: 10.1111/biom.12372] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 08/01/2014] [Revised: 06/01/2015] [Accepted: 06/01/2015] [Indexed: 11/27/2022]
Abstract
A variety of pathway/gene-set approaches have been proposed to provide evidence of higher-level biological phenomena in the association of expression with experimental condition or clinical outcome. Among these approaches, it has been repeatedly shown that resampling methods are far preferable to approaches that implicitly assume independence of genes. However, few approaches have been optimized for the specific characteristics of RNA-Seq transcription data, in which mapped tags produce discrete counts with varying library sizes, and with potential outliers or skewness patterns that violate parametric assumptions. We describe transformations to RNA-Seq data to improve power for linear associations with outcome and flexibly handle normalization factors. Using these transformations or alternate transformations, we apply recently developed null approximations to quadratic form statistics for both self-contained and competitive pathway testing. The approach provides a convenient integrated platform for RNA-Seq pathway testing. We demonstrate that the approach provides appropriate type I error control without actual permutation and is powerful under many settings in comparison to competing approaches. Pathway analysis of data from a study of F344 vs. HIV1Tg rats, and of sex differences in lymphoblastoid cell lines from humans, strongly supports the biological interpretability of the findings.
|
30
|
Heritability estimation of osteoarthritis in the pig-tailed macaque (Macaca nemestrina) with a look toward future data collection. PeerJ 2014; 2:e373. [PMID: 24860700 PMCID: PMC4017820 DOI: 10.7717/peerj.373] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.2] [Received: 03/16/2014] [Accepted: 04/17/2014] [Indexed: 11/20/2022] Open
Abstract
We examine heritability estimation of an ordinal trait for osteoarthritis, using a population of pig-tailed macaques from the Washington National Primate Research Center (WaNPRC). This estimation is non-trivial, as the data consist of ordinal measurements on 16 intervertebral spaces throughout each macaque's spinal cord, with many missing values. We examine the resulting heritability estimates from different model choices, and also perform a simulation study to compare the performance of heritability estimation with these different models under specific known parameter values. Under both the real data analysis and the simulation study, we find that heritability estimates from an assumption of normality of the trait differ greatly from those of ordered probit regression, which considers the ordinality of the trait. This finding indicates that some caution should be observed regarding model selection when estimating heritability of an ordinal quantity. Furthermore, we find evidence that our real data have little information for valid heritability estimation under ordered probit regression. We thus conclude with an exploration of sample size requirements for heritability estimation under this model. For an ordinal trait, an incorrect assumption of normality can lead to severely biased heritability estimation. Sample size requirements for heritability estimation of an ordinal trait under the threshold model depend on the pedigree structure, the trait distribution, and the degree of relatedness among phenotyped individuals. Our sample of 173 monkeys did not have enough information from which to estimate heritability, but estimable heritability can be obtained with as few as 180 related individuals under certain scenarios examined here.
|
31
|
Abstract
Modern case-control studies typically involve the collection of data on a large number of outcomes, often at considerable logistical and monetary expense. These data are of potentially great value to subsequent researchers, who, although not necessarily concerned with the disease that defined the case series in the original study, may want to use the available information for a regression analysis involving a secondary outcome. Because cases and controls are selected with unequal probability, regression analysis involving a secondary outcome generally must acknowledge the sampling design. In this paper, the author presents a new framework for the analysis of secondary outcomes in case-control studies. The approach is based on a careful re-parameterization of the conditional model for the secondary outcome given the case-control outcome and regression covariates, in terms of (a) the population regression of interest of the secondary outcome given covariates and (b) the population regression of the case-control outcome on covariates. The error distribution for the secondary outcome given covariates and case-control status is otherwise unrestricted. For a continuous outcome, the approach sometimes reduces to extending model (a) by including a residual of (b) as a covariate. However, the framework is general in the sense that models (a) and (b) can take any functional form, and the methodology allows for an identity, log or logit link function for model (a).
|