1
|
Benchmarking phasing software with a whole-genome sequenced cattle pedigree. BMC Genomics 2022; 23:130. [PMID: 35164677 PMCID: PMC8845340 DOI: 10.1186/s12864-022-08354-6] [Citation(s) in RCA: 3] [Impact Index Per Article: 1.0] [Reference Citation Analysis] [Abstract] [Key Words] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/28/2021] [Accepted: 01/24/2022] [Indexed: 12/30/2022] Open
Abstract
Background Accurate haplotype reconstruction is required in many applications in quantitative and population genomics. Different phasing methods are available but their accuracy must be evaluated for samples with different properties (population structure, marker density, etc.). We herein took advantage of whole-genome sequence data available for a Holstein cattle pedigree containing 264 individuals, including 98 trios, to evaluate several population-based phasing methods. This data represents a typical example of a livestock population, with low effective population size, high levels of relatedness and long-range linkage disequilibrium. Results After stringent filtering of our sequence data, we evaluated several population-based phasing programs including one or more versions of AlphaPhase, ShapeIT, Beagle, Eagle and FImpute. To that end we used 98 individuals having both parents sequenced for validation. Their haplotypes reconstructed based on Mendelian segregation rules were considered the gold standard to assess the performance of population-based methods in two scenarios. In the first one, only these 98 individuals were phased, while in the second one, all the 264 sequenced individuals were phased simultaneously, ignoring the pedigree relationships. We assessed phasing accuracy based on switch error counts (SEC) and rates (SER), lengths of correctly phased haplotypes and the probability that there is no phasing error between a pair of SNPs as a function of their distance. For most evaluated metrics or scenarios, the best software was either ShapeIT4.1 or Beagle5.2, both methods resulting in particularly high phasing accuracies. For instance, ShapeIT4.1 achieved a median SEC of 50 per individual and a mean haplotype block length of 24.1 Mb (scenario 2). These statistics are remarkable since the methods were evaluated with a map of 8,400,000 SNPs, and this corresponds to only one switch error every 40,000 phased informative markers. When more relatives were included in the data (scenario 2), FImpute3.0 reconstructed extremely long segments without errors. Conclusions We report extremely high phasing accuracies in a typical livestock sample. ShapeIT4.1 and Beagle5.2 proved to be the most accurate, particularly for phasing long segments and in the first scenario. Nevertheless, most tools achieved high accuracy at short distances and would be suitable for applications requiring only local haplotypes. Supplementary Information The online version contains supplementary material available at 10.1186/s12864-022-08354-6.
Collapse
|
2
|
Faux P, Druet T. A strategy to improve phasing of whole-genome sequenced individuals through integration of familial information from dense genotype panels. Genet Sel Evol 2017; 49:46. [PMID: 28511677 PMCID: PMC5434521 DOI: 10.1186/s12711-017-0321-6] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 12/20/2016] [Accepted: 05/05/2017] [Indexed: 11/21/2022] Open
Abstract
Background Haplotype reconstruction (phasing) is an essential step in many applications, including imputation and genomic selection. The best phasing methods rely on both familial and linkage disequilibrium (LD) information. With whole-genome sequence (WGS) data, relatively small samples of reference individuals are generally sequenced due to prohibitive sequencing costs, thus only a limited amount of familial information is available. However, reference individuals have many relatives that have been genotyped (at lower density). The goal of our study was to improve phasing of WGS data by integrating familial information from haplotypes that were obtained from a larger genotyped dataset and to quantify its impact on imputation accuracy. Results Aligning a pre-phased WGS panel [~5 million single nucleotide polymorphisms (SNPs)], which is based on LD information only, to a 50k SNP array that is phased with both LD and familial information (called scaffold) resulted in correctly assigning parental origin for 99.62% of the WGS SNPs, their phase being determined unambiguously based on parental genotypes. Without using the 50k haplotypes as scaffold, that value dropped as expected to 50%. Correctly phased segments were on average longer after alignment to the genotype phase while the number of switches decreased slightly. Most of the incorrectly assigned segments, and subsequent switches, were due to singleton errors. Imputation from 50k SNP array to WGS data with improved phasing had a marginal impact on imputation accuracy (measured as r2), i.e. on average, 90.47% with traditional techniques versus 90.65% with pre-phasing integrating familial information. Differences were larger for SNPs located in chromosome ends and rare variants. Using a denser WGS panel (~13 millions SNPs) that was obtained with traditional variant filtering rules, we found similar results although performances of both phasing and imputation accuracy were lower. Conclusions We present a phasing strategy for WGS data, which indirectly integrates familial information by aligning WGS haplotypes that are pre-phased with LD information only on haplotypes obtained with genotyping data, with both LD and familial information and on a much larger population. This strategy results in very few mismatches with the phase obtained by Mendelian segregation rules. Finally, we propose a strategy to further improve phasing accuracy based on haplotype clusters obtained with genotyping data. Electronic supplementary material The online version of this article (doi:10.1186/s12711-017-0321-6) contains supplementary material, which is available to authorized users.
Collapse
Affiliation(s)
- Pierre Faux
- Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liège, 4000, Liège, Belgium.
| | - Tom Druet
- Unit of Animal Genomics, GIGA-R and Faculty of Veterinary Medicine, University of Liège, 4000, Liège, Belgium
| |
Collapse
|
3
|
|
4
|
Efficiency of haplotype-based methods to fine-map QTLs and embryonic lethal variants affecting fertility: Illustration with a deletion segregating in Nordic Red cattle. Livest Sci 2014. [DOI: 10.1016/j.livsci.2014.04.030] [Citation(s) in RCA: 1] [Impact Index Per Article: 0.1] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/15/2022]
|
5
|
|
6
|
Abstract
We herein present a haplotype-based method to perform genome-wide association studies. The method relies on hidden Markov models to describe haplotypes from a population as a mosaic of a set of ancestral haplotypes. For a given position in the genome, haplotypes deriving from the same ancestral haplotype are also likely to carry the same risk alleles. Therefore, the model can be used in several applications such as haplotype reconstruction, imputation, association studies or genomic predictions. We illustrate then the model with two applications: the fine-mapping of a QTL affecting live weight in cattle and association studies in a stratified cattle population. Both applications show the potential of the method and the high linkage disequilibrium between ancestral haplotypes and causative variants.
Collapse
Affiliation(s)
- Tom Druet
- Fonds National de la Recherche Scientifique, Brussels, Belgium
| | | |
Collapse
|
7
|
Fine-scale mapping of disease susceptibility locus with Bayesian partition model. Genes Genomics 2012. [DOI: 10.1007/s13258-011-0220-0] [Citation(s) in RCA: 0] [Impact Index Per Article: 0] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 10/28/2022]
|
8
|
Zhang Z, Guillaume F, Sartelet A, Charlier C, Georges M, Farnir F, Druet T. Ancestral haplotype-based association mapping with generalized linear mixed models accounting for stratification. ACTA ACUST UNITED AC 2012; 28:2467-73. [PMID: 22711794 DOI: 10.1093/bioinformatics/bts348] [Citation(s) in RCA: 32] [Impact Index Per Article: 2.5] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 12/24/2022]
Abstract
MOTIVATION In many situations, genome-wide association studies are performed in populations presenting stratification. Mixed models including a kinship matrix accounting for genetic relatedness among individuals have been shown to correct for population and/or family structure. Here we extend this methodology to generalized linear mixed models which properly model data under various distributions. In addition we perform association with ancestral haplotypes inferred using a hidden Markov model. RESULTS The method was shown to properly account for stratification under various simulated scenari presenting population and/or family structure. Use of ancestral haplotypes resulted in higher power than SNPs on simulated datasets. Application to real data demonstrates the usefulness of the developed model. Full analysis of a dataset with 4600 individuals and 500 000 SNPs was performed in 2 h 36 min and required 2.28 Gb of RAM. AVAILABILITY The software GLASCOW can be freely downloaded from www.giga.ulg.ac.be/jcms/prod_381171/software. CONTACT francois.guillaume@jouy.inra.fr SUPPLEMENTARY INFORMATION Supplementary data are available at Bioinformatics online.
Collapse
Affiliation(s)
- Z Zhang
- Unit of Animal Genomics, GIGA-R, University of Liège, Liège, Belgium
| | | | | | | | | | | | | |
Collapse
|
9
|
de Roos APW, Schrooten C, Druet T. Genomic breeding value estimation using genetic markers, inferred ancestral haplotypes, and the genomic relationship matrix. J Dairy Sci 2011; 94:4708-14. [PMID: 21854945 DOI: 10.3168/jds.2010-3905] [Citation(s) in RCA: 23] [Impact Index Per Article: 1.6] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 10/06/2010] [Accepted: 05/10/2011] [Indexed: 11/19/2022]
Abstract
With the introduction of new single nucleotide polymorphism (SNP) chips of various densities, more and more genotype data sets will include animals genotyped for only a subset of the SNP. Imputation techniques based on unobserved ancestral haplotypes may be used to infer missing genotypes. These ancestral haplotypes may also be used in the genomic prediction model, instead of using the SNP. This may increase the reliability of predictions because the ancestral haplotype may capture more linkage disequilibrium with quantitative trait loci than SNP. The aim of this paper was to study whether using unobserved ancestral haplotypes in a genomic prediction model would provide more reliable genomic predictions than using SNP, and to determine how many loci in the genomic prediction model would be redundant. Genotypes of 8,960 bulls and cows for 39,557 SNP were analyzed with a hidden Markov model to associate each individual at each locus to 2 ancestral haplotypes. The number of ancestral haplotypes per locus was fixed at 10, 15, or 20. Subsequently, a validation study was performed in which the phenotypes of 3,251 progeny-tested bulls for 16 traits were used in a genomic prediction model to predict the estimated breeding values of at least 753 validation bulls. The squared correlation between genomic prediction and deregressed daughter performance estimated breeding value, when averaged across traits, was slightly higher when 15 or 20 ancestral haplotypes per locus were used in the prediction model instead of the SNP genotypes, whereas the prediction model using a genomic relationship matrix gave the lowest squared correlations. The number of redundant loci [i.e., loci that had less than 18 jumps (0.1%) from one ancestral haplotype to another ancestral haplotype at the next locus], was 18,793 (48%), which means that only 20,764 loci would need to be included in the genomic prediction model. This provides opportunities for greatly decreasing computer requirements of genomic evaluations with very large numbers of markers.
Collapse
|
10
|
He Y, Li C, Amos CI, Xiong M, Ling H, Jin L. Accelerating haplotype-based genome-wide association study using perfect phylogeny and phase-known reference data. PLoS One 2011; 6:e22097. [PMID: 21789217 PMCID: PMC3137625 DOI: 10.1371/journal.pone.0022097] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Reference Citation Analysis] [Abstract] [MESH Headings] [Grants] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/17/2010] [Accepted: 06/17/2011] [Indexed: 11/18/2022] Open
Abstract
The genome-wide association study (GWAS) has become a routine approach for mapping disease risk loci with the advent of large-scale genotyping technologies. Multi-allelic haplotype markers can provide superior power compared with single-SNP markers in mapping disease loci. However, the application of haplotype-based analysis to GWAS is usually bottlenecked by prohibitive time cost for haplotype inference, also known as phasing. In this study, we developed an efficient approach to haplotype-based analysis in GWAS. By using a reference panel, our method accelerated the phasing process and reduced the potential bias generated by unrealistic assumptions in phasing process. The haplotype-based approach delivers great power and no type I error inflation for association studies. With only a medium-size reference panel, phasing error in our method is comparable to the genotyping error afforded by commercial genotyping solutions.
Collapse
Affiliation(s)
- Yungang He
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- * E-mail: (YH); (LJ)
| | - Cong Li
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
| | - Christopher I. Amos
- Department of Epidemiology, University of Texas MD Anderson Cancer Center, Houston, Texas, United States of America
| | - Momiao Xiong
- Human Genetics Center, University of Texas School of Public Health, Houston, Texas, United States of America
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
| | - Hua Ling
- Center for Inherited Disease Research, Johns Hopkins University, Baltimore, Maryland, United States of America
| | - Li Jin
- Department of Computational Genomics, CAS-MPG Partner Institute for Computational Biology, Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences, Shanghai, China
- Key Laboratory of Computational Biology, CAS-MPG Partner Institute for Computational Biology, Chinese Academy of Sciences, Shanghai, China
- State Key Laboratory of Genetic Engineering and Ministry of Education Key Laboratory of Contemporary Anthropology, School of Life Sciences and Institutes of Biomedical Sciences, Fudan University, Shanghai, China
- * E-mail: (YH); (LJ)
| |
Collapse
|
11
|
Lam TY, Meyer IM. Efficient algorithms for training the parameters of hidden Markov models using stochastic expectation maximization (EM) training and Viterbi training. Algorithms Mol Biol 2010; 5:38. [PMID: 21143925 PMCID: PMC3019189 DOI: 10.1186/1748-7188-5-38] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 06/21/2010] [Accepted: 12/09/2010] [Indexed: 11/10/2022] Open
Abstract
Background Hidden Markov models are widely employed by numerous bioinformatics programs used today. Applications range widely from comparative gene prediction to time-series analyses of micro-array data. The parameters of the underlying models need to be adjusted for specific data sets, for example the genome of a particular species, in order to maximize the prediction accuracy. Computationally efficient algorithms for parameter training are thus key to maximizing the usability of a wide range of bioinformatics applications. Results We introduce two computationally efficient training algorithms, one for Viterbi training and one for stochastic expectation maximization (EM) training, which render the memory requirements independent of the sequence length. Unlike the existing algorithms for Viterbi and stochastic EM training which require a two-step procedure, our two new algorithms require only one step and scan the input sequence in only one direction. We also implement these two new algorithms and the already published linear-memory algorithm for EM training into the hidden Markov model compiler HMM-CONVERTER and examine their respective practical merits for three small example models. Conclusions Bioinformatics applications employing hidden Markov models can use the two algorithms in order to make Viterbi training and stochastic EM training more computationally efficient. Using these algorithms, parameter training can thus be attempted for more complex models and longer training sequences. The two new algorithms have the added advantage of being easier to implement than the corresponding default algorithms for Viterbi training and stochastic EM training.
Collapse
|
12
|
Coin LJM, Asher JE, Walters RG, El-Sayed Moustafa JS, de Smith AJ, Sladek R, Balding DJ, Froguel P, Blakemore AIF. cnvHap: an integrative population and haplotype–based multiplatform model of SNPs and CNVs. Nat Methods 2010; 7:541-6. [DOI: 10.1038/nmeth.1466] [Citation(s) in RCA: 42] [Impact Index Per Article: 2.8] [Reference Citation Analysis] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Received: 03/01/2010] [Accepted: 05/05/2010] [Indexed: 11/09/2022]
|
13
|
Su SY, Asher JE, Jarvelin MR, Froguel P, Blakemore AIF, Balding DJ, Coin LJM. Inferring combined CNV/SNP haplotypes from genotype data. ACTA ACUST UNITED AC 2010; 26:1437-45. [PMID: 20406911 DOI: 10.1093/bioinformatics/btq157] [Citation(s) in RCA: 31] [Impact Index Per Article: 2.1] [Reference Citation Analysis] [Abstract] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/17/2023]
Abstract
MOTIVATION Copy number variations (CNVs) are increasingly recognized as an substantial source of individual genetic variation, and hence there is a growing interest in investigating the evolutionary history of CNVs as well as their impact on complex disease susceptibility. CNV/SNP haplotypes are critical for this research, but although many methods have been proposed for inferring integer copy number, few have been designed for inferring CNV haplotypic phase and none of these are applicable at genome-wide scale. Here, we present a method for inferring missing CNV genotypes, predicting CNV allelic configuration and for inferring CNV haplotypic phase from SNP/CNV genotype data. Our method, implemented in the software polyHap v2.0, is based on a hidden Markov model, which models the joint haplotype structure between CNVs and SNPs. Thus, haplotypic phase of CNVs and SNPs are inferred simultaneously. A sampling algorithm is employed to obtain a measure of confidence/credibility of each estimate. RESULTS We generated diploid phase-known CNV-SNP genotype datasets by pairing male X chromosome CNV-SNP haplotypes. We show that polyHap provides accurate estimates of missing CNV genotypes, allelic configuration and CNV haplotypic phase on these datasets. We applied our method to a non-simulated dataset-a region on Chromosome 2 encompassing a short deletion. The results confirm that polyHap's accuracy extends to real-life datasets. AVAILABILITY Our method is implemented in version 2.0 of the polyHap software package and can be downloaded from http://www.imperial.ac.uk/medicine/people/l.coin.
Collapse
Affiliation(s)
- Shu-Yi Su
- Department of Epidemiology and Biostatistics, School of Public Health, Imperial College, London W2 1PG, UK
| | | | | | | | | | | | | |
Collapse
|
14
|
Abstract
In population genetics, it is common to represent the haplotype diversity at a genomic region between multiple populations using well-constructed visual representations. This typically requires the chromosomes from these populations to be grouped according to some definition of haplotypic similarity. Here, we introduce a novel algorithm for clustering haplotypes with the specific aim of addressing haplotype diversity within or between populations. The algorithm allows for missing data in the haplotypes and appropriately downweighs single nucleotide polymorphisms with higher extent of missingness. By identifying the canonical haplotypes in a genomic region, defined as the haplotype forms, which most chromosomes are similar to, the algorithm maps each chromosome to either a unique canonical haplotype or as a mosaic of the identified canonical haplotypes. This mapping can subsequently be utilized for producing graphical visualizations of the haplotype clustering for understanding the extent of haplotype diversity in the region. The clustering application has been implemented in R for distribution as haplosim, and we also provide a visualization script hapvisual for graphical display of the clustering results. The outcome of such analysis can be informative in understanding the extent of haplotype diversity between populations, in addressing the reproducibility of established association signals across multiple populations, and also in the investigation of positive selection in the human genome.
Collapse
Affiliation(s)
- Yik Y Teo
- Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford OX3 7BN, UK.
| | | |
Collapse
|
15
|
Buil A, Martinez-Perez A, Perera-Lluna A, Rib L, Caminal P, Soria JM. A new gene-based association test for genome-wide association studies. BMC Proc 2009; 3 Suppl 7:S130. [PMID: 20017997 PMCID: PMC2795904 DOI: 10.1186/1753-6561-3-s7-s130] [Citation(s) in RCA: 19] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/28/2023] Open
Abstract
Genome-wide association studies are widely used today to discover genetic factors that modify the risk of complex diseases. Usually, these methods work in a SNP-by-SNP fashion. We present a gene-based test that can be applied in the context of genome-wide association studies. We compare both strategies, SNP-based and gene-based, in a sample of cases and controls for rheumatoid arthritis.We obtained different results using each strategy. The SNP-based test found the PTPN22 gene while the gene-based test found the PHF19-TRAF1-C5 region. That suggests that no single strategy performs better than another in all cases and that a certain underlying genetic architecture can be delineated more easily with one strategy rather than with another.
Collapse
Affiliation(s)
- Alfonso Buil
- Unitat de Genomica de Malalties Complexes, Institut de Recerca de l'Hospital de la Santa Creu i Sant Pau, Barcelona, 08025, Spain.
| | | | | | | | | | | |
Collapse
|
16
|
Liu Y, Li YJ, Satten GA, Allen AS, Tzeng JY. A regression-based association test for case-control studies that uses inferred ancestral haplotype similarity. Ann Hum Genet 2009; 73:520-6. [PMID: 19622101 PMCID: PMC2747372 DOI: 10.1111/j.1469-1809.2009.00536.x] [Citation(s) in RCA: 4] [Impact Index Per Article: 0.3] [Reference Citation Analysis] [Abstract] [Key Words] [MESH Headings] [Grants] [Track Full Text] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 01/12/2023]
Abstract
Association methods based on haplotype similarity (HS) can overcome power and stability issues encountered in standard haplotype analyses. Current HS methods can be generally classified into evolutionary and two-sample approaches. We propose a new regression-based HS association method for case-control studies that incorporates covariate information and combines the advantages of the two classes of approaches by using inferred ancestral haplotypes. We first estimate the ancestral haplotypes of case individuals and then, for each individual, an ancestral-haplotype-based similarity score is computed by comparing that individual's observed genotype with the estimated ancestral haplotypes. Trait values are then regressed on the similarity scores. Covariates can easily be incorporated into this regression framework. To account for the bias in the raw p-values due to the use of case data in constructing ancestral haplotypes, as well as to account for variation in ancestral haplotype estimation, a permutation procedure is adopted to obtain empirical p-values. Compared with the standard haplotype score test and the multilocus T(2) test, our method improves power when neither the allele frequency nor linkage disequilibrium between the disease locus and its neighboring SNPs is too low and is comparable in other scenarios. We applied our method to the Genetic Analysis Workshop 15 simulated SNP data and successfully pinpointed a stretch of SNPs that covers the fine-scale region where the causal locus is located.
Collapse
Affiliation(s)
- Youfang Liu
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC
| | - Yi-Ju Li
- Center for Human Genetics, Duke University Medical Center, Durham, NC
- Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC
| | | | - Andrew S. Allen
- Department of Biostatistics and Bioinformatics, Duke University Medical Center, Durham, NC
| | - Jung-Ying Tzeng
- Bioinformatics Research Center, North Carolina State University, Raleigh, NC
| |
Collapse
|
17
|
Li M, Wang K, Grant SFA, Hakonarson H, Li C. ATOM: a powerful gene-based association test by combining optimally weighted markers. ACTA ACUST UNITED AC 2008; 25:497-503. [PMID: 19074959 DOI: 10.1093/bioinformatics/btn641] [Citation(s) in RCA: 41] [Impact Index Per Article: 2.4] [Reference Citation Analysis] [Abstract] [Journal Information] [Subscribe] [Scholar Register] [Indexed: 11/14/2022]
Abstract
BACKGROUND Large-scale candidate-gene and genome-wide association studies genotype multiple SNPs within or surrounding a gene, including both tag and functional SNPs. The immense amount of data generated in these studies poses new challenges to analysis. One particularly challenging yet important question is how to best use all genetic information to test whether a gene or a region is associated with the trait of interest. METHODS Here we propose a powerful gene-based Association Test by combining Optimally Weighted Markers (ATOM) within a genomic region. Due to variation in linkage disequilibrium, different markers often associate with the trait of interest at different levels. To appropriately apportion their contributions, we assign a weight to each marker that is proportional to the amount of information it captures about the trait locus. We analytically derive the optimal weights for both quantitative and binary traits, and describe a procedure for estimating the weights from a reference database such as the HapMap. Compared with existing approaches, our method has several distinct advantages, including (i) the ability to borrow information from an external database to increase power, (ii) the theoretical derivation of optimal marker weights and (iii) the scalability to simultaneous analysis of all SNPs in candidate genes and pathways. RESULTS Through extensive simulations and analysis of the FTO gene in our ongoing genome-wide association study on childhood obesity, we demonstrate that ATOM increases the power to detect genetic association as compared with several commonly used multi-marker association tests.
Collapse
Affiliation(s)
- Mingyao Li
- Department of Biostatistics and Epidemiology, University of Pennsylvania School of Medicine, The Children's Hospital of Philadelphia, Philadelphia, PA 19104, USA.
| | | | | | | | | |
Collapse
|
18
|
Su SY, White J, Balding DJ, Coin LJM. Inference of haplotypic phase and missing genotypes in polyploid organisms and variable copy number genomic regions. BMC Bioinformatics 2008; 9:513. [PMID: 19046436 PMCID: PMC2647950 DOI: 10.1186/1471-2105-9-513] [Citation(s) in RCA: 21] [Impact Index Per Article: 1.2] [Reference Citation Analysis] [Abstract] [Track Full Text] [Download PDF] [Figures] [Journal Information] [Subscribe] [Scholar Register] [Received: 08/26/2008] [Accepted: 12/01/2008] [Indexed: 12/18/2022] Open
Abstract
Background The power of haplotype-based methods for association studies, identification of regions under selection, and ancestral inference, is well-established for diploid organisms. For polyploids, however, the difficulty of determining phase has limited such approaches. Polyploidy is common in plants and is also observed in animals. Partial polyploidy is sometimes observed in humans (e.g. trisomy 21; Down's syndrome), and it arises more frequently in some human tissues. Local changes in ploidy, known as copy number variations (CNV), arise throughout the genome. Here we present a method, implemented in the software polyHap, for the inference of haplotype phase and missing observations from polyploid genotypes. PolyHap allows each individual to have a different ploidy, but ploidy cannot vary over the genomic region analysed. It employs a hidden Markov model (HMM) and a sampling algorithm to infer haplotypes jointly in multiple individuals and to obtain a measure of uncertainty in its inferences. Results In the simulation study, we combine real haplotype data to create artificial diploid, triploid, and tetraploid genotypes, and use these to demonstrate that polyHap performs well, in terms of both switch error rate in recovering phase and imputation error rate for missing genotypes. To our knowledge, there is no comparable software for phasing a large, densely genotyped region of chromosome from triploids and tetraploids, while for diploids we found polyHap to be more accurate than fastPhase. We also compare the results of polyHap to SATlotyper on an experimentally haplotyped tetraploid dataset of 12 SNPs, and show that polyHap is more accurate. Conclusion With the availability of large SNP data in polyploids and CNV regions, we believe that polyHap, our proposed method for inferring haplotypic phase from genotype data, will be useful in enabling researchers analysing such data to exploit the power of haplotype-based analyses.
Collapse
Affiliation(s)
- Shu-Yi Su
- Department of Epidemiology and Public Health, Imperial College, London, W2 1PG, UK.
| | | | | | | |
Collapse
|