1
Rules for resolving Mendelian inconsistencies in nuclear pedigrees typed for two-allele markers. PLoS One 2017; 12:e0172807. [PMID: 28253278 PMCID: PMC5333839 DOI: 10.1371/journal.pone.0172807] [Citation(s) in RCA: 2] [Impact Index Per Article: 0.3] [Received: 12/01/2016] [Accepted: 02/09/2017] [Indexed: 11/18/2022] Open
Abstract
Gene-mapping studies regularly rely on checking for Mendelian transmission of marker alleles in a pedigree as a way of screening for genotyping errors and mutations. For analysis of family data sets, it is usually necessary to resolve or remove the genotyping errors before analysis. At the Center for Inherited Disease Research (CIDR), the data cleaning approach was formalized, to deal with the large-scale data flow, as a set of rules based on PedCheck output. We examine, via carefully designed simulations, how well CIDR's data cleaning rules work in practice. We found that genotype errors in siblings are detected more often than in parents for less polymorphic SNPs, and vice versa for more polymorphic SNPs. Through computer simulations, we conclude that some of CIDR's rules work poorly in some circumstances, and we suggest a set of modified data cleaning rules that may work better than CIDR's rules.
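To make the notion of a Mendelian inconsistency concrete, the following is a minimal sketch of a per-child check in a nuclear family typed at a two-allele marker. It is not PedCheck's algorithm or CIDR's rule set; the genotype coding (count of one allele, `None` for missing) and the function names are assumptions made for illustration.

```python
from itertools import product

# Assumed coding: a genotype is the number of copies of allele "B"
# (0, 1, or 2) at a two-allele marker; None marks a missing genotype.

def transmissible(parent):
    """Return the set of allele counts (0 or 1) a parent can transmit."""
    if parent is None:
        return {0, 1}  # an untyped parent could transmit either allele
    return {0: {0}, 1: {0, 1}, 2: {1}}[parent]

def child_is_consistent(father, mother, child):
    """True if the child's genotype can arise from the parental genotypes."""
    if child is None:
        return True
    possible = {p + m for p, m in product(transmissible(father),
                                          transmissible(mother))}
    return child in possible

def inconsistent_children(father, mother, children):
    """Indices of children whose genotypes violate Mendelian transmission."""
    return [i for i, child in enumerate(children)
            if not child_is_consistent(father, mother, child)]

# Example: father AA (0), mother BB (2); every child must be AB (1).
flagged = inconsistent_children(0, 2, [1, 0, 2, None])
```

In this example the second and third children are flagged. The check alone does not say which family member carries the error, which is the question that data cleaning rules such as CIDR's try to answer.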
2
Cheung CYK, Thompson EA, Wijsman EM. Detection of Mendelian consistent genotyping errors in pedigrees. Genet Epidemiol 2014; 38:291-9. [PMID: 24718985 DOI: 10.1002/gepi.21806] [Citation(s) in RCA: 14] [Impact Index Per Article: 1.4] [Received: 12/24/2013] [Revised: 03/03/2014] [Accepted: 03/04/2014] [Indexed: 11/12/2022]
Abstract
Detection of genotyping errors is a necessary step to minimize false results in genetic analysis. This is especially important when the rate of genotyping errors is high, as has been reported for high-throughput sequence data. To detect genotyping errors in pedigrees, Mendelian inconsistent (MI) error checks exist, as do multi-point methods that flag Mendelian consistent (MC) errors for sparse multi-allelic markers. However, few methods exist for detecting MC genotyping errors, particularly for dense variants on large pedigrees. Here, we introduce an efficient method to detect MC errors even for very dense variants (e.g., SNPs and sequencing data) on pedigrees that may be large. Our method first samples inheritance vectors (IVs) from a moderately sparse but informative set of markers using a Markov chain Monte Carlo-based sampler. Using the sampled IVs, we considered two test statistics to detect MC genotyping errors: the percentage of IVs inconsistent with observed genotypes (A1) or the posterior probability of error configurations (A2). Using simulations, we show that this method, even with the simpler A1 statistic, is effective for detecting MC genotyping errors in dense variants, with sensitivity almost as high as the theoretical best sensitivity possible. We also evaluate the effectiveness of this method as a function of several parameters, including the observed genotype pattern, density of framework markers, error rate, allele frequencies, and number of sampled inheritance vectors. Our approach provides a line of defense against false findings based on the use of dense variants in pedigrees.
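The A1 statistic described above (the percentage of sampled IVs inconsistent with the observed genotypes) can be illustrated with a small sketch. This is not the authors' implementation: it is restricted to a nuclear family and a two-allele marker, it takes the sampled IVs as given rather than running an MCMC sampler, and the data layout and function names are assumptions.

```python
from itertools import product

# Assumed coding: genotypes are counts of allele "B" (0, 1, 2), None if
# missing. An inheritance vector (IV) holds one (p, m) pair per child:
# p picks which of the father's two alleles was transmitted (0 or 1),
# m does the same for the mother.

def iv_is_consistent(iv, father, mother, children):
    """True if some assignment of alleles to the four founder haplotypes
    reproduces every observed genotype under this inheritance vector."""
    for f0, f1, m0, m1 in product((0, 1), repeat=4):
        if father is not None and f0 + f1 != father:
            continue
        if mother is not None and m0 + m1 != mother:
            continue
        fa, mo = (f0, f1), (m0, m1)
        if all(obs is None or fa[p] + mo[m] == obs
               for (p, m), obs in zip(iv, children)):
            return True
    return False

def a1_statistic(sampled_ivs, father, mother, children):
    """Percentage of sampled IVs inconsistent with the observed genotypes."""
    n_bad = sum(not iv_is_consistent(iv, father, mother, children)
                for iv in sampled_ivs)
    return 100.0 * n_bad / len(sampled_ivs)

# Heterozygous father, homozygous mother, two children with genotypes
# 1 and 0: the children must carry different paternal alleles.
ivs = [
    [(0, 0), (0, 0)],  # same paternal allele: inconsistent
    [(0, 0), (1, 0)],  # different paternal alleles: consistent
    [(1, 1), (0, 0)],  # different paternal alleles: consistent
    [(1, 0), (1, 1)],  # same paternal allele: inconsistent
]
a1 = a1_statistic(ivs, father=1, mother=0, children=[1, 0])
```

Here half of the sampled IVs conflict with the genotypes, so `a1` is 50.0. Note that these genotypes pass a plain Mendelian check, so a high A1 value is what would suggest a Mendelian consistent error.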
Affiliation(s)
- Charles Y K Cheung
- Department of Biostatistics, University of Washington, Seattle, Washington, United States of America; Division of Medical Genetics, Department of Medicine, University of Washington, Seattle, Washington, United States of America
3
Wijsman EM. The role of large pedigrees in an era of high-throughput sequencing. Hum Genet 2012; 131:1555-63. [PMID: 22714655 PMCID: PMC3638020 DOI: 10.1007/s00439-012-1190-2] [Citation(s) in RCA: 76] [Impact Index Per Article: 6.3] [Received: 03/26/2012] [Accepted: 06/07/2012] [Indexed: 12/13/2022]
Abstract
Rare variation is the current frontier in human genetics. The large pedigree design is practical, efficient, and well-suited for investigating rare variation. In large pedigrees, specific rare variants that co-segregate with a trait will occur in sufficient numbers so that effects can be measured, and evidence for association can be evaluated, by making use of methods that fully use the pedigree information. Evidence from linkage analysis can focus investigation, both reducing the multiple testing burden and expanding the variants that can be evaluated and followed up, as recent studies have shown. The large pedigree design requires only a small fraction of the sample size needed to identify rare variants of interest in population-based designs, and many highly suitable, well-understood, and available statistical and computational tools already exist. Samples consisting of large pedigrees with existing rich phenotype and genome scan data should be prime candidates for high-throughput sequencing in the search for the determinants of complex traits.
Affiliation(s)
- Ellen M Wijsman
- Department of Biostatistics, University of Washington, Seattle, WA 98195-7720, USA.
4
Markus B, Birk OS, Geiger D. Integration of SNP genotyping confidence scores in IBD inference. Bioinformatics 2011; 27:2880-7. [PMID: 21862568 DOI: 10.1093/bioinformatics/btr486] [Citation(s) in RCA: 11] [Impact Index Per Article: 0.8] [Indexed: 01/22/2023]
Abstract
MOTIVATION: High-throughput single nucleotide polymorphism (SNP) arrays have become the standard platform for linkage and association analyses. The high SNP density of these platforms allows high-resolution identification of ancestral recombination events even for distant relatives many generations apart. However, such inference is sensitive to marker mistyping and current error detection methods rely on the genotyping of additional close relatives. Genotyping algorithms provide a confidence score for each marker call that is currently not integrated in existing methods. There is a need for a model that incorporates this prior information within the standard identical by descent (IBD) and association analyses.
RESULTS: We propose a novel model that incorporates marker confidence scores within IBD methods based on the Lander-Green Hidden Markov Model. The novel parameter of this model is the joint distribution of confidence scores and error status per array. We estimate this probability distribution by applying a modified expectation-maximization (EM) procedure on data from nuclear families genotyped with Affymetrix 250K SNP arrays. The converged tables from two different genotyping algorithms are shown for a wide range of error rates. We demonstrate the efficacy of our method in refining the detection of IBD signals using nuclear pedigrees and distant relatives.
AVAILABILITY: Plinke, a new version of Plink with an extended pairwise IBD inference model allowing per marker error probabilities, is freely available at: http://bioinfo.bgu.ac.il/bsu/software/plinke.
CONTACT: obirk@bgu.ac.il; markusb@bgu.ac.il
SUPPLEMENTARY INFORMATION: Supplementary data are available at Bioinformatics online.
Affiliation(s)
- Barak Markus
- The Morris Kahn Laboratory of Human Genetics, Department of Virology and Developmental Genetics, NIBN, Ben Gurion University, Israel.
5
Chang YPC, Kim JDO, Schwander K, Rao DC, Miller MB, Weder AB, Cooper RS, Schork NJ, Province MA, Morrison AC, Kardia SLR, Quertermous T, Chakravarti A. The impact of data quality on the identification of complex disease genes: experience from the Family Blood Pressure Program. Eur J Hum Genet 2006; 14:469-77. [PMID: 16493446 DOI: 10.1038/sj.ejhg.5201582] [Citation(s) in RCA: 13] [Impact Index Per Article: 0.9] [Indexed: 11/09/2022] Open
Abstract
The application of genome-wide linkage scans to uncover susceptibility loci for complex diseases offers great promise for the risk assessment, treatment, and understanding of these diseases. However, for most published studies, linkage signals are typically modest and vary considerably from one study to another. The multicenter Family Blood Pressure Program has analyzed genome-wide linkage scans of over 12 000 individuals. Based on this experience, we developed a protocol for large linkage studies that reduces two sources of data error: pedigree structure and marker genotyping errors. We then used the linkage signals, before and after data cleaning, to illustrate the impact of missing and erroneous data. A comprehensive error-checking protocol is an important part of complex disease linkage studies and enhances gene mapping. The lack of significant and reproducible linkage findings across studies is, in part, due to data quality.
Affiliation(s)
- Yen-Pei Christy Chang
- McKusick-Nathans Institute of Genetic Medicine, Johns Hopkins University School of Medicine, 733 North Broadway, Baltimore, MD 21205, USA
6
Kennedy J, Măndoiu I, Paşaniuc B. Genotype error detection using Hidden Markov Models of haplotype diversity. J Comput Biol 2008; 15:1155-71. [PMID: 18973433 DOI: 10.1089/cmb.2007.0133] [Citation(s) in RCA: 15] [Impact Index Per Article: 1.0] [Indexed: 11/12/2022] Open
Abstract
The presence of genotyping errors can invalidate statistical tests for linkage and disease association, particularly for methods based on haplotype analysis. Becker et al. have recently proposed a simple likelihood ratio approach for detecting errors in trio genotype data. Under this approach, a SNP genotype is flagged as a potential error if the likelihood associated with the original trio genotype data increases by a multiplicative factor exceeding a user selected threshold when the SNP genotype under test is deleted. In this article we give improved error detection methods using the likelihood ratio test approach in conjunction with likelihood functions that can be efficiently computed based on a Hidden Markov Model of haplotype diversity in the population under study. Experimental results on both simulated and real datasets show that proposed methods have highly scalable running time and achieve significantly improved detection accuracy compared to previous methods.
Affiliation(s)
- Justin Kennedy
- Computer Science and Engineering Department, University of Connecticut, Storrs, Connecticut 06269-2155, USA
7
Liu N, Zhang D, Zhao H. Genotyping error detection in samples of unrelated individuals without replicate genotyping. Hum Hered 2008; 67:154-62. [PMID: 19077433 DOI: 10.1159/000181153] [Citation(s) in RCA: 6] [Impact Index Per Article: 0.4] [Received: 10/08/2007] [Accepted: 05/16/2008] [Indexed: 11/19/2022] Open
Abstract
OBJECTIVE: Identifying genotyping errors is an important issue in genetic research, yet it has been relatively less studied in samples consisting of unrelated individuals. In this article, we consider several models of genotyping errors, which were originally proposed for pedigree data, for unrelated population samples with single nucleotide polymorphism (SNP) genotype data. The mathematical constraints are investigated for detecting genotyping errors without resampling replicates or genotyping relatives.
METHODS: For the various proposed genotyping error models, we unveil the conditions under which the parameters are identifiable. These results are verified through applications to simulated and real SNP data.
RESULTS: We show that, with constraints, two particular models provide both identifiable error rate and allele frequencies of an SNP for unrelated population data. The simulation study shows that these two models present unbiased estimates for the allele frequencies. One of the models also gives an unbiased estimate for the genotyping error rate.
CONCLUSION: While the Hardy-Weinberg equilibrium test can be used to detect genotyping errors, a key advantage of these models is the explicit estimates of genotyping error rates and allele frequencies. This work may help researchers to estimate error rates and to use the estimates in their analysis to increase power and decrease bias, without the extra work of genotyping family members or replicates.
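For context on the Hardy-Weinberg equilibrium test mentioned in the conclusion, here is a minimal sketch of the standard one-degree-of-freedom chi-square version for a single SNP. It is not one of the error models studied in the paper; the function name and argument order are assumptions, and the p-value uses the identity that the upper tail of a chi-square distribution with one degree of freedom equals erfc(sqrt(x / 2)).

```python
import math

def hwe_chi_square(n_aa, n_ab, n_bb):
    """Chi-square test of Hardy-Weinberg proportions for one SNP.

    Takes the observed counts of the three genotypes and returns
    (chi-square statistic, p-value on 1 degree of freedom).
    """
    n = n_aa + n_ab + n_bb
    p = (2 * n_aa + n_ab) / (2 * n)   # estimated frequency of allele A
    q = 1.0 - p
    expected = (n * p * p, 2 * n * p * q, n * q * q)
    observed = (n_aa, n_ab, n_bb)
    # Skip cells with zero expectation (monomorphic marker).
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# Perfect Hardy-Weinberg proportions versus a complete lack of heterozygotes.
balanced = hwe_chi_square(25, 50, 25)
no_hets = hwe_chi_square(50, 0, 50)
```

A strong departure such as the second example flags a marker for review, but unlike the models in this paper the test gives no estimate of the genotyping error rate itself.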
Affiliation(s)
- Nianjun Liu
- Department of Biostatistics, University of Alabama at Birmingham, Birmingham, Ala. 35294, USA.
8
C2 and CFB genes in age-related maculopathy and joint action with CFH and LOC387715 genes. PLoS One 2008; 3:e2199. [PMID: 18493315 PMCID: PMC2374901 DOI: 10.1371/journal.pone.0002199] [Citation(s) in RCA: 71] [Impact Index Per Article: 4.4] [Received: 12/07/2007] [Accepted: 04/11/2008] [Indexed: 11/19/2022] Open
Abstract
Background: Age-related maculopathy (ARM) is a common cause of visual impairment in the elderly populations of industrialized countries and significantly affects the quality of life of those suffering from the disease. Variants within two genes, the complement factor H (CFH) and the poorly characterized LOC387715 (ARMS2), are widely recognized as ARM risk factors. CFH is important in regulation of the alternative complement pathway, suggesting this pathway is involved in ARM pathogenesis. Two other complement pathway genes, the closely linked complement component receptor (C2) and complement factor B (CFB), were recently shown to harbor variants associated with ARM.
Methods/Principal Findings: We investigated two SNPs in C2 and two in CFB in independent case-control and family cohorts of white subjects and found rs547154, an intronic SNP in C2, to be significantly associated with ARM in both our case-control (P-value 0.00007) and family data (P-value 0.00001). Logistic regression analysis suggested that accounting for the effect at this locus significantly (P-value 0.002) improves the fit of a genetic risk model of CFH and LOC387715 effects only. Modeling with the generalized multifactor dimensionality reduction method showed that adding C2 to the two-factor model of CFH and LOC387715 increases the sensitivity (from 63% to 73%). However, the balanced accuracy increases only from 71% to 72%, and the specificity decreases from 80% to 72%.
Conclusions/Significance: C2/CFB significantly influences AMD susceptibility and although accounting for effects at this locus does not dramatically increase the overall accuracy of the genetic risk model, the improvement over the CFH-LOC387715 model is statistically significant.
9
Ahn K, Haynes C, Kim W, Fleur RS, Gordon D, Finch SJ. The Effects of SNP Genotyping Errors on the Power of The Cochran-Armitage Linear Trend Test for Case/Control Association Studies. Ann Hum Genet 2007; 71:249-61. [PMID: 17096677 DOI: 10.1111/j.1469-1809.2006.00318.x] [Citation(s) in RCA: 30] [Impact Index Per Article: 1.8] [Indexed: 11/30/2022]
Abstract
The questions addressed in this paper are: What single nucleotide polymorphism (SNP) genotyping errors are most costly, in terms of minimum sample size necessary (MSSN) to maintain constant asymptotic power and significance level, when performing case-control studies of genetic association applying the Cochran-Armitage trend test? And which trend test or χ² test is more powerful under standard genetic models with genotyping errors? Our strategy is to expand the non-centrality parameter of the asymptotic distribution of the trend test to approximate the MSSN using a Taylor series linear in the genotyping error rates. We apply our strategy to example scenarios that assume recessive, dominant, additive, or over-dominant disease models. The most costly errors are recording the more common homozygote as the less common homozygote, and the more common homozygote as the heterozygote, with MSSN that become indefinitely large as the minor SNP allele frequency approaches zero. Misclassifying the heterozygote as the less common homozygote is costly when using the recessive trend test on data from a recessive model. The χ² test has power close to, but less than, the optimal trend test and is never dominated over all genetic models studied by any specific trend test.
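As background for the test whose power is studied here, the following is a minimal sketch of the Cochran-Armitage trend statistic for a 2 x 3 case-control genotype table. It does not reproduce the paper's MSSN expansion or its error models; the function name and the default additive weights (0, 1, 2) are assumptions, with (0, 0, 1) and (0, 1, 1) being the usual choices for recessive and dominant versions.

```python
import math

def cochran_armitage_trend(cases, controls, weights=(0, 1, 2)):
    """Cochran-Armitage trend test for a 2 x 3 genotype table.

    cases and controls hold the counts of the three genotypes in the
    same order as weights. Returns (chi-square statistic, p-value on
    1 degree of freedom).
    """
    totals = [r + s for r, s in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    sum_tr = sum(t * r for t, r in zip(weights, cases))
    sum_tn = sum(t * c for t, c in zip(weights, totals))
    sum_t2n = sum(t * t * c for t, c in zip(weights, totals))
    # chi2 = N (N sum(t r) - R sum(t n))^2 / (R S (N sum(t^2 n) - sum(t n)^2))
    numerator = n * (n * sum_tr - n_cases * sum_tn) ** 2
    denominator = n_cases * n_controls * (n * sum_t2n - sum_tn ** 2)
    chi2 = numerator / denominator
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

# No difference between groups versus complete separation by genotype.
null_result = cochran_armitage_trend((10, 20, 30), (10, 20, 30))
extreme_result = cochran_armitage_trend((0, 0, 10), (10, 0, 0))
```

With complete separation the statistic equals the total sample size (20 here), its maximum. Genotyping errors move counts between columns of the table, which is how they reduce the statistic and raise the sample size needed to keep power constant.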
Affiliation(s)
- Kwangmi Ahn
- Department of Applied Mathematics and Statistics, Stony Brook University, Stony Brook, NY 11794-3600, USA
10
Becker T, Valentonyte R, Croucher PJP, Strauch K, Schreiber S, Hampe J, Knapp M. Identification of probable genotyping errors by consideration of haplotypes. Eur J Hum Genet 2006; 14:450-8. [PMID: 16435001 DOI: 10.1038/sj.ejhg.5201565] [Citation(s) in RCA: 14] [Impact Index Per Article: 0.8] [Indexed: 11/09/2022] Open
Abstract
Undetected genotyping errors pose a problem in genetic epidemiological studies, as they may invalidate statistical analysis or reduce its power. Haplotype analysis requires an improved standard of the data, because a haplotype can be inferred correctly only if the genotypes of all its markers are correct. Here, we present a method that identifies probable genotyping errors in trio samples with the help of the estimated haplotype frequency distribution of the sample. If the likelihood of the most likely haplotype explanation depends strongly on just one genotype, in the sense that setting the genotype to be missing leads to a much more likely haplotype explanation, this genotype is considered as a potential genotyping error. We describe a method that systematically searches the whole data set for such potential errors. Based on the haplotype distribution of a real data set, we carry out a simulation study to estimate the sensitivity and specificity of the method. In addition, we apply our approach to the real data set itself. Potentially erroneous genotypes are re-determined via sequencing. The results of both the simulation study and of the application to the real data set show that a considerable proportion of true genotyping errors is detected and that the number of false-positive signals is acceptable. We conclude that it is indeed possible to identify probable genotyping errors by considering haplotypes. The method described here will be part of the next release of our FAMHAP software.
Affiliation(s)
- Tim Becker
- Institute for Medical Biometry, Informatics and Epidemiology, University of Bonn, Sigmund-Freud-Strasse 25, D-53105 Bonn, Germany
11
Jakobsdottir J, Conley YP, Weeks DE, Mah TS, Ferrell RE, Gorin MB. Susceptibility genes for age-related maculopathy on chromosome 10q26. Am J Hum Genet 2005; 77:389-407. [PMID: 16080115 PMCID: PMC1226205 DOI: 10.1086/444437] [Citation(s) in RCA: 393] [Impact Index Per Article: 20.7] [Received: 06/07/2005] [Accepted: 06/29/2005] [Indexed: 11/03/2022] Open
Abstract
On the basis of genomewide linkage studies of families affected with age-related maculopathy (ARM), we previously identified a significant linkage peak on 10q26, which has been independently replicated by several groups. We performed a focused SNP genotyping study of our families and an additional control cohort. We identified a strong association signal overlying three genes, PLEKHA1, LOC387715, and PRSS11. All nonsynonymous SNPs in this critical region were genotyped, yielding a highly significant association (P < .00001) between PLEKHA1/LOC387715 and ARM. Although it is difficult to determine statistically which of these two genes is most important, SNPs in PLEKHA1 are more likely to account for the linkage signal in this region than are SNPs in LOC387715; thus, this gene and its alleles are implicated as an important risk factor for ARM. We also found weaker evidence supporting the possible involvement of the GRK5/RGS10 locus in ARM. These associations appear to be independent of the association of ARM with the Y402H allele of complement factor H, which has previously been reported as a major susceptibility factor for ARM. The combination of our analyses strongly implicates PLEKHA1/LOC387715 as primarily responsible for the evidence of linkage of ARM to the 10q26 locus and as a major contributor to ARM susceptibility. The association of either a single or a double copy of the high-risk allele within the PLEKHA1/LOC387715 locus accounts for an odds ratio of 5.0 (95% confidence interval 3.2-7.9) for ARM and a population attributable risk as high as 57%.
Affiliation(s)
- Johanna Jakobsdottir, Yvette P. Conley, Daniel E. Weeks, Tammy S. Mah, Robert E. Ferrell, Michael B. Gorin
- Departments of Biostatistics and Human Genetics, Graduate School of Public Health, Department of Health Promotion and Development, School of Nursing, and UPMC Eye Center, Department of Ophthalmology, School of Medicine, University of Pittsburgh, Pittsburgh